Analysis and Improvement of Multiple Optimal Learning Factors for Feed-Forward Networks

Praveen Jesudhas, Michael T. Manry, Rohit Rawat, and Sanjeev Malalur
Abstract—The effects of transforming the net function
vector in the multilayer perceptron are analyzed. The use
of optimal diagonal transformation matrices on the net
function vector is proved to be equivalent to training the
network using multiple optimal learning factors
(MOLF). A method for linearly compressing large ill-
conditioned MOLF Hessian matrices into smaller well-
conditioned ones is developed. This compression
approach is shown to be equivalent to using several
hidden units per learning factor. The technique is
extended to large networks. In simulations, the proposed
algorithm performs almost as well as the Levenberg
Marquardt algorithm with the computational complexity
of a first order training algorithm.
I. INTRODUCTION
Multilayer perceptrons (MLPs) have found wide
acceptance in a variety of applications ranging from
parameter estimation [1][2], document analysis and
recognition [3], finance, manufacturing [4] and data mining
[5]. The dynamically tuned basis functions of the MLP along
with its approximation capabilities [6] [7] make it an
effective tool for classification and approximation. The
relatively short evaluation time makes it a better choice than
many other classification and approximation methods such
as support vector machines (SVM) [8]. The universal
approximation [9] property of the MLP along with its ability
to mimic Bayes discriminant [10], optimal L2 norm
estimates [11] and maximum a-posteriori (MAP) [12]
estimates give it a strong theoretical foundation.
Despite these advantages, the MLP is limited by the lack
of scalable and effective training algorithms.
Training algorithms for the MLP can be broadly classified as
first order and second order algorithms. First order algorithms like
back-propagation (BP) [13] and output weight optimization-
backpropagation (OWO-BP) require fewer multiplies and
data passes and hence take less time per iteration. However,
they are sensitive to input means and gains [14]. Note that
conjugate gradient (CG) [15] has the same sensitivity; CG
can be considered first order because the MLP error function is not
quadratic. Second order methods, on the other hand, are not
as sensitive to input means and gains, but they do have their
own inherent limitations. Second order algorithms related to
Manuscript received February 10, 2011.
P. Jesudhas, M. Manry, and R. Rawat are with the University of Texas at
Arlington, TX 76010, USA (e-mail: manry@uta.edu).
S. Malalur is with O-Geo Well LLC in Arlington, Texas (e-mail:
sanjeev.malalur@gmail.com).
Newton’s method often have non-positive definite or
singular Hessian matrices [16][17] which result in unstable
training. Hence the Levenberg-Marquardt (LM) algorithm
[18][19] is used instead. The LM algorithm is
computationally intensive due to the large size of its
Hessian matrix, so its use in training large networks is often
limited.
In [20] a two-stage training algorithm known as multiple
optimal learning factors (MOLF-BP) is found to produce
results comparable to the LM algorithm with computations
similar to first order training algorithms. However, the
MOLF-BP approach still has certain limitations. The MOLF
Hessian can be ill-conditioned. The size of the MOLF
Hessian matrix Hmolf can also become prohibitively large for
larger networks, resulting in scalability problems.
This paper gives a strong theoretical foundation to the
basic MOLF algorithm by relating it to optimally
transforming the net function vector of the MLP. For large
networks, the scalability issues of the basic MOLF algorithm
are dealt with by reducing the size of the MOLF Hessian matrix
via a reduction in the number of optimal learning factors.
The organization of this paper is as follows. Section II
starts with an introduction to the MLP notation along with
the basic back-propagation training of the MLP. Section III
reviews the MOLF procedure in detail. An
analysis of the existing MOLF procedure using equivalent
networks is provided in Section IV. The idea of collapsing
the MOLF Hessian matrix, along with its advantages, is
explained in Section V. Section VI presents experimental
results.
II. THE MULTILAYER PERCEPTRON
A. Notation
In the fully connected MLP of Fig. 1, input weights
w(k,n) connect the nth
input to the kth
hidden unit. Output
weights woh(m,k) connect the kth
hidden unit’s activation
op(k) to the mth
output yp(m), which has a linear activation.
The bypass weight woi(m,n) connects the nth
input to the mth
output. The training data, described by the set {xp,tp}
consists of N-dimensional input vectors xp and M-
dimensional desired output vectors, tp. The pattern number p
varies from 1 to Nv, where Nv denotes the number of training
vectors present in the data set.
In order to handle thresholds in the hidden and output
layers, the input vectors are augmented by an extra element
as xp = [xp(1), xp(2),…., xp(N+1)]T
where xp(N+1) = 1. Let
Nh denote the number of hidden units. The dimensions of the
weight matrices W, Woh and Woi are respectively Nh by
(N+1), M by Nh and M by (N+1).

Fig. 1. A fully connected Multilayer Perceptron

Note that bypass weights woi relieve the hidden units of the burden of approximating
the linear part of the desired mapping. This reduces the
necessary number of hidden units. The vector of hidden
layer net functions, np and the actual output of the network,
yp can be written as
$$\mathbf{n}_p = \mathbf{W}\cdot\mathbf{x}_p, \qquad \mathbf{y}_p = \mathbf{W}_{oi}\cdot\mathbf{x}_p + \mathbf{W}_{oh}\cdot\mathbf{o}_p \tag{1}$$
where the kth
element of the hidden unit activation vector op
is calculated as op(k) = f(np(k)) and f(.) denotes the hidden
layer activation function. Training an MLP typically
involves minimizing the mean squared error between the
desired and the actual network outputs, defined as
$$E = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]^2 = \frac{1}{N_v}\sum_{p=1}^{N_v}E_p, \qquad E_p = \sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]^2 \tag{2}$$
where Ep is the cumulative squared error for pattern p.
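As a concrete illustration of the notation, the sketch below evaluates the forward pass of equation (1) and the error of equation (2) with NumPy. The array names, the random data, and the sigmoid choice for f(·) are assumptions made for illustration; the paper does not fix a particular hidden activation.

```python
import numpy as np

def forward(W, Woi, Woh, X):
    """Forward pass of the fully connected MLP with bypass weights (eq. 1).

    X   : Nv x (N+1) matrix of augmented input vectors, last column all ones
    W   : Nh x (N+1) input weight matrix
    Woi : M  x (N+1) bypass weight matrix
    Woh : M  x Nh    output weight matrix
    """
    Np = X @ W.T                    # net function vectors n_p, shape Nv x Nh
    Op = 1.0 / (1.0 + np.exp(-Np))  # hidden activations o_p(k) = f(n_p(k)); sigmoid assumed
    Y = X @ Woi.T + Op @ Woh.T      # y_p = Woi x_p + Woh o_p
    return Np, Op, Y

def mse(T, Y):
    """Error E of eq. (2): per-pattern squared error summed over outputs, averaged over patterns."""
    return np.sum((T - Y) ** 2) / T.shape[0]

# Small random example: N = 4 inputs, Nh = 3 hidden units, M = 2 outputs, Nv = 10 patterns.
rng = np.random.default_rng(0)
N, Nh, M, Nv = 4, 3, 2, 10
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])
T = rng.normal(size=(Nv, M))
W = rng.normal(size=(Nh, N + 1))
Woi = rng.normal(size=(M, N + 1))
Woh = rng.normal(size=(M, Nh))
print("E =", mse(T, forward(W, Woi, Woh, X)[2]))
```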
B. Discussion of Back-Propagation
BP training [13, 22, 23, 24] is a gradient based learning
algorithm which involves changing system parameters so
that the output error is reduced. For a MLP, the expression
for actual output found from the input is given by equation
(1). The weights of the MLP are changed by BP so that the
error E of equation (2), which computes the sum of the
squared errors between the actual and desired outputs,
decreases. The weights are updated using gradients of the
error with respect to the corresponding weights.
The required negative error gradients are calculated using
the chain rule utilizing delta functions, which are negative
gradients of Ep with respect to net functions. For the pth
pattern, output and hidden layer delta functions [13][21] are
respectively found as,
$$\delta_{po}(i) = 2\bigl[t_p(i)-y_p(i)\bigr], \qquad \delta_p(k) = f'\!\bigl(n_p(k)\bigr)\sum_{i=1}^{M}\delta_{po}(i)\,w_{oh}(i,k) \tag{3}$$
Now, the negative gradient of E with respect to w(k,n) is,
$$g(k,n) = \frac{-\partial E}{\partial w(k,n)} = \frac{1}{N_v}\sum_{p=1}^{N_v}\delta_p(k)\,x_p(n) \tag{4}$$
The matrix of negative partial derivatives can therefore be
written as,
$$\mathbf{G} = \frac{1}{N_v}\sum_{p=1}^{N_v}\boldsymbol{\delta}_p\,\mathbf{x}_p^T \tag{5}$$

where $\boldsymbol{\delta}_p = [\delta_p(1),\,\delta_p(2),\,\ldots,\,\delta_p(N_h)]^T$.
If steepest descent is used to modify the hidden weights,
W is updated in a given iteration as,
$$\mathbf{W} \leftarrow \mathbf{W} + \Delta\mathbf{W}, \qquad \Delta\mathbf{W} = z\cdot\mathbf{G} \tag{6}$$
where z is the learning factor, which can be obtained
optimally from the expression,
$$z = \frac{-\partial E/\partial z}{\partial^2 E/\partial z^2} \tag{7}$$
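A minimal sketch of equations (3) through (6), reusing the `forward` helper from the previous listing: output deltas, hidden deltas, and the negative gradient matrix G are accumulated over all patterns in vectorized form. The sigmoid derivative is an assumption carried over from that sketch.

```python
def input_weight_gradient(W, Woi, Woh, X, T):
    """Negative gradient G of E with respect to the input weights W (eqs. 3-5)."""
    Nv = X.shape[0]
    Np, Op, Y = forward(W, Woi, Woh, X)
    delta_po = 2.0 * (T - Y)             # output deltas, eq. (3), shape Nv x M
    fprime = Op * (1.0 - Op)             # f'(n_p(k)) for the assumed sigmoid
    delta_p = fprime * (delta_po @ Woh)  # hidden deltas, eq. (3), shape Nv x Nh
    return (delta_p.T @ X) / Nv          # eq. (5): (1/Nv) sum_p delta_p x_p^T
```

A single learning factor z, obtained for example from a numerical line search approximating equation (7), then gives the steepest descent update of equation (6) as `W += z * G`.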
III. REVIEW OF MOLF ALGORITHM
The MOLF algorithm makes use of the BP procedure to
update the input layer weights W, where a different learning
factor zk is used to update weights feeding into the kth
hidden
unit. The output weights Woh and bypass weights Woi are
found using output weight optimization (OWO).
Output weight optimization (OWO) is a technique for
finding weights connected to the outputs of the network.
Since the outputs have linear activations, finding the weights
connected to the outputs is equivalent to solving a system of
linear equations. The expression for the outputs given in (1)
can be re-written as
$$\mathbf{y}_p = \mathbf{W}_o\cdot\tilde{\mathbf{x}}_p \tag{8}$$

where $\tilde{\mathbf{x}}_p = [\mathbf{x}_p^T,\,\mathbf{o}_p^T]^T$ is the augmented input vector of length
$N_u = N + N_h + 1$, $\mathbf{W}_o$ is formed as $[\mathbf{W}_{oi} : \mathbf{W}_{oh}]$
and has a size of M by Nu. The output weights can be solved
for by setting $\partial E/\partial \mathbf{W}_o = 0$, which leads to a set of linear
equations given by,
$$\mathbf{C}_a = \mathbf{R}_a\cdot\mathbf{W}_o^T \tag{9}$$

where

$$\mathbf{C}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\tilde{\mathbf{x}}_p\,\mathbf{t}_p^T, \qquad \mathbf{R}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\tilde{\mathbf{x}}_p\,\tilde{\mathbf{x}}_p^T$$
Equation (9) is most easily solved using orthogonal least
squares (OLS) [22].
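The sketch below carries out OWO by forming the correlation matrices of equation (9) and solving for the combined output weight matrix Wo = [Woi : Woh]. A NumPy least-squares solve is used here in place of the orthogonal least squares procedure of [22]; that substitution, and the sigmoid activation, are assumptions for illustration.

```python
import numpy as np

def owo(W, X, T):
    """Output weight optimization: solve Ca = Ra Wo^T for Wo (eqs. 8-9)."""
    Nv = X.shape[0]
    Op = 1.0 / (1.0 + np.exp(-(X @ W.T)))   # hidden activations (sigmoid assumed)
    Xa = np.hstack([X, Op])                 # augmented vector [x_p^T, o_p^T]^T of length Nu
    Ra = (Xa.T @ Xa) / Nv                   # autocorrelation matrix, Nu x Nu
    Ca = (Xa.T @ T) / Nv                    # cross-correlation matrix, Nu x M
    Wo = np.linalg.lstsq(Ra, Ca, rcond=None)[0].T   # M x Nu
    return Wo[:, :X.shape[1]], Wo[:, X.shape[1]:]   # Woi, Woh
```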
The input weight connecting the nth
input to the kth
hidden
unit is updated using,
$$w(k,n) \leftarrow w(k,n) + z_k\,g(k,n) \tag{10}$$
where, zk denotes the learning factor corresponding to the kth
hidden unit. The vector z containing the different learning
factors zk can be found using OLS from the following
relation,
$$\mathbf{H}_{molf}\cdot\mathbf{z} = \mathbf{g}_{molf}, \qquad g_{molf}(j) = \frac{-\partial E}{\partial z_j}, \quad h_{molf}(k,j) = \frac{\partial^2 E}{\partial z_k\,\partial z_j} \tag{11}$$
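The sketch below assembles one MOLF step, reusing `forward` and `mse` from the earlier listings. Because the closed-form entries of g_molf and H_molf are not reproduced here, the derivatives in equation (11) are estimated numerically with finite differences; this numerical stand-in, and the least-squares solve in place of OLS, are assumptions for illustration only.

```python
def molf_step(W, Woi, Woh, X, T, G, eps=1e-4):
    """One MOLF update: solve H_molf z = g_molf (eq. 11), then apply eq. (10).

    G is the negative gradient matrix of eq. (5); z[k] scales the update of
    the weights feeding hidden unit k."""
    Nh = W.shape[0]

    def E_of(z):                               # error after the per-unit scaled update
        _, _, Y = forward(W + z[:, None] * G, Woi, Woh, X)
        return mse(T, Y)

    z0, I = np.zeros(Nh), np.eye(Nh)
    g = np.zeros(Nh)
    H = np.zeros((Nh, Nh))
    for k in range(Nh):
        g[k] = -(E_of(eps * I[k]) - E_of(-eps * I[k])) / (2 * eps)        # -dE/dz_k
        for j in range(k, Nh):
            H[k, j] = H[j, k] = (E_of(eps * (I[k] + I[j])) - E_of(eps * I[k])
                                 - E_of(eps * I[j]) + E_of(z0)) / eps**2  # d2E/dz_k dz_j
    z = np.linalg.lstsq(H, g, rcond=None)[0]
    return W + z[:, None] * G
```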
IV. ANALYSIS OF THE MOLF ALGORITHM
In this section, the structure of the MLP is analyzed in
detail using equivalent networks. The relationship between
MOLF and linear transformations of the net function vector
is investigated.
A. Discussion of Equivalent Networks
Let MLP 1 have a net function vector np and output vector
yp after back-propagation training as discussed previously in
Section II-B. In a second network, MLP 2, trained
similarly, the net function vector and the output vector are
respectively denoted by n′p and yp′. The bypass and output
weight matrices, along with the hidden unit activation functions,
are considered to be equal for both MLP 1 and MLP 2.
Based on this information the output vectors for MLP 1 and
MLP 2 are respectively,
$$y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\bigl(n_p(k)\bigr) \tag{12}$$

$$y'_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\bigl(n'_p(k)\bigr) \tag{13}$$
In order to make MLP 2 strongly equivalent to MLP 1
their respective output vectors yp and yp′ have to be equal.
Based on equations (12) and (13) this can be achieved only
when their respective net function vectors are equal [23].
MLP 2 can be made strongly equivalent to MLP 1 by
linearly transforming its net function vector n′p before
passing it as an argument to the activation function f(·), given by

$$\mathbf{n}_p = \mathbf{C}\cdot\mathbf{n}'_p \tag{14}$$
where C is a linear transformation matrix of size Nh by Nh.
On applying the linear transformation to the net function
vector in equation (13) we get,
$$y'_p(i) = y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\left(\sum_{m=1}^{N_h} c(k,m)\,n'_p(m)\right) \tag{15}$$
After making MLP 2 strongly equivalent to MLP 1, the
net function vector np can be related to its input weights W
and the MLP 2 net function vector n′p as,

$$n_p(k) = \sum_{n=1}^{N+1} w(k,n)\,x_p(n) = \sum_{m=1}^{N_h} c(k,m)\,n'_p(m) \tag{16}$$
Similarly, the MLP 2 net function vector n′p can be related
to its weights W′ as,

$$n'_p(k) = \sum_{n=1}^{N+1} w'(k,n)\,x_p(n) \tag{17}$$
On substituting (17) into (16) we get,
$$n_p(k) = \sum_{m=1}^{N_h} c(k,m)\sum_{n=1}^{N+1} w'(m,n)\,x_p(n) \tag{18}$$
It is observed above that the net function of MLP 1 is
found by linear transformation of the input weight matrix W′
of MLP 2 given by,
$$\mathbf{W} = \mathbf{C}\cdot\mathbf{W}' \tag{19}$$

If the elements of the C matrix are found through an
optimality criterion, then optimally transformed input weights W
for MLP 1 can be computed from the input weight matrix W′ of MLP 2.
B. Optimal transformation of net function
The discussion of equivalent networks in Section IV-A
suggests that an optimal set of net functions can be obtained
by linear transformation of the input weights W. In this
section, the input weight update equations are derived for
OWO-BP training, using the linear transformation matrix C
from MLP 2.
MLP 1 and MLP 2 are strongly equivalent as before, so the
outputs of the two networks after the transformation are given
by equation (15). The elements of the negative gradient
matrix G′ of MLP 2, found from equations (2) and (15), are
defined as
$$g'(u,v) = \frac{-\partial E}{\partial w'(u,v)} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\frac{\partial y_p(i)}{\partial w'(u,v)} \tag{20}$$

$$\frac{\partial y_p(i)}{\partial w'(u,v)} = \sum_{k=1}^{N_h} w_{oh}(i,k)\,o'_p(k)\,c(k,u)\,x_p(v)$$
Rearranging terms in (20) results in,
$$g'(u,v) = \sum_{k=1}^{N_h} c(k,u)\left\{\frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\,w_{oh}(i,k)\,o'_p(k)\,x_p(v)\right\} = \sum_{k=1}^{N_h} c(k,u)\,\frac{-\partial E}{\partial w(k,v)} \tag{21}$$
which is abbreviated as,
$$\mathbf{G}' = \mathbf{C}^T\cdot\mathbf{G} \tag{22}$$
The input weight update equation for MLP 2 based on its
gradient matrix G′ is given by,
$$\mathbf{W}' = \mathbf{W}' + z\cdot\mathbf{G}' \tag{23}$$
On pre-multiplying this equation by C and using equations
(19) and (22) we obtain
$$\mathbf{W} = \mathbf{W} + z\cdot\mathbf{G}'' \tag{24}$$

where

$$\mathbf{G}'' = \mathbf{R}\cdot\mathbf{G}, \qquad \mathbf{R} = \mathbf{C}\cdot\mathbf{C}^T$$
Equation (24) is nothing but the weight update equation of
MLP 1, which would result in the network being strongly
equivalent to MLP 2. Thus using equation (24), the input
weights of MLP 1 can be updated so that its net functions
are optimal as in MLP 2.
Lemma 1: For a given R matrix, there are an uncountably
infinite number of C matrices.
The transformed gradient G′′ of MLP 1 is given in terms
of the original gradient G as,
$$g''(u,v) = \sum_{k=1}^{N_h} r(u,k)\,g(k,v) \tag{25}$$

$$g(k,v) = \frac{-\partial E}{\partial w(k,v)} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\bigl[t_p(i)-y_p(i)\bigr]\,w_{oh}(i,k)\,o'_p(k)\,x_p(v) \tag{26}$$
Equations (24) and (25) suggest that MLP 1 could be
trained with optimal net functions using only the knowledge
of the linear transformation matrix R.
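Lemma 1 and the transformed update of equation (24) can be illustrated numerically: any orthogonal matrix Q yields a second matrix C·Q with the same R = C·C^T, so the update W ← W + z·R·G is unchanged. The random matrices below are assumptions used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Nh = 4
C = rng.normal(size=(Nh, Nh))
Q, _ = np.linalg.qr(rng.normal(size=(Nh, Nh)))  # random orthogonal matrix
C2 = C @ Q                                      # a different C giving the same R
print(np.allclose(C @ C.T, C2 @ C2.T))          # True: many C matrices share one R
```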
C. Multiple optimal learning factors
Equations (24) and (25) give a method for optimally
transforming the net functions of an MLP by using the
transformation matrix R. In this subsection, the MOLF
approach is derived from equations (24) and (25). Let the R
matrix in equation (24) be diagonal. In this case equation
(24) becomes,
$$w(k,n) = w(k,n) + z\,r(k)\,g(k,n) \tag{27}$$

where r(k) denotes the kth diagonal element of R. On
comparing equations (10) and (27), the expression for the
optimal learning factor zk is

$$z_k = z\cdot r(k) \tag{28}$$
Equation (28) suggests that using the MOLF algorithm for
training a MLP is equivalent to optimally transforming the
net function vector using a diagonal transformation matrix.
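The equivalence stated in equation (28) can be checked directly: applying the update of equation (24) with a diagonal R is the same as giving each hidden unit its own learning factor z_k = z·r(k), as in equation (27). The random W, G, and r below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Nh, Ncols = 3, 5
W = rng.normal(size=(Nh, Ncols))
G = rng.normal(size=(Nh, Ncols))                # negative gradient matrix of eq. (5)
z = 0.1
r = rng.uniform(0.5, 2.0, size=Nh)              # diagonal elements of R
W_transform = W + z * np.diag(r) @ G            # eq. (24) with diagonal R
W_molf = W + (z * r)[:, None] * G               # eq. (27): per-unit factors z_k = z r(k)
print(np.allclose(W_transform, W_molf))         # True
```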
V. EFFECTS OF COLLAPSING THE MOLF HESSIAN
This section presents a computational analysis of the
existing MOLF algorithm followed by a proposed method to
collapse the MOLF Hessian matrix to create fewer optimal
learning factors.
A. Computational Analysis of the MOLF algorithm
In the existing MOLF algorithm, a significant amount of
the computational burden is attributed to inverting the
Hessian matrix to find the optimal learning factors, usually
through the OLS procedure. The size of the Hessian matrix,
which is determined by the number of hidden units in the network,
directly affects the computational load of the MOLF
algorithm.
The gradient and Hessian equations of the MOLF
algorithm are given in equation (11). The computation of the
multiple optimal learning factor vector z requires that
equation (11) be solved. The number of multiplies required
for computing z through OLS is,
$$M_{ols\text{-}molf} = (N_h+1)\,N_h\,(N_h+2) \tag{29}$$
Equation (29) shows that the number of multiplications
required increases cubically with the number of hidden units.
For networks having large values of Nh, this can be time
consuming since solving (11) has to be done for every
iteration of the training algorithm. If some hidden units have
linearly dependent outputs [20], the MOLF Hessian matrix
Hmolf can be ill conditioned. This can affect MLP training.
Thus having the number of optimal learning factors equal
to the number of hidden units can result in a high
computational load and also could lead to an ill-conditioned
MOLF Hessian matrix.
B. Training with several hidden units per OLF
In this section, we modify the MOLF approach so that
each OLF is assigned to one or more hidden units. The
Hessian for this new approach is compared to Hmolf.
Let NOLF be the number of optimal learning factors used
for training in the proposed variable optimal learning factors
(VOLF) method. NOLF is selected such that it divides Nh,
with no remainder. Each optimal learning factor zv then
applies to Nh/NOLF hidden units. The MLP output based on
these conditions is given by,
$$y_p(m) = \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(m,k)\,f\!\left(\sum_{n=1}^{N+1}\bigl(w(k,n) + z_v\,g(k,n)\bigr)x_p(n)\right) \tag{30}$$

$$v = \mathrm{ceil}\!\left(k\cdot\frac{N_{OLF}}{N_h}\right)$$
where the function ceil() rounds non-integer arguments up to
the next integer.
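A short sketch of the VOLF grouping rule: each hidden unit k is mapped to the learning factor v = ceil(k·N_OLF/N_h) of equation (30), which is consistent with the index map k(j,c) used in equation (31), assigning N_h/N_OLF consecutive hidden units to each factor. The function name is illustrative.

```python
import math

def volf_groups(Nh, Nolf):
    """Learning factor index v for each hidden unit k (eq. 30), and the hidden
    units k(j, c) assigned to each factor j (index map of eq. 31); 1-based."""
    assert Nh % Nolf == 0, "N_OLF must divide N_h with no remainder"
    v = [math.ceil(k * Nolf / Nh) for k in range(1, Nh + 1)]
    groups = {j: [c + (j - 1) * (Nh // Nolf) for c in range(1, Nh // Nolf + 1)]
              for j in range(1, Nolf + 1)}
    return v, groups

v, groups = volf_groups(Nh=6, Nolf=3)
print(v)       # [1, 1, 2, 2, 3, 3]
print(groups)  # {1: [1, 2], 2: [3, 4], 3: [5, 6]}
```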
The negative gradient of the error in equation (2) with
respect to each of the optimal learning factors zv, denoted as
gvolf is computed based on equation (30) as,
$$g_{volf}(j) = \frac{-\partial E}{\partial z_j} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\,o_p(z_v,k)\right]\cdot\sum_{c=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(j,c)\bigr)\,o'_p\bigl(k(j,c)\bigr)\,n'_p\bigl(k(j,c)\bigr) \tag{31}$$

where

$$t'_p(m) = t_p(m) - \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n), \qquad k(j,c) = c + (j-1)\,\frac{N_h}{N_{OLF}},$$

$$o_p(z_v,k) = f\!\left(\sum_{n=1}^{N+1}\bigl(w(k,n) + z_v\,g(k,n)\bigr)x_p(n)\right), \qquad n'_p(k) = \sum_{n=1}^{N+1} g(k,n)\,x_p(n)$$
The Hessian matrix Hvolf is derived from (2), (30) and (31)
as,
$$h_{volf}(i,j) = \frac{\partial^2 E}{\partial z_i\,\partial z_j} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[\sum_{d=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(i,d)\bigr)\,o'_p\bigl(k(i,d)\bigr)\,n'_p\bigl(k(i,d)\bigr)\right]\cdot\left[\sum_{c=1}^{N_h/N_{OLF}} w_{oh}\bigl(m,k(j,c)\bigr)\,o'_p\bigl(k(j,c)\bigr)\,n'_p\bigl(k(j,c)\bigr)\right] \tag{32}$$
The vector z of variable optimal learning factors is found
from the negative gradient vector and Hessian matrix via the
relation,
$$\mathbf{H}_{volf}\cdot\mathbf{z} = \mathbf{g}_{volf} \tag{33}$$

where Hvolf is NOLF by NOLF and gvolf is a column vector of
dimension NOLF. Thus, by varying the number of optimal
learning factors required for training, the vector z is found
using the method discussed in Section III.
The computation required for solving equation (33) can be
adjusted by choosing the number of optimal learning factors.
When NOLF equals one, the procedure reduces to using a
single optimal learning factor as discussed in Section II; it
requires the least computation but is also the least effective.
When NOLF equals Nh, the algorithm reduces to the MOLF
algorithm described in Section III. Thus, by varying the
number of optimal learning factors between one and Nh, the
algorithm interpolates between the OLF and MOLF cases.
C. Collapsing the MOLF Hessian
Occasionally the MOLF approach can fail because of
distortion in Hmolf due to linearly dependent inputs [20] or
because it is ill-conditioned due to linearly dependent hidden
units [20]. In these cases we don’t have to redo the current
training iteration. Instead we can collapse Hmolf and gmolf
down to a smaller size and use the approach of the previous
subsection. The elements of the original MOLF Hessian
matrix Hmolf for a MLP with Nh hidden units are denoted as,
$$\mathbf{H}_{molf} = \begin{bmatrix} h_{molf}(1,1) & \cdots & h_{molf}(1,N_h) \\ \vdots & \ddots & \vdots \\ h_{molf}(N_h,1) & \cdots & h_{molf}(N_h,N_h) \end{bmatrix} \tag{34}$$

Collapsing the Hessian matrix Hmolf to a matrix Hmolf1 of size
NOLF by NOLF, we get

$$\mathbf{H}_{molf1} = \begin{bmatrix} h_{molf1}(1,1) & \cdots & h_{molf1}(1,N_{OLF}) \\ \vdots & \ddots & \vdots \\ h_{molf1}(N_{OLF},1) & \cdots & h_{molf1}(N_{OLF},N_{OLF}) \end{bmatrix} \tag{35}$$
The relationship between the elements of Hmolf1 and Hmolf is
described by,
$$h_{molf1}(m,n) = \sum_{c=1}^{N_h/N_{OLF}}\sum_{d=1}^{N_h/N_{OLF}} h_{molf}\bigl(k(m,c),\,k(n,d)\bigr) \tag{36}$$
Equations (11) and (32) show that collapsing the MOLF
Hessian matrix to size NOLF by NOLF results in a VOLF
Hessian matrix of that size. Thus the Hmolf1 Hessian matrix
of equation (35) is equivalent to the VOLF Hessian
matrix Hvolf of equation (32).
On collapsing the negative gradient vector gmolf to a vector
gmolf1 having NOLF elements we get,
$$\mathbf{g}_{molf1} = \bigl[g_{molf1}(1),\,g_{molf1}(2),\,\ldots,\,g_{molf1}(N_{OLF})\bigr]^T \tag{37}$$
The relationship between gmolf1 and gmolf is described by,
$$g_{molf1}(m) = \sum_{c=1}^{N_h/N_{OLF}} g_{molf}\bigl(k(m,c)\bigr) \tag{38}$$
Equations (11) and (31) show that collapsing the MOLF
gradient vector to size NOLF by 1 results in the VOLF
gradient vector.
Therefore collapsing the MOLF Hessian matrix Hmolf and
gradient vector gmolf is equivalent to training the MLP with
the VOLF algorithm.
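A compact sketch of the collapsing step of equations (36) and (38), assuming NumPy arrays and the block-structured index map k(m,c): with N_h/N_OLF consecutive hidden units per learning factor, the collapse is a block sum of H_molf and g_molf.

```python
def collapse_molf(H_molf, g_molf, Nolf):
    """Collapse the Nh x Nh MOLF Hessian and Nh-element gradient into the
    NOLF-sized VOLF system by summing over each factor's hidden units (eqs. 36, 38)."""
    Nh = g_molf.shape[0]
    S = Nh // Nolf                                     # hidden units per learning factor
    H1 = H_molf.reshape(Nolf, S, Nolf, S).sum(axis=(1, 3))
    g1 = g_molf.reshape(Nolf, S).sum(axis=1)
    return H1, g1   # the collapsed system H1 z = g1 is then solved in place of eq. (11)
```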
VI. RESULTS
A. Computational Burden of Different Algorithms
In this section the computational burden is calculated for
different MLP training algorithms in terms of required
multiplies. All the algorithms were implemented in
Microsoft Visual Studio 2005.
Let Nu = N + Nh + 1 denote the number of weights
connected to each output. The total number of weights in the
network is denoted as Nw = M(N + Nh + 1) + Nh(N + 1). The
number of multiplies required to solve for the output weights
using orthogonal least squares [22] is given by

$$M_{ols} = N_u\left[M(N_u+2) + \tfrac{3}{4}\,N_u(N_u+1)\right] \tag{39}$$
The numbers of multiplies required per training iteration
using BP, OWO-BP and LM are respectively given by,
$$M_{bp} = N_v\bigl[M N_u + 2N_h(N+1) + M(N+6N_h+4)\bigr] + N_w \tag{40}$$

$$M_{owo\text{-}bp} = N_v\left[2N_h(N+2) + M(N_u+1) + \frac{N_u(N+1)}{2} + M(N+6N_h+4)\right] + M_{ols} + N_h(N+1) \tag{41}$$

$$M_{lm} = M_{bp} + N_v\bigl[M N_u\bigl(N_u + 3N_h(N+1)\bigr) + 4N_h^2(N+1)^2\bigr] + N_w^2 + N_w^3 \tag{42}$$
Equation (29) gives the multiplies required for computing
the optimal learning factors of the MOLF algorithm.
Similarly the number of multiplies required for the VOLF
algorithm is given as,
$$M_{ols\text{-}volf} = (N_{OLF}+1)\,N_{OLF}\,(N_{OLF}+2) \tag{43}$$
The total number of multiplies required for the MOLF and
VOLF algorithms are respectively given as,
$$M_{molf} = M_{owo\text{-}bp} + N_v\bigl[N_h(N+4) - M(7N - N_h + 4)\bigr] + N_h^3 + M_{ols\text{-}molf} \tag{44}$$

$$M_{volf} = M_{owo\text{-}bp} + N_v\bigl[N_h(N+4) - M(7N - N_h + 4)\bigr] + N_h^3 + M_{ols\text{-}volf} \tag{45}$$
Equations (44) and (45) show that the number of
multiplies required in the MOLF and VOLF algorithms
consists of the OWO-BP multiplies and the multiplies needed to compute
the Hessian and negative gradient matrix, along with the
number of multiplies to invert the Hessian matrix. The
matrices Hmolf and Hvolf are respectively inverted in the
MOLF and VOLF algorithms.
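The sketch below evaluates the OLS solve counts of equations (29) and (43), as reconstructed above, for a few network sizes, showing the cubic growth with N_h and the reduction obtained by collapsing to N_OLF = N_h/2 learning factors.

```python
def m_ols_factors(n):
    """Multiplies for solving an n-factor system through OLS (eqs. 29 and 43)."""
    return (n + 1) * n * (n + 2)

for Nh in (10, 20, 40, 80):
    print(Nh, m_ols_factors(Nh), m_ols_factors(Nh // 2))
```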
B. Experimental results
Here the performance of VOLF is compared with those of
MOLF, OWO-BP, LM and CG. In CG and LM, all weights
are varied in every iteration. In MOLF, VOLF and OWO-BP
we first solve linear equations for the output weights and
subsequently update the input weights.
The data sets used for the simulations are listed in Table I.
The training for all the data sets is done on inputs
normalized to zero mean and unit variance.
The optimal number of hidden units to be used in the
MLP is determined by network pruning using the method of
[24]. Then the k-fold validation procedure is used to obtain
the average training and validation errors. In k-fold
validation, the data set is split into k non-overlapping parts
of equal size, and (k − 1) parts are used for training and the
remaining one part is used for validation. The procedure is
repeated k times, picking a different part for use as the
validation set every time.

Fig. 2. Twod data set: average error vs. number of iterations.
Fig. 3. Twod data set: average error vs. number of multiplies.

For the simulations, k is chosen as
10. The number of iterations used for OWO-BP, CG, MOLF
and VOLF is chosen as 4000. For the LM algorithm 300
iterations are performed. For the VOLF algorithm NOLF =
Nh/2.
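A brief sketch of the k-fold procedure described above, assuming a generic, hypothetical `train_and_score` routine that returns the training and validation MSE for one fold; only the splitting and averaging logic is shown.

```python
import numpy as np

def k_fold_errors(X, T, train_and_score, k=10):
    """Average training and validation MSE over k non-overlapping folds."""
    folds = np.array_split(np.random.permutation(X.shape[0]), k)
    e_trn, e_val = [], []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        etr, ev = train_and_score(X[trn], T[trn], X[val], T[val])  # hypothetical trainer
        e_trn.append(etr)
        e_val.append(ev)
    return np.mean(e_trn), np.mean(e_val)
```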
The average training error and the number of multiplies are
calculated for every iteration on a particular data set using the
different training algorithms. These measurements are then
plotted to provide a graphical representation of the
efficiency and quality of the different training algorithms.
These plots for different datasets are shown below.
For the Twod.tra data file [25], the MLP is trained with 30
hidden units. In Fig. 2, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale). In Fig. 3, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). The LM algorithm shows the
lowest error of all the algorithms used for training, but its
computational burden may make it unsuitable for training
purposes. It is noted that the average error of the VOLF
algorithm lies between that of the OWO-BP and MOLF
algorithms for every iteration.
Fig. 4. Single2 data set: average error vs. number of iterations.
Fig. 5. Single2 data set: average error vs. number of multiplies.
TABLE I
Data set description
Data Set Name No. of Inputs No. of Outputs No. of Patterns
Twod 8 7 1768
Single2 16 3 10000
Power12 12 1 1414
Concrete 8 1 1030
Fig. 6. Power12 data set: average error vs. number of iterations.
Fig. 7. Power12 data set: average error vs. number of multiplies.
For the Single2.tra data file [25], the MLP is trained with
20 hidden units. In Fig. 4, the average mean square error
(MSE) for training from 10-fold validation is plotted versus
the number of iterations for each algorithm (shown on a
log10 scale). In Fig. 5, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). In the initial iterations,
the MOLF algorithm shows the least error. But overall, as
for the previous dataset, the LM algorithm shows the lowest
error of all the algorithms used for training. As before the
average error of the VOLF algorithm lies between that of the
OWO-BP and MOLF algorithms.
For the Power12trn.tra data file [25], the MLP is trained
with 25 hidden units. In Fig. 6, the average mean square
error (MSE) for training from 10-fold validation is plotted
versus the number of iterations for each algorithm (shown on
a log10 scale). In Fig. 7, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). For this dataset the
MOLF algorithm performs better than all the other
algorithms. As seen previously, the computational
requirements of the LM algorithm are very high. As before,
the average error of the VOLF algorithm lies between that of
the OWO-BP and MOLF algorithms for each iteration.

Fig. 8. Concrete data set: average error vs. number of iterations.
Fig. 9. Concrete data set: average error vs. number of multiplies.
For the Concrete data file [26], the MLP is trained with 15
hidden units. In Fig. 8, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale).
In Fig. 9, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). As before, the average error of the
VOLF algorithm lies between that of the OWO-BP and
MOLF algorithms for every iteration.
Table II compares the average training and validation
errors of the MOLF and VOLF algorithms with the other
algorithms for different data files. For each data set, the
average training and validation errors are found after 10-fold
validation.
From the plots and the table presented, it can be inferred
that the error performance of the VOLF algorithm lies
between those of the MOLF and OWO-BP algorithms. The
VOLF and MOLF algorithms are also found to produce
good results approaching those of the LM algorithm, with
computational requirements on the order of those of first order
training algorithms.
VII. CONCLUSION
In this paper we have put the MOLF training algorithm on
a firm theoretical footing by showing its derivation from an
optimal linear transformation of the hidden layer's net
function vector. We have also developed the VOLF training
algorithm, which associates several hidden units with each
optimal learning factor. This modification of the MOLF
approach will allow us to train much larger networks
efficiently.
REFERENCES
[1] R. C. Odom, P. Pavlakos, S.S. Diocee, S. M. Bailey, D. M. Zander,
and J. J. Gillespie, “Shaly sand analysis using density-neutron
porosities from a cased-hole pulsed neutron system,” SPE Rocky
Mountain regional meeting proceedings: Society of Petroleum
Engineers, pp. 467-476, 1999.
[2] A. Khotanzad, M. H. Davis, A. Abaye, D. J. Maratukulam, “An
artificial neural network hourly temperature forecaster with
applications in load forecasting,” IEEE Transactions on Power
Systems, vol. 11, No. 2, pp. 870-876, May 1996.
[3] S. Marinai, M. Gori, G. Soda, “Artificial neural networks for
document analysis and recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 27, No. 1, pp. 23-35, 2005.
[4] J. Kamruzzaman, R. A. Sarker, R. Begg, “Artificial Neural Networks:
Applications in Finance and Manufacturing,” Idea Group Inc (IGI),
2006.
[5] L. Wang and X. Fu, Data Mining With Computational Intelligence,
Springer-Verlag, 2005.
[6] K. Hornik, M. Stinchcombe, and H.White, “Multilayer feedforward
networks are universal approximators.” Neural Networks, Vol. 2, No.
5, 1989, pp.359-366.
[7] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation
of an unknown mapping and its derivatives using multilayer
feedforward networks,” Neural Networks,vol. 3,1990, pp.551-560.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine
Learning, 20:273-297, November 1995.
[9] G. Cybenko, “Approximations by superposition of a sigmoidal
function,” Mathematics of Control, Signals, and Systems (MCSS),
vol. 2, pp. 303-314, 1989.
[10] D. W. Ruck et al., “The multi-layer perceptron as an approximation to
a Bayes optimal discriminant function,” IEEE Transactions on Neural
Networks, vol. 1, No. 4, 1990.
[11] Michael T. Manry, Steven J. Apollo, and Qiang Yu, “Minimum mean
square estimation and neural networks,” Neurocomputing, vol. 13,
September 1996, pp.59-74.
[12] Q. Yu, S.J. Apollo, and M.T. Manry, “MAP estimation and the multi-
layer perceptron,” Proceedings of the 1993 IEEE Workshop on
Neural Networks for Signal Processing, pp. 30-39, Linthicum Heights,
Maryland, Sept. 6-9, 1993.
[13] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning internal
representations by error propagation,” in D.E. Rumelhart and J.L.
McClelland (Eds.), Parallel Distributed Processing, vol. I, Cambridge,
Massachusetts: The MIT Press, 1986.
[14] Changhua Yu, Michael T. Manry, and Jiang Li, "Effects of
nonsingular pre-processing on feed-forward network training ".
International Journal of Pattern Recognition and Artificial Intelligence
, Vol. 19, No. 2 (2005) pp. 217-247.
[15] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, and
D.M. Goodman, "Ship Wake-Detection Procedure Using Conjugate
Gradient Trained Artificial Neural Networks," IEEE Trans. on
Geoscience and Remote Sensing, Vol. 29, No. 5, September 1991, pp.
718-726.
[16] S. McLoone and G. Irwin, “A variable memory Quasi-Newton
training algorithm,” Neural Processing Letters, vol. 9, pp. 77-89,
1999.
[17] A. J. Shepherd Second-Order Methods for Neural Networks, Springer-
Verlag New York, Inc., 1997.
[18] K. Levenberg, “A method for the solution of certain problems in least
squares,” Quart. Appl. Math., Vol. 2, pp. 164-168, 1944.
[19] D. Marquardt, “An algorithm for least-squares estimation of nonlinear
parameters,” SIAM J. Appl. Math., Vol. 11, pp. 431-441, 1963.
[20] S. S. Malalur, M. T. Manry, "Multiple optimal learning factors for
feed-forward networks," accepted by The SPIE Defense, Security and
Sensing (DSS) Conference, Orlando, FL, April 2010
[21] T.H. Kim “Development and evaluation of multilayer perceptron
training algorithms”, Phd. Dissertation, The University of Texas at
Arlington, 2001.
[22] W. Kaminski, P. Strumillo, “Kernel orthonormalization in radial basis
function neural networks,” IEEE Transactions on Neural Networks,
vol. 8, Issue 5, pp. 1177 - 1183, 1997
[23] R.P Lippman, “An introduction to computing with Neural Nets,”
IEEE ASSP Magazine,April 1987.
[24] Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang
Li, Francisco Maldonado, “An integrated growing-pruning method for
feedforward network training,” Neurocomputing vol. 71, pp. 2831–
2847, 2008.
[25] Univ. of Texas at Arlington, Training Data Files –
http://wwwee.uta.edu/eeweb/ip/training_data_files.html.
[26] Univ. of California, Irvine, Machine Learning Repository -
http://archive.ics.uci.edu/ml/
TABLE II
Average 10-fold training and validation error

Data Set          OWO-BP    CG        VOLF      MOLF      LM
Twod      Etrn    0.22      0.20      0.18      0.17      0.16
          Eval    0.25      0.22      0.21      0.20      0.17
Single2   Etrn    0.83      0.63      0.51      0.16      0.03
          Eval    1.02      0.99      0.64      0.19      0.11
Power12   Etrn    6959.89   6471.98   5475.01   4177.45   4566.34
          Eval    7937.21   7132.13   6712.34   5031.23   5423.54
Concrete  Etrn    38.98     19.88     21.81     21.56     16.98
          Eval    65.12     56.12     37.23     37.21     33.12
2600

More Related Content

What's hot

A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
AllenWu
 
A comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistributionA comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistribution
IJCNCJournal
 
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data SetsMethods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Ryan B Harvey, CSDP, CSM
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)
Mostafa G. M. Mostafa
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
IJERD Editor
 
Incremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clusteringIncremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clustering
Allen Wu
 
PRML 5.5
PRML 5.5PRML 5.5
PRML 5.5
Ryuta Shitomi
 
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theoryRestricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Seongwon Hwang
 
A general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measuresA general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measures
Mehran Mesbahzadeh
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
Dai-Hai Nguyen
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
YaswanthHariKumarVud
 
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
IOSRJVSP
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
Sang Jun Lee
 
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision ApplicationsTutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Anmol Dwivedi
 
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Anmol Dwivedi
 
reportVPLProject
reportVPLProjectreportVPLProject
reportVPLProject
Sebastien Speierer
 
Classification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry featuresClassification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry features
AYUSH RAJ
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
Masahiro Suzuki
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711
RCCSRENKEI
 

What's hot (20)

A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
 
A comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistributionA comparison of efficient algorithms for scheduling parallel data redistribution
A comparison of efficient algorithms for scheduling parallel data redistribution
 
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data SetsMethods of Manifold Learning for Dimension Reduction of Large Data Sets
Methods of Manifold Learning for Dimension Reduction of Large Data Sets
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
Incremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clusteringIncremental collaborative filtering via evolutionary co clustering
Incremental collaborative filtering via evolutionary co clustering
 
PRML 5.5
PRML 5.5PRML 5.5
PRML 5.5
 
Restricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theoryRestricted Boltzman Machine (RBM) presentation of fundamental theory
Restricted Boltzman Machine (RBM) presentation of fundamental theory
 
A general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measuresA general multiobjective clustering approach based on multiple distance measures
A general multiobjective clustering approach based on multiple distance measures
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
 
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...
 
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
 
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision ApplicationsTutorial on Markov Random Fields (MRFs) for Computer Vision Applications
Tutorial on Markov Random Fields (MRFs) for Computer Vision Applications
 
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)
 
reportVPLProject
reportVPLProjectreportVPLProject
reportVPLProject
 
Classification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry featuresClassification of handwritten characters by their symmetry features
Classification of handwritten characters by their symmetry features
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711
 

Similar to Analysis_molf

1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main
Praveen Jesudhas
 
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTIONMODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
ijwmn
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
ijsrd.com
 
parallel
parallelparallel
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKSSLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
IJCI JOURNAL
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
ArchiLab 7
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
IJMIT JOURNAL
 
International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)
IJMIT JOURNAL
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
IJMIT JOURNAL
 
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithmEfficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Cemal Ardil
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
ESCOM
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
IJDKP
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
Wireilla
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ijfls
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
ESCOM
 
directed-research-report
directed-research-reportdirected-research-report
directed-research-report
Ryen Krusinga
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
Amr E. Mohamed
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX model
IJECEIAES
 
BNL_Research_Report
BNL_Research_ReportBNL_Research_Report
BNL_Research_Report
Brandon McKinzie
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
IJERA Editor
 

Similar to Analysis_molf (20)

1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main1-s2.0-S092523121401087X-main
1-s2.0-S092523121401087X-main
 
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTIONMODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
 
parallel
parallelparallel
parallel
 
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKSSLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
 
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
 
International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)International Journal of Managing Information Technology (IJMIT)
International Journal of Managing Information Technology (IJMIT)
 
An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...An improved spfa algorithm for single source shortest path problem using forw...
An improved spfa algorithm for single source shortest path problem using forw...
 
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithmEfficient realization-of-an-adfe-with-a-new-adaptive-algorithm
Efficient realization-of-an-adfe-with-a-new-adaptive-algorithm
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
directed-research-report
directed-research-reportdirected-research-report
directed-research-report
 
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsDSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and Systems
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX model
 
BNL_Research_Report
BNL_Research_ReportBNL_Research_Report
BNL_Research_Report
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
 

Analysis_molf

  • 1. Abstract—The effects of transforming the net function vector in the multilayer perceptron are analyzed. The use of optimal diagonal transformation matrices on the net function vector is proved to be equivalent to training the network using multiple optimal learning factors (MOLF). A method for linearly compressing large ill- conditioned MOLF Hessian matrices into smaller well- conditioned ones is developed. This compression approach is shown to be equivalent to using several hidden units per learning factor. The technique is extended to large networks. In simulations, the proposed algorithm performs almost as well as the Levenberg Marquardt algorithm with the computational complexity of a first order training algorithm. I. INTRODUCTION Multilayer perceptrons (MLPs) have found wide acceptance in a variety of applications ranging from parameter estimation [1][2], document analysis and recognition [3], finance, manufacturing [4] and data mining [5]. The dynamically tuned basis functions of the MLP along with its approximation capabilities [6] [7] make it an effective tool for classification and approximation. The relatively short evaluation time makes it a better choice than many other classification and approximation methods such as support vector machines (SVM) [8]. The universal approximation [9] property of the MLP along with its ability to mimic Bayes discriminant [10], optimal L2 norm estimates [11] and maximum a-posteriori (MAP) [12] estimates give it a strong theoretical foundation. Despite of the above advantages there are limitations due to the lack of scalable and effective training algorithms. Training algorithms for the MLP can be briefly classified as first and second order algorithms. First order algorithms like back-propagation (BP) [13] and output weight optimization- backpropagation (OWO-BP) require fewer multiplies and data passes and hence take less time per iteration. However, they are sensitive to input means and gains [14]. Note that conjugate gradient (CG) [15] has the same sensitivity. CG can be considered first order as the error function is not quadratic. Second order methods on the other hand, are not as sensitive to input means and gains, but they do have their own inherent limitations. Second order algorithms related to Manuscript received February 10, 2011. P. Jesudhas, M. Manry, and R. Rawat are with the University of Texas at Arlington, TX 76010, USA (e-mail: manry@uta.edu). S. Malalur is with O-Geo Well LLC in Arlington, Texas (e-mail: sanjeev.malalur@gmail.com). Newton’s method often have non-positive definite or singular Hessian matrices [16][17] which result in unstable training. Hence the Levenberg-Marquardt (LM) algorithm [18][19] is used instead. The LM algorithm is very computationally intensive due to the large size of the Hessian matrix so its use in training large networks is often limited. In [20] a two-stage training algorithm known as multiple optimal learning factors (MOLF-BP) is found to produce results comparable to the LM algorithm with computations similar to first order training algorithms. However, the MOLF-BP approach still has certain limitations. The MOLF Hessian can be ill-conditioned. The size of the MOLF Hessian matrix Hmolf can also become prohibitively large for larger networks, resulting in scalability problems. This paper gives a strong theoretical foundation to the basic MOLF algorithm by relating it to optimally transforming the net function vector of the MLP. 
For large networks, the scalability issues of the basic MOLF algorithm are dealt with by reducing the MOLF Hessian matrice's size via a reduction in the number of optimal learning factors. The organization of this paper is as follows. Section II starts with an introduction to the MLP notation along with the basic back-propagation training of the MLP. Section III includes a review of the MOLF procedure in detail. An analysis of the existing MOLF procedure with equivalent networks is provided in section IV. The idea of collapsing the MOLF Hessian matrix is explained in section V along with its advantages. Section VI presents experimental results. II. THE MULTILAYER PERCEPTRON A. Notation In the fully connected MLP of Fig. 1, input weights w(k,n) connect the nth input to the kth hidden unit. Output weights woh(m,k) connect the kth hidden unit’s activation op(k) to the mth output yp(m), which has a linear activation. The bypass weight woi(m,n) connects the nth input to the mth output. The training data, described by the set {xp,tp} consists of N-dimensional input vectors xp and M- dimensional desired output vectors, tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the data set. In order to handle thresholds in the hidden and output layers, the input vectors are augmented by an extra element as xp = [xp(1), xp(2),…., xp(N+1)]T where xp(N+1) = 1. Let Nh denote the number of hidden units. The dimensions of the weight matrices W, Woh and Woi are respectively Nh by (N+1), M by Nh and M by (N+1). Note that bypass weights Analysis and Improvement of Multiple Optimal Learning Factors for Feed-Forward Networks Praveen Jesudhas, Michael T. Manry, Rohit Rawat, and Sanjeev Malalur Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 – August 5, 2011 978-1-4244-9637-2/11/$26.00 ©2011 IEEE 2593
  • 2. Fig. 1. A fully connected Multilayer Perceptron woi relieve the hidden units of the burden of approximating the linear part of the desired mapping. This reduces the necessary number of hidden units. The vector of hidden layer net functions, np and the actual output of the network, yp can be written as np = W · xp (1) yp = Woi · xp + Woh · op where the kth element of the hidden unit activation vector op is calculated as op(k) = f(np(k)) and f(.) denotes the hidden layer activation function. Training an MLP typically involves minimizing the mean squared error between the desired and the actual network outputs, defined as E = 1 Nv [ tp i - yp i ] 2 M i=1 Nv p=1 = 1 Nv Ep, Nv p=1 (2) Ep = [ tp i - yp i ] 2 M i=1 where Ep is the cumulative squared error for pattern p. B. Discussion of Back-Propagation BP training [13, 22, 23, 24] is a gradient based learning algorithm which involves changing system parameters so that the output error is reduced. For a MLP, the expression for actual output found from the input is given by equation (1). The weights of the MLP are changed by BP so that the error E of equation (2), which computes the sum of the squared errors between the actual and desired outputs, decreases. The weights are updated using gradients of the error with respect to the corresponding weights. The required negative error gradients are calculated using the chain rule utilizing delta functions, which are negative gradients of Ep with respect to net functions. For the pth pattern, output and hidden layer delta functions [13][21] are respectively found as, δpo i = 2 tp i - yp i 3 δp k = f' np k δpo M i=1 (i)woh(i,k) Now, the negative gradient of E with respect to w(k,n) is, g k,n = -∂E ∂w(k,n) = 1 Nv δp Nv p=1 k xp n (4) The matrix of negative partial derivatives can therefore be written as, G = 1 Nv δp Nv p=1 · xp T (5) where δp = [δp(1) , δ p(2)… , δ p(Nh) ]T . If steepest descent is used to modify the hidden weights, W is updated in a given iteration as, W ← W + ∆ W, (6) ∆ W = z·G where z is the learning factor, which can be obtained optimally from the expression, z = -∂E/ ∂z ∂2 E/ ∂z2 (7) III. REVIEW OF MOLF ALGORITHM The MOLF algorithm makes use of the BP procedure to update the input layer weights W, where a different learning factor zk is used to update weights feeding into the kth hidden unit. The output weights Woh and bypass weights Woi are found using output weight optimization (OWO). Output weight optimization (OWO) is a technique for finding weights connected to the outputs of the network. Since the outputs have linear activations, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the outputs given in (1) can be re-written as yp = Wo· xp (8) where xp=[ T , T ] T is the augmented input vector of length Nu where Nu equals N + Nh + 1, Wo is formed as [Woi : Woh] and has a size of M by Nu . The output weights can be solved for by setting ∂E ∂Wo⁄ =0 which leads to a set of linear equations given by, Ca= Ra·Wo T (9) where, Ca= 1 Nv xp Nv p=1 · tp T , Ra= 1 Nv xp· Nv p=1 xp T Equation (9) is most easily solved using orthogonal least squares (OLS) [22]. The input weight connecting the nth input to the kth hidden unit is updated using, w k,n ← w k,n + zk·g k,n (10) where, zk denotes the learning factor corresponding to the kth hidden unit. The vector z containing the different learning xp(1) xp(2) xp(3) xp(N+1) yp(1) yp(2) yp(3) yp(M) W Woh np(1) op(1) op(Nh) Woi Input Layer Hidden Layer Output Layer np(Nh) 2594
The vector z containing the different learning factors zk can be found using OLS from the relation

Hmolf · z = gmolf,    gmolf(j) = -∂E/∂zj,    hmolf(k,j) = ∂^2E/(∂zk ∂zj)    (11)

IV. ANALYSIS OF THE MOLF ALGORITHM

In this section, the structure of the MLP is analyzed in detail using equivalent networks, and the relationship between MOLF and linear transformations of the net function vector is investigated.

A. Discussion of Equivalent Networks

Let MLP 1 have net function vector np and output vector yp after back-propagation training, as discussed in section II-B. In a second, similarly trained network called MLP 2, the net function vector and the output vector are denoted by n'p and y'p respectively. The bypass and output weight matrices, along with the hidden unit activation functions, are the same for MLP 1 and MLP 2. The output vectors of MLP 1 and MLP 2 are respectively

yp(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f(np(k))    (12)

y'p(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f(n'p(k))    (13)

In order to make MLP 2 strongly equivalent to MLP 1, their respective output vectors yp and y'p have to be equal. Based on equations (12) and (13), this can be achieved only when their net function vectors are equal [23]. MLP 2 can be made strongly equivalent to MLP 1 by linearly transforming its net function vector before passing it as an argument to the activation function f(.), so that

np = C · n'p    (14)

where C is a linear transformation matrix of size Nh by Nh. Applying this linear transformation to the net function vector in equation (13), we get

y'p(i) = yp(i) = Σ_{n=1}^{N+1} woi(i,n) xp(n) + Σ_{k=1}^{Nh} woh(i,k) f( Σ_{m=1}^{Nh} c(k,m) n'p(m) )    (15)

After making MLP 2 strongly equivalent to MLP 1, the net function vector np can be related to the input weights W of MLP 1 and to the net function vector n'p as

np(k) = Σ_{n=1}^{N+1} w(k,n) xp(n) = Σ_{m=1}^{Nh} c(k,m) n'p(m)    (16)

Similarly, the MLP 2 net function vector can be related to its weights W' as

n'p(k) = Σ_{n=1}^{N+1} w'(k,n) xp(n)    (17)

Substituting (17) into (16), we get

np(k) = Σ_{m=1}^{Nh} c(k,m) Σ_{n=1}^{N+1} w'(m,n) xp(n)    (18)

Thus the net functions of MLP 1 are obtained by a linear transformation of the input weight matrix W' of MLP 2, given by

W = C · W'    (19)

If the elements of the C matrix are found through an optimality criterion, then an optimal input weight matrix W for MLP 1 can be computed from the input weight matrix W' of MLP 2.
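The weight relationship of equation (19) is easy to verify numerically. The short sketch below (illustrative NumPy with arbitrary small sizes, not the authors' code) builds MLP 2 weights W', forms W = C · W', and checks that the two net function vectors satisfy np = C · n'p, as in equations (14), (16) and (17).

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nh = 5, 4

W_prime = rng.standard_normal((Nh, N + 1))   # input weights of MLP 2
C = rng.standard_normal((Nh, Nh))            # linear transformation matrix
W = C @ W_prime                              # eq. (19): input weights of MLP 1

xp = np.append(rng.standard_normal(N), 1.0)  # augmented input
n_prime = W_prime @ xp                       # net functions of MLP 2, eq. (17)
n_mlp1 = W @ xp                              # net functions of MLP 1, eq. (16)

# eq. (14): the two net function vectors are related by the same transformation C
assert np.allclose(n_mlp1, C @ n_prime)
```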
B. Optimal transformation of the net function

The discussion of equivalent networks in section IV-A suggests that an optimal set of net functions can be obtained by a linear transformation of the input weights. In this section, the input weight update equations are derived for OWO-BP training using the linear transformation matrix C from MLP 2. MLP 1 and MLP 2 are strongly equivalent as before, so the output of the two networks after the transformation is given by equation (15). The elements of the negative gradient matrix G' of MLP 2, found from equations (2) and (15), are

g'(u,v) = -∂E/∂w'(u,v) = (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] · ∂yp(i)/∂w'(u,v),    ∂yp(i)/∂w'(u,v) = Σ_{k=1}^{Nh} woh(i,k) o'p(k) c(k,u) xp(v)    (20)

where o'p(k) = f'(np(k)). Rearranging terms in (20) results in

g'(u,v) = Σ_{k=1}^{Nh} c(k,u) { (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] woh(i,k) o'p(k) xp(v) } = Σ_{k=1}^{Nh} c(k,u) [ -∂E/∂w(k,v) ]    (21)

which is abbreviated as

G' = C^T · G    (22)

The input weight update equation for MLP 2, based on its gradient matrix G', is

W' ← W' + z · G'    (23)

Pre-multiplying this equation by C and using equations (19) and (22), we obtain

W ← W + z · G'',    G'' = R · G,    R = C · C^T    (24)

Equation (24) is simply the weight update equation of MLP 1 that keeps the network strongly equivalent to MLP 2. Thus, using equation (24), the input weights of MLP 1 can be updated so that its net functions are optimal, as in MLP 2.

Lemma 1: For a given R matrix, there is an uncountably infinite number of corresponding C matrices.

The transformed gradient G'' of MLP 1 is given in terms of the original gradient G as

g''(u,v) = Σ_{k=1}^{Nh} r(u,k) g(k,v)    (25)

g(k,v) = -∂E/∂w(k,v) = (2/Nv) Σ_{p=1}^{Nv} Σ_{i=1}^{M} [tp(i) - yp(i)] woh(i,k) o'p(k) xp(v)    (26)

Equations (24) and (25) show that MLP 1 can be trained with optimal net functions using only knowledge of the linear transformation matrix R.

C. Multiple optimal learning factors

Equations (24) and (25) give a method for optimally transforming the net functions of an MLP using the transformation matrix R. In this subsection, the MOLF approach is derived from equations (24) and (25). Let the R matrix in equation (24) be diagonal. In this case, equation (24) becomes

w(k,n) ← w(k,n) + z · r(k) g(k,n)    (27)

where r(k) denotes the kth diagonal element of R. Comparing equations (10) and (27), the optimal learning factors zk can be expressed as

zk = z · r(k)    (28)

Equation (28) shows that using the MOLF algorithm to train an MLP is equivalent to optimally transforming the net function vector using a diagonal transformation matrix.

V. EFFECTS OF COLLAPSING THE MOLF HESSIAN

This section presents a computational analysis of the existing MOLF algorithm, followed by a proposed method that collapses the MOLF Hessian matrix to create fewer optimal learning factors.

A. Computational Analysis of the MOLF algorithm

In the existing MOLF algorithm, a significant part of the computational burden is attributed to inverting the Hessian matrix to find the optimal learning factors, usually through the OLS procedure. The size of this Hessian matrix, which equals the number of hidden units in the network, directly affects the computational load of the MOLF algorithm. The gradient and Hessian of the MOLF algorithm are given in equation (11), and computing the multiple optimal learning factor vector z requires solving that equation. The number of multiplies required to compute z through OLS is

Mols-molf = Nh (Nh + 1)(Nh + 2)    (29)

Equation (29) shows that the number of multiplies increases cubically with the number of hidden units. For networks with large Nh, this can be time consuming, since (11) has to be solved in every iteration of the training algorithm. In addition, if some hidden units have linearly dependent outputs [20], the MOLF Hessian matrix Hmolf can be ill conditioned, which can hinder MLP training. Thus, having one optimal learning factor per hidden unit can result in a high computational load and an ill-conditioned MOLF Hessian matrix.
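For reference, a single MOLF input-weight update can be sketched as follows. This is our own NumPy illustration, not the authors' implementation: Hmolf and gmolf are assumed to have been accumulated as in equation (11), and a least-squares solve stands in for the OLS procedure.

```python
import numpy as np

def molf_update(W, G, H_molf, g_molf):
    """One MOLF input-weight update, eqs. (10) and (11).

    W : Nh by (N+1) input weights, G : negative gradient matrix of eq. (5),
    H_molf : Nh by Nh Hessian, g_molf : length-Nh negative gradient vector.
    """
    # Solve H_molf z = g_molf for one learning factor per hidden unit;
    # lstsq tolerates an ill-conditioned H_molf better than a direct inverse.
    z, *_ = np.linalg.lstsq(H_molf, g_molf, rcond=None)
    return W + z[:, None] * G          # w(k,n) <- w(k,n) + z_k g(k,n)
```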
B. Training with several hidden units per OLF

In this section, we modify the MOLF approach so that each OLF is assigned to one or more hidden units, and the Hessian of this new approach is compared to Hmolf. Let NOLF be the number of optimal learning factors used for training in the proposed variable optimal learning factors (VOLF) method. NOLF is selected so that it divides Nh with no remainder. Each optimal learning factor zv then applies to Nh/NOLF hidden units. The MLP output under these conditions is

yp(m) = Σ_{n=1}^{N+1} woi(m,n) xp(n) + Σ_{k=1}^{Nh} woh(m,k) f( Σ_{n=1}^{N+1} [w(k,n) + zv g(k,n)] xp(n) ),    v = ceil(k · NOLF / Nh)    (30)

where the function ceil() rounds non-integer arguments up to the next integer. The negative gradient of the error in equation (2) with respect to each optimal learning factor zv, denoted gvolf, is computed from equation (30) as

gvolf(j) = -∂E/∂zj = (2/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [ t'p(m) - Σ_{k=1}^{Nh} woh(m,k) op(zv,k) ] · Σ_{c=1}^{Nh/NOLF} woh(m,k(j,c)) o'p(k(j,c)) n'p(k(j,c))    (31)

where

t'p(m) = tp(m) - Σ_{n=1}^{N+1} woi(m,n) xp(n),    k(j,c) = c + (j-1) · Nh/NOLF,    op(zv,k) = f( Σ_{n=1}^{N+1} [w(k,n) + zv g(k,n)] xp(n) )
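The assignment of hidden units to learning factors can be written out directly from the index maps above. The sketch below is an illustrative Python implementation of v = ceil(k · NOLF/Nh) from equation (30) and k(j,c) = c + (j-1)·Nh/NOLF from equation (31), using 1-based indices as in the paper; the sizes are arbitrary examples.

```python
import math

Nh, NOLF = 30, 6                 # NOLF must divide Nh with no remainder
units_per_olf = Nh // NOLF

# eq. (30): learning factor index v for hidden unit k (1-based)
def olf_index(k):
    return math.ceil(k * NOLF / Nh)

# eq. (31): the c-th hidden unit governed by learning factor j
def hidden_index(j, c):
    return c + (j - 1) * units_per_olf

groups = {j: [hidden_index(j, c) for c in range(1, units_per_olf + 1)]
          for j in range(1, NOLF + 1)}
# e.g. groups[1] == [1, 2, 3, 4, 5] and olf_index(7) == 2
```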
The Hessian matrix Hvolf is derived from (2), (30) and (31) as

hvolf(i,j) = ∂^2E/(∂zi ∂zj) = (2/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [ Σ_{d=1}^{Nh/NOLF} woh(m,k(i,d)) o'p(k(i,d)) n'p(k(i,d)) ] · [ Σ_{c=1}^{Nh/NOLF} woh(m,k(j,c)) o'p(k(j,c)) n'p(k(j,c)) ]    (32)

The vector z of variable optimal learning factors is found from the negative gradient vector and the Hessian matrix through

Hvolf · z = gvolf    (33)

where Hvolf is NOLF by NOLF and gvolf is a column vector of dimension NOLF. Thus, by varying the number of optimal learning factors used for training, the vector z is found using the method discussed in section III. The computation required to solve equation (33) can be adjusted by choosing the number of optimal learning factors. When NOLF equals one, the procedure is similar to using a single fixed optimal learning factor as discussed in section II; it requires less computation but is also less effective. When NOLF equals Nh, the algorithm reduces to the MOLF algorithm described in section III. Thus, by varying the number of optimal learning factors between one and Nh, the algorithm interpolates between the OLF and MOLF cases.

C. Collapsing the MOLF Hessian

Occasionally the MOLF approach can fail because Hmolf is distorted by linearly dependent inputs [20] or is ill-conditioned due to linearly dependent hidden units [20]. In these cases we do not have to redo the current training iteration. Instead, we can collapse Hmolf and gmolf down to a smaller size and use the approach of the previous subsection. The MOLF Hessian matrix for an MLP with Nh hidden units has elements

Hmolf = [hmolf(k,j)],    1 ≤ k, j ≤ Nh    (34)

Collapsing the Hessian matrix Hmolf to a matrix Hmolf1 of size NOLF by NOLF, we get

Hmolf1 = [hmolf1(m,n)],    1 ≤ m, n ≤ NOLF    (35)

The relationship between the elements of Hmolf1 and Hmolf is

hmolf1(m,n) = Σ_{c=1}^{Nh/NOLF} Σ_{d=1}^{Nh/NOLF} hmolf(k(m,c), k(n,d))    (36)

Equations (11) and (32) show that collapsing the MOLF Hessian matrix to size NOLF by NOLF results in the VOLF Hessian matrix of that size. Thus the collapsed Hessian Hmolf1 of equation (35) equals the VOLF Hessian matrix Hvolf of equation (32). Collapsing the negative gradient vector gmolf to a vector gmolf1 having NOLF elements gives

gmolf1 = [gmolf1(1), gmolf1(2), ..., gmolf1(NOLF)]^T    (37)

The relationship between gmolf1 and gmolf is

gmolf1(m) = Σ_{c=1}^{Nh/NOLF} gmolf(k(m,c))    (38)

Equations (11) and (31) show that collapsing the MOLF gradient vector to size NOLF by 1 results in the VOLF gradient vector. Therefore, collapsing the MOLF Hessian matrix Hmolf and gradient vector gmolf is equivalent to training the MLP with the VOLF algorithm.
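A direct way to form the collapsed quantities of equations (36) and (38) is to sum Hmolf and gmolf over the block of hidden units assigned to each learning factor. The sketch below is an illustrative NumPy implementation (0-based indexing replaces the paper's 1-based index map k(m,c)); it is not the authors' code.

```python
import numpy as np

def collapse_molf(H_molf, g_molf, NOLF):
    """Collapse H_molf (Nh x Nh) and g_molf (Nh,) to size NOLF, eqs. (36) and (38)."""
    Nh = g_molf.shape[0]
    s = Nh // NOLF                       # hidden units per learning factor
    H1 = np.zeros((NOLF, NOLF))
    g1 = np.zeros(NOLF)
    for m in range(NOLF):
        rows = slice(m * s, (m + 1) * s)
        g1[m] = g_molf[rows].sum()       # eq. (38): sum over the m-th block
        for n in range(NOLF):
            cols = slice(n * s, (n + 1) * s)
            H1[m, n] = H_molf[rows, cols].sum()   # eq. (36): block sum
    return H1, g1
```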
VI. RESULTS

A. Computational Burden of Different Algorithms

In this section, the computational burden of the different MLP training algorithms is measured in terms of required multiplies. All the algorithms were implemented in Microsoft Visual Studio 2005. Let Nu = N + Nh + 1 denote the number of weights connected to each output, and let Nw = M(N + Nh + 1) + Nh(N + 1) denote the total number of weights in the network. The number of multiplies required to solve for the output weights using orthogonal least squares [22] is

Mols = Nu [ M(Nu + 2) + (3/4) Nu (Nu + 1) ]    (39)

The numbers of multiplies required per training iteration for BP, OWO-BP and LM are respectively

Mbp = Nv [ M Nu + 2 Nh (N+1) + M(N + 6Nh + 4) ] + Nw    (40)

Mowo-bp = Nv [ 2 Nh (N+2) + M(Nu + 1) + Nu (N+1)/2 + M(N + 6Nh + 4) ] + Mols + Nh (N+1)    (41)

Mlm = Mbp + Nv [ M Nu (Nu + 3 Nh (N+1)) + 4 Nh^2 (N+1)^2 ] + Nw^2 + Nw^3    (42)

Equation (29) gives the multiplies required to compute the optimal learning factors of the MOLF algorithm. Similarly, the number of multiplies required to solve for the VOLF learning factors is

Mols-volf = NOLF (NOLF + 1)(NOLF + 2)    (43)

The total numbers of multiplies required for the MOLF and VOLF algorithms are respectively

Mmolf = Mowo-bp + Nv Nh (N+4) - M(7N - Nh + 4) + Nh^3 + Mols-molf    (44)

Mvolf = Mowo-bp + Nv Nh (N+4) - M(7N - Nh + 4) + Nh^3 + Mols-volf    (45)

Equations (44) and (45) show that the multiplies required by the MOLF and VOLF algorithms consist of the OWO-BP multiplies, the multiplies needed to compute the Hessian and negative gradient, and the multiplies needed to invert the Hessian matrix; the matrices Hmolf and Hvolf are respectively inverted in the MOLF and VOLF algorithms.
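As a rough illustration of the savings, and assuming the learning-factor solve counts of equations (29) and (43) as reconstructed above, the snippet below compares the cost of solving for Nh learning factors with the cost of solving for NOLF of them; it ignores the Hessian and gradient accumulation terms shared by (44) and (45), and the example sizes are ours.

```python
def lf_solve_multiplies(n):
    # reconstructed count of eqs. (29)/(43): multiplies for the OLS solve of an n x n system
    return n * (n + 1) * (n + 2)

for Nh, NOLF in [(30, 15), (30, 6), (100, 10)]:
    ratio = lf_solve_multiplies(Nh) / lf_solve_multiplies(NOLF)
    print(f"Nh={Nh}, NOLF={NOLF}: MOLF/VOLF solve cost ratio ~ {ratio:.1f}")
```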
B. Experimental results

Here the performance of VOLF is compared with those of MOLF, OWO-BP, LM and CG. In CG and LM, all weights are varied in every iteration. In MOLF, VOLF and OWO-BP, we first solve linear equations for the output weights and subsequently update the input weights. The data sets used for the simulations are listed in Table I.

TABLE I. Data set description
Data Set Name   No. of Inputs   No. of Outputs   No. of Patterns
Twod            8               7                1768
Single2         16              3                10000
Power12         12              1                1414
Concrete        8               1                1030

For every data set, training is done on inputs normalized to zero mean and unit variance. The number of hidden units used in each MLP is determined by network pruning using the method of [24]. The k-fold validation procedure is then used to obtain the average training and validation errors. In k-fold validation, the data set is split into k non-overlapping parts of equal size; (k - 1) parts are used for training and the remaining part is used for validation. The procedure is repeated k times, picking a different part as the validation set each time. For the simulations, k is chosen as 10.
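The 10-fold procedure described above can be sketched generically as follows (our own NumPy code, not the authors' experimental setup): the pattern indices are split into k non-overlapping parts, and each part in turn serves as the validation set while the rest are used for training.

```python
import numpy as np

def k_fold_splits(Nv, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold validation."""
    idx = np.random.default_rng(seed).permutation(Nv)
    folds = np.array_split(idx, k)          # k nearly equal, non-overlapping parts
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# The training and validation errors are then averaged over the k splits, as in Table II.
```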
The number of iterations used for OWO-BP, CG, MOLF and VOLF is 4000; for the LM algorithm, 300 iterations are performed. For the VOLF algorithm, NOLF = Nh/2. The average training error and the number of multiplies are calculated for every iteration of each training algorithm on each data set. These measurements are then plotted to give a graphical comparison of the efficiency and quality of the different training algorithms. The plots for the different data sets are shown below.

For the Twod.tra data file [25], the MLP is trained with 30 hidden units. In Fig. 2, the average training mean square error (MSE) from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 3, the average training MSE from 10-fold validation is plotted versus the required number of multiplies (shown on a log10 scale). The LM algorithm shows the lowest error of all the algorithms, but its computational burden may make it unsuitable for training. The average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at every iteration.

Fig. 2. Twod data set: average error vs. number of iterations.
Fig. 3. Twod data set: average error vs. number of multiplies.

For the Single2.tra data file [25], the MLP is trained with 20 hidden units. In Fig. 4, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 5, it is plotted versus the required number of multiplies (shown on a log10 scale). In the initial iterations the MOLF algorithm shows the least error, but overall, as for the previous data set, the LM algorithm shows the lowest error of all the algorithms. As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms.

Fig. 4. Single2 data set: average error vs. number of iterations.
Fig. 5. Single2 data set: average error vs. number of multiplies.

For the Power12trn.tra data file [25], the MLP is trained with 25 hidden units. In Fig. 6, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 7, it is plotted versus the required number of multiplies (shown on a log10 scale). For this data set, the MOLF algorithm performs better than all the other algorithms. As seen previously, the computational requirement of the LM algorithm is very high. As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at each iteration.

Fig. 6. Power12 data set: average error vs. number of iterations.
Fig. 7. Power12 data set: average error vs. number of multiplies.

For the Concrete data file [26], the MLP is trained with 15 hidden units. In Fig. 8, the average training MSE from 10-fold validation is plotted versus the number of iterations for each algorithm (shown on a log10 scale). In Fig. 9, it is plotted versus the required number of multiplies (shown on a log10 scale). As before, the average error of the VOLF algorithm lies between those of the OWO-BP and MOLF algorithms at every iteration.

Fig. 8. Concrete data set: average error vs. number of iterations.
Fig. 9. Concrete data set: average error vs. number of multiplies.

Table II compares the average training and validation errors of the MOLF and VOLF algorithms with those of the other algorithms for the different data files. For each data set, the average training and validation errors are obtained from 10-fold validation.

TABLE II. Average 10-fold training and validation error
Data Set            OWO-BP     CG         VOLF       MOLF       LM
Twod      Etrn      0.22       0.20       0.18       0.17       0.16
          Eval      0.25       0.22       0.21       0.20       0.17
Single2   Etrn      0.83       0.63       0.51       0.16       0.03
          Eval      1.02       0.99       0.64       0.19       0.11
Power12   Etrn      6959.89    6471.98    5475.01    4177.45    4566.34
          Eval      7937.21    7132.13    6712.34    5031.23    5423.54
Concrete  Etrn      38.98      19.88      21.81      21.56      16.98
          Eval      65.12      56.12      37.23      37.21      33.12
From the plots and the table presented, it can be inferred that the error performance of the VOLF algorithm lies between those of the MOLF and OWO-BP algorithms. The VOLF and MOLF algorithms also produce results approaching those of the LM algorithm, with computational requirements on the order of those of first order training algorithms.

VII. CONCLUSION

In this paper we have put the MOLF training algorithm on a firm theoretical footing by showing its derivation from an optimal linear transformation of the hidden layer's net function vector. We have also developed the VOLF training algorithm, which associates several hidden units with each optimal learning factor. This modification of the MOLF approach will allow us to train much larger networks efficiently.

REFERENCES
[1] R. C. Odom, P. Pavlakos, S. S. Diocee, S. M. Bailey, D. M. Zander, and J. J. Gillespie, "Shaly sand analysis using density-neutron porosities from a cased-hole pulsed neutron system," SPE Rocky Mountain Regional Meeting Proceedings: Society of Petroleum Engineers, pp. 467-476, 1999.
[2] A. Khotanzad, M. H. Davis, A. Abaye, and D. J. Maratukulam, "An artificial neural network hourly temperature forecaster with applications in load forecasting," IEEE Transactions on Power Systems, vol. 11, no. 2, pp. 870-876, May 1996.
[3] S. Marinai, M. Gori, and G. Soda, "Artificial neural networks for document analysis and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 23-35, 2005.
[4] J. Kamruzzaman, R. A. Sarker, and R. Begg, Artificial Neural Networks: Applications in Finance and Manufacturing, Idea Group Inc (IGI), 2006.
[5] L. Wang and X. Fu, Data Mining With Computational Intelligence, Springer-Verlag, 2005.
[6] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Networks, vol. 3, pp. 551-560, 1990.
[8] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, November 1995.
[9] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, pp. 303-314, 1989.
[10] D. W. Ruck et al., "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, vol. 1, no. 4, 1990.
[11] M. T. Manry, S. J. Apollo, and Q. Yu, "Minimum mean square estimation and neural networks," Neurocomputing, vol. 13, pp. 59-74, September 1996.
[12] Q. Yu, S. J. Apollo, and M. T. Manry, "MAP estimation and the multilayer perceptron," Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing, pp. 30-39, Linthicum Heights, Maryland, Sept. 6-9, 1993.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, vol. I, Cambridge, Massachusetts: The MIT Press, 1986.
[14] C. Yu, M. T. Manry, and J. Li, "Effects of nonsingular pre-processing on feed-forward network training," International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 2, pp. 217-247, 2005.
[15] J. P. Fitch, S. K. Lehman, F. U. Dowla, S. Y. Lu, E. M. Johansson, and D. M. Goodman, "Ship wake-detection procedure using conjugate gradient trained artificial neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 29, no. 5, pp. 718-726, September 1991.
Goodman, "Ship Wake-Detection Procedure Using Conjugate Gradient Trained Artificial Neural Networks," IEEE Trans. on Geoscience and Remote Sensing, Vol. 29, No. 5, September 1991, pp. 718-726. [16] S. McLoone and G. Irwin, “A variable memory Quasi-Newton training algorithm,” Neural Processing Letters, vol. 9, pp. 77-89, 1999. [17] A. J. Shepherd Second-Order Methods for Neural Networks, Springer- Verlag New York, Inc., 1997. [18] K. Levenberg, “A method for the solution of certain problems in least squares,” Quart. Appl. Math., Vol. 2, pp. 164.168, 1944. [19] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” SIAM J. Appl. Math., Vol. 11, pp. 431.441, 1963. [20] S. S. Malalur, M. T. Manry, "Multiple optimal learning factors for feed-forward networks," accepted by The SPIE Defense, Security and Sensing (DSS) Conference, Orlando, FL, April 2010 [21] T.H. Kim “Development and evaluation of multilayer perceptron training algorithms”, Phd. Dissertation, The University of Texas at Arlington, 2001. [22] W. Kaminski, P. Strumillo, “Kernel orthonormalization in radial basis function neural networks,” IEEE Transactions on Neural Networks, vol. 8, Issue 5, pp. 1177 - 1183, 1997 [23] R.P Lippman, “An introduction to computing with Neural Nets,” IEEE ASSP Magazine,April 1987. [24] Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang Li, Francisco Maldonado, “An integrated growing-pruning method for feedforward network training,” Neurocomputing vol. 71, pp. 2831– 2847, 2008. [25] Univ. of Texas at Arlington, Training Data Files – http://wwwee.uta.edu/eeweb/ip/training_data_files.html. [26] Univ. of California, Irvine, Machine Learning Repository - http://archive.ics.uci.edu/ml/ TABLE II Average 10-fold training and validation error Data Set OWO-BP CG VOLF MOLF LM Twod Etrn 0.22 0.20 0.18 0.17 0.16 Eval 0.25 0.22 0.21 0.20 0.17 Single2 Etrn 0.83 0.63 0.51 0.16 0.03 Eval 1.02 0.99 0.64 0.19 0.11 Power12 Etrn 6959.89 6471.98 5475.01 4177.45 4566.34 Eval 7937.21 7132.13 6712.34 5031.23 5423.54 Concrete Etrn 38.98 19.88 21.81 21.56 16.98 Eval 65.12 56.12 37.23 37.21 33.12 2600