SlideShare a Scribd company logo
1 of 12
Download to read offline
Multiple optimal learning factors for the multi-layer perceptron
Sanjeev S. Malalur, Michael T. Manry n
, Praveen Jesudhas
University of Texas at Arlington, Department of Electrical Engineering, Nedderman Hall, Room 517, Arlington, TX 76019, USA
a r t i c l e i n f o
Article history:
Received 10 April 2010
Received in revised form
14 June 2014
Accepted 21 August 2014
Communicated by: G. Thimm
Available online 11 September 2014
Keywords:
Multilayer perceptron
Newton's method
Hessian
Orthogonal least squares
Multiple optimal learning factor
Whitening transform
a b s t r a c t
A batch training algorithm is developed for a fully connected multi-layer perceptron, with a single
hidden layer, which uses two-stages per iteration. In the first stage, Newton's method is used to find a
vector of optimal learning factors (OLFs), one for each hidden unit, which is used to update the input
weights. Linear equations are solved for output weights in the second stage. Elements of the new
method's Hessian matrix are shown to be weighted sums of elements from the Hessian of the whole
network. The effects of linearly dependent inputs and hidden units on training are analyzed and an
improved version of the batch training algorithm is developed. In several examples, the improved
method performs better than first order training methods like backpropagation and scaled conjugate
gradient, with minimal computational overhead and performs almost as well as Levenberg–Marquardt, a
second order training method, with several orders of magnitude fewer multiplications.
& 2014 Elsevier B.V. All rights reserved.
1. Introduction
Multi-layer perceptron (MLP) neural networks are widely used
for regression and classification applications in the areas of
parameter estimation [1,2], document analysis and recognition
[3], finance and manufacturing [4] and data mining [5]. Due to its
layered, parallel architecture, the MLP has several favorable
properties such as universal approximation [6] and the ability to
mimic Bayes discriminant [7] and maximum a-posteriori (MAP)
estimates [8]. However, actual MLP performance is adversely
affected by the limitations of the available training algorithms.
Common batch training algorithms for the MLP include first
order methods such as backpropagation (BP) [9] and conjugate
gradient [10] and second order learning methods related to New-
ton's method. First order methods generally use fewer operations
per iteration but require more iterations than second order methods.
Newton's method for training the MLP often has non-positive
definite [11,12] or even singular Hessians. Hence Levenberg–Mar-
quardt (LM) [13,14] and other quasi-Newton methods are used
instead. Unfortunately, second order methods do not scale well.
Although first order methods scale better, they are sensitive to input
means and gains [15], since they lack affine invariance. Layer-by-
layer approaches [16] also exist for optimizing the MLP. Such
approaches (i) divide the network weights into disjoint subsets,
and (ii) train the subsets separately. Some network weight sub-
sets have nonsingular Hessians, allowing them to be trained via
Newton's method. For example, solving linear equations for output
weights [15,17–19] is actually Newton's algorithm for the output
weights. The optimal learning factor (OLF) [20] for BP training is
found using a one by one Hessian. If the learning factor and
momentum term gain in BP are found using a two by two Hessian
[21], the resulting algorithm is very similar to conjugate gradient.
The primary purpose of this paper is to present a family of
learning algorithms targeted towards training a fixed architecture,
fully connected multi-layer perceptron with a single hidden layer,
capable of learning from regression/approximation type application
and data. In [22] a two-stage training method was introduced, which
uses Newton's method to obtain a vector of optimal learning factors,
one for each MLP hidden unit. These learning factors are optimal
hence it was named multiple optimal learning factors (MOLF).
A variation to MOLF called the variable optimal learning factors
(VOLF) was presented in [23]. As a learning method, VOLF looks
promising. However the original MOLF was still better in terms of
training performance. In this paper, we explain in detail the motiva-
tion behind the original MOLF algorithm using the concept of
equivalent networks. We analyze the structure of the MOLF Hessian
matrix and demonstrate how linear dependency among inputs and
hidden units can impact the structure, ultimately affecting training
performance. This leads to the development of an improved version
of the MOLF whose performance is not impacted by the presence of
linear dependencies and has an overall performance better than the
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/neucom
Neurocomputing
http://dx.doi.org/10.1016/j.neucom.2014.08.043
0925-2312/& 2014 Elsevier B.V. All rights reserved.
n
Corresponding author. Present address: Department of Electrical Engineering, The
University of Texas at Arlington, Arlington, TX 76019, USA. Tel.: þ1 817 272 3483.
E-mail addresses: Sanjeev.malalur@gmail.com (S.S. Malalur),
manry@uta.edu (M.T. Manry).
Neurocomputing 149 (2015) 1490–1501
original MOLF. The rest of the paper is organized as follows. Section 2
reviews MLP notation, a simple first order training method, and an
expression for the OLF. The MOLF method is presented in Section 3
motivated by a discussion of equivalent networks. Section 4
analyzes the effect of linearly dependent inputs and hidden units
on MOLF learning. An improvement to the MOLF algorithm is
presented in Section 5 that is immune to the presence of linearly
dependent inputs and hidden units. In Section 6, our methods are
compared to various existing first and second order algorithms.
Section 7 summarizes our conclusions.
2. Review of multi-layer perceptron
First, we introduce the notation used in the rest of the paper
and review a simple two-stage algorithm that makes limited use of
Newton's method. Observations from the two-stage method are
used to motivate the MOLF approach.
2.1. MLP notation
A fully connected feed-forward MLP network with a single
hidden layer is shown in Fig. 1. It consists of a layer for inputs, a
single layer of hidden units and a layer of outputs. The input
weights w(k, n) connect the nth input to the kth hidden unit.
Output weights woh(m, k) connect the kth hidden unit's activation
op(k) to the mth output yp(m), which has a linear activation.
The bypass weight woi(m, n) connects the nth input to the mth
output. The training data, described by the set {xp, tp} consists of
N-dimensional input vectors xp and M-dimensional desired output
vectors, tp. The pattern number p varies from 1 to Nv, where Nv
denotes the number of training vectors present in the data set.
In order to handle thresholds in the hidden and output layers, the
input vectors are augmented by an extra element xp(Nþ1) where,
xp(Nþ1)¼1, so xp¼[xp(1), xp(2),…, xp(Nþ1)]T
. Let Nh denote the
number of hidden units. The vector of hidden layer net functions, np
and the output vector of the network, yp can be written as
np ¼ W Uxp;
yp ¼ Wio U xp þ Woh Uop ð1Þ
where the kth element of the hidden unit activation vector op is
calculated as op(k)¼f(np(k)) and f(∙) denotes the hidden layer
activation function such as the sigmoid activation [9], which is the
one used throughout this paper. In this paper, the cost function used
in MLP training is the classic mean squared error between the
desired and the actual network outputs, defined as
E ¼
1
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
½tp mð ÞÀyp mð ÞŠ2
ð2Þ
2.2. Training using output weight optimization-backpropagation
Here, we introduce output weight optimization-backpropagation
[17] (OWO-BP), a first order algorithm that groups the weights in the
network into two subsets and trains them separately. In a given
iteration of OWO-BP, the training proceeds from (i) finding the
output weight matrices, Woh and Woi connected to the network
outputs to (ii) separately training the input weights W using BP.
Output weight optimization (OWO) is a technique for calculating
the network's output weights [18]. From (1) it is clear that our
MLP's output layer activations are linear. Therefore, finding the
weights connected to the outputs is equivalent to solving a system
of linear equations. The expression for the actual outputs given in
(1) can be re-written as
yp ¼ Wo Uxp ð3Þ
where xp ¼[xT
p, oT
p]T
is the augmented input column vector or basis
vector of size Nu where Nu ¼NþNhþ1. The M by Nu output weight
matrix Wo, formed as [Woi: Woh], contains all the output weights.
The output weights can be found by setting ∂E/∂Wo¼0 which
leads to a set of linear equations given by
Cx ¼ Rx UWT
o ð4Þ
where
Cx ¼
1
Nv
∑
Nv
p ¼ 1
xp UtT
p
Rx ¼
1
Nv
∑
Nv
p ¼ 1
xp UxT
p
Eq. (4) comprises of M sets of Nu linear equations in Nu unknowns and
is most easily solved using orthogonal least squares (OLS) [24]. In the
second half of an OWO-BP iteration, the input weight matrix W is
updated as
W’W þzUG ð5Þ
where G is the generic representation of a direction matrix that
contains information about the direction of learning and the learning
factor, z, contains information about the step length to be taken in the
direction G. For backpropagation, the direction matrix is nothing but
the Nh by (Nþ1) negative input weight Jacobian matrix computed as
G ¼
∂E
∂W
G ¼
1
Nv
∑
Nv
p ¼ 1
δp UxT
p ð6Þ
Here δp¼[δp(1), δp(2),…, δp(Nh)]T
is the Nh by 1 column vector of
hidden unit delta functions [9]. A brief outline of training using
OWO-BP is given below. For every training iteration,
1. In the first stage, solve the system of linear equations in (4)
using OLS and update the output weights, Wo;
2. In the second stage, begin by finding the negative Jacobian
matrix G described in Eq. (6);
3. Then, update the input weights, W, using Eq. (5).
This method is attractive for a couple of reasons. First, the
training is faster than BP, since training weights connected to theFig. 1. A fully connected multi-layer perceptron.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1491
outputs is equivalent to solving linear equations [16] and second, it
helps us avoid some local minima.
2.3. Optimal learning factor
The choice of learning factor z in Eq. (5) has a direct effect on
the convergence rate of OWO-BP. Early steepest descent methods
used a fixed constant, with slow convergence, while later methods
used a heuristic scaling approach to modify the learning factor
between iterations and thus speed up the rate of convergence.
However, using Taylor's series for the error E, in Eq. (2), a non-
heuristic optimal learning factor (OLF) for OWO-BP can be derived
[20] as,
z ¼
À∂E=∂z
∂2
E=∂z2





z ¼ 0
ð7Þ
where the numerator and denominator derivatives are evaluated
at z¼0. The expression for the second derivative of the error with
respect to the OLF is found using (2) as,
∂2
E
∂z2
¼ ∑
Nh
k ¼ 1
∑
Nh
j ¼ 1
∑
N þ 1
n ¼ 1
gðk; nÞU ∑
N þ 1
i ¼ 1
∂2
E
∂w k; nð Þ∂w j; ið Þ
g j; ið Þ ¼ ∑
Nh
k ¼ 1
∑
Nh
j ¼ 1
gT
k Hk;j
R gj
ð8Þ
where column vector gk contains elements g(k, n) of G, for all
values of n. HR is the input weight Hessian with Niw rows and
columns, where Niw ¼(Nþ1) Nh is the number of input weights.
Clearly, HR is reduced in size compared to H, the Hessian for all
weights in the entire network. Hk;j
R contains elements of HR for all
input weights connected to the jth and kth hidden units and has
size (Nþ1) by (Nþ1). When Gauss–Newton [12] updates are used,
elements of HR are computed as
∂2
E
∂wðj; iÞ∂w k; nð Þ
¼ 2
Nv
u j; kð Þ ∑
Nv
p ¼ 1
xp ið Þxp nð Þop' jð Þop' kð Þ
u j; kð Þ ¼ ∑
M
m ¼ 1
woh m; jð Þwohðm; kÞ ð9Þ
o0
p(k) denotes the first partial derivative of op(k) with respect to its
net function. Because (9) represents the Gauss–Newton approx-
imation to the Hessian, it is positive semi-definite. Eq. (8) shows
that (i) the OLF can be obtained from elements of the Hessian HR,
(ii) HR contains useful information even when it is singular, and
(iii) a smaller non-singular Hessian ∂2
E/∂z2
can be constructed
using HR.
2.4. Motivation for further work
Even though the OWO stage of OWO-BP is equivalent to
Newton's algorithm for the output weights, the overall algorithm
is only marginally better than standard BP, and so it is rarely used.
This occurs because the BP stage is first order. The one by one
Hessian used in the OLF calculations is the only one associated
with the input weight matrix W. Therefore, improving the perfor-
mance of OWO-BP may require that we develop a larger Hessian
matrix associated with W.
3. Multiple optimal learning factor algorithm
In this section, we attempt to improve input weight training
by deriving a vector of optimal learning factors, z, which contains
one element for each hidden unit. The resulting improvement to
OWO-BP is called the multiple optimal learning factors (MOLF)
training method. We begin by first introducing the concept of
equivalent networks and identify the conditions under which two
networks can generate the same outputs.
3.1. Discussion of equivalent networks
Let MLP 1 have a net function vector np and output vector yp as
discussed previously in Section 2.1. In a second network called
MLP 2, which is trained similarly, the net function vector and the
output vector are respectively denoted by np' and yp'. The bypass
and output weight matrices along with the hidden unit activation
functions are considered to be equal for both the MLP 1 and MLP 2.
Based on this information the output vectors for MLP 1 and MLP
2 are respectively,
yp ið Þ ¼ ∑
N þ1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf ðnp kð ÞÞ ð10Þ
yp' ið Þ ¼ ∑
N þ 1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf np' kð Þ
À Á
ð11Þ
In order to make MLP 2 strongly equivalent to MLP 1 their
respective output vectors yp and yp' have to be equal. Based on
Eqs. (10) and (11) this can be achieved only when their respective
net function vectors are equal [25]. MLP 2 can be made strongly
equivalent to MLP 1 by linearly transforming its net function
vector np' before passing it as an argument to the activation
function f(). This process is equivalent to directly using the net
function np in MLP 2, denoted as,
np ¼ C Unp' ð12Þ
The linear transformation matrix C used to transform np' in
Eq. (12) is of size Nh-by-Nh. This process of transforming the net
function vector of MLP 2 is represented in Fig. 2.
On applying the linear transformation to the net function
vector n0
p in Eq. (12), MLP 2 is made strongly equivalent to
MLP1. We then get,
yp' ið Þ ¼ yp ið Þ ¼ ∑
N þ1
n ¼ 1
woi i; nð Þxp nð Þþ ∑
Nh
k ¼ 1
woh i; kð Þf ∑
Nh
m ¼ 1
c k; mð Þnp' mð Þ
!
ð13Þ
After making MLP 2 strongly equivalent to MLP 1, the net function
vector np can be related to its input weights W and the net
function vector np' as,
np kð Þ ¼ ∑
N þ1
n ¼ 1
w k; nð Þxp nð Þ ¼ ∑
Nh
m ¼ 1
c k; mð Þnp' ðmÞ ð14Þ
Similarly the MLP 2 net function vector np' can be related to its
weights W0
as
np' mð Þ ¼ ∑
N þ 1
n ¼ 1
w' m; nð Þxp nð Þ ð15Þ
Fig. 2. Graphical representation of net function vectors.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011492
On substituting (15) into (14) we get,
npðkÞ ¼ ∑
N þ 1
n ¼ 1
wðk; nÞxpðnÞ ¼ ∑
N þ 1
n ¼ 1
∑
Nh
m ¼ 1
cðk; mÞUw'ðm; nÞxpðnÞ ð16Þ
In (16) we see that the input weight matrices of MLPs 1 and 2 are
linearly related as
W ¼ C UW' ð17Þ
If the elements of the C matrix are found through an optimality
criterion, then optimal input weights W0
can be computed as
W' ¼ C À 1
UW ð18Þ
3.2. Transformation of net function vectors
In Section 3.1 we showed a method for generating a network,
MLP 2, which is strongly equivalent to the original network, MLP 1.
In this section, we show that equivalent networks train differently.
The output of the transformed network is given in Eq. (14).
The elements of the negative Jacobian matrix G' of MLP 2 found
from Eqs. (2) and (14) are defined as,
g' u; vð Þ ¼ À
∂E
∂w' u; vð Þ
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
tp ið ÞÀyp ið Þ
h i
U
∂yp ið Þ
∂w' u; vð Þ
∂yp ið Þ
∂w' u; vð Þ
¼ ∑
Nh
k ¼ 1
woh i; kð Þop' kð Þc k; uð ÞxpðvÞ ð19Þ
o0
p(k) denotes the first partial derivative of op(k). Rearranging
terms in (19) results in,
g'ðu; vÞ ¼ ∑
Nh
k ¼ 1
cðk; uÞ
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
½tpðiÞÀypðiÞŠU wohði; kÞo'pðkÞxpðvÞ
¼ ∑
Nh
k ¼ 1
cðk; uÞ
˶E
∂wðk; vÞ
ð20Þ
This is abbreviated as
G' ¼ CT
UG ð21Þ
The input weight update equation for MLP 2 based on its Jacobian
vector G0
is given by,
W' ¼ W' þzUG'
On pre-multiplying this equation by C and using Eqs. (18) and (21)
we obtain
W ¼ W þzUG'' ð22Þ
where,
G'' ¼ RUG
R ¼ C UCT
ð23Þ
Lemma 1. For a given R matrix in Eq. (23), the number of C matrices
is infinite.
Continuing, the transformed Jacobian G'' of MLP 1 is given in
terms of the original Jacobian G as,
g''ðu; vÞ ¼ ∑
Nh
k ¼ 1
rðu; kÞgðk; vÞ
gðk; vÞ ¼
˶E
∂wðk; vÞ
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
i ¼ 1
½tpðiÞÀypðiÞŠUwohði; kÞo0
pðkÞxpðvÞ ð24Þ
where o'p(k) denotes the derivative of op(k) with respect to its net
function. Eqs. (22) and (23) show that valid input weight update
matrices G'' can be generated by simply transforming the weight
update matrix G from MLP 1, using an autocorrelation matrix R.
Since uncountably many matrices R exist, it is natural to wonder
which one is optimal.
3.3. Multiple optimal learning factors for input weights
In this subsection, we explore the idea of finding an optimal
transform matrix R for the negative Jacobian G. However, finding a
dense R matrix would result in an algorithm with almost the
computational complexity of Levenberg–Marquardt (LM). We
avoid this by letting the R matrix in Eq. (23) be diagonal. Under
this case Eq. (22) becomes,
w k; nð Þ ¼ w k; nð ÞþzUr kð Þg k; nð Þ ð25Þ
On comparing Eqs. (22) and (25), the expression for the different
optimal learning factors zk could be given as,
zk ¼ zUr kð Þ ð26Þ
Eq. (26) shows that using the MOLF approach for training a MLP is
equivalent to transforming the net function vector using a diag-
onal transform matrix. Next, we use Newton's method develop our
approach for finding z.
Assume that an MLP is being trained using OWO-BP. Also
assume that a separate OLF zk is being used to update each hidden
unit's input weights, w(k, n), where 1rnr(Nþ1). The error
function to be minimized is given by (2). The predicted output
yp(m) is given by,
ypðmÞ ¼ ∑
N þ 1
n ¼ 1
woiðm; nÞxpðnÞ
þ ∑
Nh
k ¼ 1
wohðm; kÞf ∑
N þ 1
n ¼ 1
ðwðk; nÞþzk Á gðk; nÞÞxpðnÞ
 
where, g(k, n) is an element of the negative Jacobian matrix G and
f() denotes the hidden layer activation function. The negative first
partial of E with respect to zj is
gmolf jð Þ ¼ À
∂E
∂zj
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
tp mð ÞÀ ∑
Nh
k ¼ 1
woh m; kð Þop zkð Þ
 #
Uwoh m; jð Þop' jð ÞΔnpðjÞ
ð27Þ
where,
tp mð Þ ¼ tp mð ÞÀ ∑
N þ 1
n ¼ 1
woi m; nð Þxp nð Þ;
Δnp jð Þ ¼ ∑
N þ 1
n ¼ 1
xp nð ÞUg j; nð Þ
op zkð Þ ¼ f ∑
N þ 1
n ¼ 1
ðw k; nð Þþzk Ug k; nð ÞÞxp nð Þ
 
The elements of the Gauss–Newton Hessian Hmolf are
hmolf l; jð Þ %
∂2
E
∂zl∂zj
¼
2
Nv
∑
M
m ¼ 1
woh m; lð Þwoh m; jð Þ ∑
Nv
p ¼ 1
op' lð Þop' jð ÞΔnp lð ÞΔnp jð Þ
¼ ∑
N þ 1
i ¼ 1
∑
N þ 1
n ¼ 1
2
Nv
u l; jð Þ ∑
Nv
p ¼ 1
xp ið Þxp nð Þop' lð Þop' jð Þ
 #
g l; ið ÞUg j; nð Þ ð28Þ
where u(l, j) is defined in (9). The Gauss–Newton update guaran-
tees that Hmolf is non-negative definite. Given Hmolf and the
negative gradient vector gmolf, defined as
gmolf ¼ ½À∂E=∂z1; À∂E=∂z2; …; À∂E=∂zNh
ŠT
We minimize E with respect to the vector z using Newton's
method. If Hmolf and gmolf are the Hessian and negative gradient,
respectively, of the error with respect to z, then the multiple
optimal learning factors are computed by solving
Hmolf Uz ¼ gmolf ð29Þ
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1493
A brief outline of training using the MOLF algorithm is given
below. In each iteration of the training algorithm:
1. Solve linear equations for all output weights using (4)
2. Calculate the negative input weight Jacobian G using BP as
given by (6)
3. Calculate z using Newton's method as shown in (29) and
update the input weights as
w k; nð Þ’w k; nð Þþzk Ug k; nð Þ ð30Þ
Here, the MOLF procedure has been inserted into the OWO-BP
algorithm, resulting in an algorithm we denote as MOLF-BP.
The MOLF procedure can be inserted into other algorithms as well.
3.4. Analysis of the MOLF Hessian
This subsection presents an analysis of the structure of the
MOLF Hessian and its relation to the reduced Hessian HR of the
input weights. Comparing (28) to (9), we see that the term within
the square brackets is an element of HR. Hence,
hmolf l; jð Þ ¼
∂2
E
∂zl∂zj
¼ ∑
N þ 1
i ¼ 1
∑
N þ1
n ¼ 1
∂2
E
∂w l; ið Þ∂w j; nð Þ
g l; ið Þg j; nð Þ ð31Þ
For fixed (l, j), hmolf(l, j) can be expressed as,
∂2
E
∂zl∂zj
¼ ∑
N þ1
i ¼ 1
gl ið Þ ∑
N þ 1
n ¼ 1
h
l;j
R i; nð ÞUgj nð Þ ð32Þ
In matrix notation,
Hmolf ¼ gT
l Hl;j
R gj ð33Þ
where column vector gl
contains G elements g(l, n) for all values of
n, where the (Nþ1) by (Nþ1) matrix HR
l, j
contains elements of HR
for weights connected to the lth and jth hidden units. The reduced
size Hessian HR
l, j
in (33) uses four indices (l, i, j, n) and can be
viewed as a 4-dimensional array, represented by H4
R whose
elements satisfy
h
l;j
R ði; nÞARðN þ 1ÞxNhxNhxðN þ 1Þ
Now, (33) can be rewritten as
H4
molf ¼ GH4
RGT
ð34Þ
From (32), we see that hmolf(l, j)¼h4
molf(l, j, l, j), i.e., the
4-dimensional matrix H4
molf is transformed into the 2-dimensional
matrix Hmolf, by setting m¼l and u¼j. Each element of the MOLF
Hessian combines the information from (Nþ1) rows and columns of
the reduced Hessian, HR
l, j
. This makes MOLF less sensitive to input
conditions. Note the similarities between (8) and (33).
4. Effect of dependence on the MOLF algorithm
In this section, we analyze the performance of MOLF applied to
OWO-BP, in the presence of linear dependence.
4.1. Dependence in the input layer
A linearly dependent input can be modeled as a linear combi-
nation of any or all inputs as
xp Nþ2ð Þ ¼ ∑
N þ 1
n ¼ 1
b nð Þxp nð Þ ð35Þ
During OWO, the weights from the dependent input, feeding the
outputs will be set to zero and the output weight adaptation
will not be affected. During the input weight adaptation, the
expression for gradient given by (27) can be re-written as,
gmolf jð Þ ¼ À
∂E
∂zj
¼
2
Nv
∑
Nv
p ¼ 1
∑
M
m ¼ 1
tp mð ÞÀ ∑
Nh
k ¼ 1
woh m; kð Þop zkð Þ
 #
Uwoh m; jð Þop' jð Þ Δnp jð Þþg k; Nþ2ð Þ ∑
N þ 1
n ¼ 1
b nð Þxp nð Þ
!
ð36Þ
and the expression for an element of the Hessian in (28) can be
re-written as
∂2
E
∂zl∂zj
¼
2
Nv
∑
M
m ¼ 1
woh m; lð Þwoh m; jð Þ ∑
Nv
p ¼ 1
op' lð Þop' jð Þ
 Δnp lð ÞΔnp jð Þþ Δnp lð Þg j; Nþ2ð Þ
 ∑
N þ 1
n ¼ 1
b nð Þxp nð ÞþΔnp jð Þg l; Nþ2ð Þ ∑
N þ 1
i ¼ 1
b ið Þxp ið Þ
þg j; Nþ2ð Þg l; Nþ2ð Þ ∑
N þ1
n ¼ 1
∑
N þ1
i ¼ 1
b ið Þxp ið Þb nð Þxp nð Þ
#
ð37Þ
Let H0
molf be the Hessian, when the extra dependent input xp(Nþ2)
is included. Then,
hmolf' l; jð Þ ¼ hmolf l; jð Þþ
2
Nv
u l; jð Þ ∑
Nv
p ¼ 1
xp Nþ2ð Þop' lð Þop' jð Þ
 Ã
U g l; Nþ2ð Þ ∑
N þ 1
n ¼ 1
xp nð Þg j; nð Þþg j; Nþ2ð Þ ∑
N þ 1
i ¼ 1
xp ið Þg l; ið Þ

þxp Nþ2ð Þg l; Nþ2ð Þg j; Nþ2ð Þ
#
ð38Þ
Comparing (27) with (36) and (28) with (37) and (38), we see
some additional terms that appear within the square brackets in
the expressions for gradient and Hessian in the presence of
linearly dependent input. Clearly, these parasitic cross-terms will
cause the training using MOLF to be different for the case of
linearly dependent inputs. This leads to the following lemma.
Lemma 2. Linearly dependent inputs, when added to the network,
do not force H'molf to be singular.
As seen in (37) and (38), each element h0
molf (m, j) simply gains
some first and second degree terms in the variables b(n).
4.2. Dependence in the hidden layer
Here we look at how a dependent hidden unit affects the
performance of MOLF. Assume that some hidden unit activations
are linearly dependent upon others, as
opðNh þ1Þ ¼ ∑
Nh
k ¼ 1
c kð Þop kð Þ ð39Þ
Further assume that OLS is used to solve for output weights.
The weights in the network are updated during every training
iteration and it is quite possible that this could cause some hidden
units to be linearly dependent upon each other. The dependence
could manifest in hidden units being identical or a linear combi-
nation of other hidden unit outputs.
Consider one of the hidden units to be dependent. The autocorre-
lation matrix, Rx, will be singular and since we are using OLS to solve
for output weights, all the weights connecting the dependent hidden
unit to the outputs will be forced to zero. This will ensure that the
dependent hidden unit does not contribute to the output. To see if this
has any impact on learning in MOLF we can look at the expression for
the gradient and Hessian, given by (27) and (28) respectively. Both
equations have a sum of product terms containing output weights.
Since OWO sets the output weights for the dependent hidden unit to
be zero, this will also set the corresponding gradient and Hessian
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011494
elements to be zero. In general, any dependence in the hidden layer
will cause the corresponding learning factor to be zero and will not
affect the performance of MOLF.
Lemma 3. For each dependent hidden unit, the corresponding row
and column of Hmolf is zero-valued.
This follows from (28). The zero-valued rows and columns in
Hmolf cause no difficulties if (16) is solved using OLS [26].
5. Improving the MOLF algorithm
Section 4.1 showed that the presence of linearly dependent inputs
affect MOLF training, primarily the input weight adaptation. Depen-
dent inputs manifest themselves as additional rows and columns in
the input weight Jacobian matrix G. If we can find a way to set these
additional rows and columns to zero, then we can ensure that the
weights connected to the dependent inputs will not get updated and
the training will be unaffected by the presence of dependent inputs.
In this section, we present a way to mitigate the effect of linearly
dependent inputs on training input weights.
Without loss of generality, let a linearly dependent input
xp(Nþ2) be modeled as in (35), where the first (Nþ1) inputs are
linearly independent. Using the singular value decomposition
(SVD), the input autocorrelation matrix R of dimension (Nþ2) by
(Nþ2) is written as UΣUT
. This can be rewritten as R¼AT
A, where
the transformation matrix A is defined as
A ¼ ΣÀ1=2
UT
Both the last row and the last column of A are zero. Now, when the
input vector with the dependent inputs is transformed as,
x0
p ¼ A Á xp ð40Þ
the last element, x0
p(Nþ2) will be zero. This type of transformation
is termed as whitening transformation [27] in linear algebra and
statistics literature. Now, let us investigate the effect of dependent
inputs on adjusting input weights using MOLF, when whitening
transform is incorporated into training. We need to verify that the
transformed input vector x0
p, with zero-values for the dependent
inputs, causes no problems. The Nh by (Nþ2) gradient matrix G0
for the transformed data is
G' ¼ G Á AT
ð41Þ
where G is the negative Jacobian matrix before the inputs are
transformed. Although the last column of G is dependent upon the
other columns, the whitening transform causes the last column of
G0
to be zero. The fact that x'p(Nþ2) and g(k, Nþ2) are zero causes
(36) to revert to (27) and H0
molf in (38) to equal Hmolf. When OLS is
used to find z and then the output weights, xp(Nþ2) has no effect,
and weights connected to it are not trained.
Eqs. (40) and (41) now provide us with a way to remove line-
arly dependent inputs during training, so their presence has no
effect. This leads us to develop an improved version of MOLF.
The transformation applied to the Jacobian matrix is a whitening
transformation (WT). Whitening transformation has two desirable
properties: (i) decorrelation, which causes the transformed matrix to
have a diagonal covariance matrix and (ii) whitening, which forces
the transformed matrix to be a unit covariance (identity) matrix. We
call the improved algorithm MOLF-WT. The steps to training an MLP
with MOLF-WT are listed below:
1. Compute the input autocorrelation matrix R as given by (4)
2. Use the SVD to compute the whitening transformation matrix A
as described above
3. In each iteration of training
a. Calculate the negative input weight Jacobian G using BP
as given by (6)
b. Calculate the whitening transformed Jacobian G0
as described
in (41) to eliminate any linear dependency in the inputs
c. Calculate z using Newton's method as before, using (29) and
update the input weights as described in (30), except that
G is replaced by G0
d. Solve linear equations for all output weights as before
using (4)
6. Performance comparison and results
This section compares the performance of MOLF-WT, with
several popular existing feed-forward batch learning algorithms.
We begin by listing the various algorithms being compared.
We then identify the metrics used for comparing the performances
of the various algorithms, provide details of the experimental
procedure along with a description of the data files used for
obtaining the numerical results.
6.1. List of competing algorithms
The performance of MOLF-WT is compared with four other
existing methods, namely:
1. Output weight optimization-backpropagation (OWO-BP)
2. Multiple Learning Factor applied to OWO-BP (MOLF-BP)
3. Levenberg–Marquardt (LM)
4. Scaled conjugate gradient (SCG) [28]
All algorithms employ the fully connected feed-forward net-
work architecture described in Fig. 1. The hidden units have
sigmoid activation functions, while the output units all have linear
activations. SCG and LM employ one-stage learning where all
weights in the network are updated simultaneously in each
iteration. OWO-BP, MOLF-BP, and MOLF-WT employ two-stage
learning where we first update the input weights and subse-
quently solve linear equations for the output weights. A single OLF
was used for training in the OWO-BP and LM algorithms, while
multiple OLFs were used for MOLF-BP and MOLF-WT.
6.2. Experimental procedure
Primarily, we make use of the k-fold methodology for our experi-
ments (k¼10 for our simulations). Each data set is split into k non-
overlapping parts of equal size. Of these, k-2 parts (roughly 80%) are
combined and used for training, while each of the remaining two parts
is used for validation and testing respectively (roughly 10% each).
Validation is performed per training iteration (to prevent over train-
ing) and the network with the minimum validation error is saved.
After the training is completed, the network with the minimum
validation error is used with the testing data to obtain the final testing
error, which measures the networks ability to generalize. The max-
imum number of iteration per training session is fixed at 2500 and is
the same for all algorithms. There is also an early stopping criterion.
Training is repeated till all k combinations have been exhausted (i.e.,10
times).
It is known that the initial network plays a role in the networks
ability to learn. In order to make our performance comparisons
independent of the initial network, we carry out the k-fold
methodology 5 times, each time with a different initial network.
This leads to a total of 50 different training sessions per algorithm,
with 50 different initial networks being used. We call this the
5k-fold method. For a given data set, even though the initial weight
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1495
is different in each training session, we use the same 50 initial
weight sets across all algorithms. This is done to allow for a fair
comparison between the various training methods.
We use net control [19] to adjust the weights before training
begins. In net control, the randomly initialized input weights and
hidden unit thresholds are scaled and biased so that each hidden
unit's net function has the same mean and standard deviation.
Specifically, the net function means are equal to 0.5 and the corre-
sponding standard deviations are equal to 1. In addition, OWO is used
to find the initial output weights. This means that all the algorithms
listed above have the same starting point (a different one for each
session) for all data files. This will be evident in the plots presented
later in this section, where we can see that the MSE for the first
iteration for all algorithms is the same.
At the end of the 5k-fold procedure, we compute (i) the average
training error, (ii) the average minimum validation error, (iii) the
average testing error, and (iv) the average number of cumulative
multiplies required for training, for each algorithm and for each
data file. The average is computed over 50 sessions. These
quantities form the metrics for comparison and are subsequently
used to generate the plots and tables to compare the performance
of different learning algorithms.
6.3. Computational cost
The proposed MOLF algorithm involves inversion a Hessian
matrix, which has a general complexity of O(N3
), for a square
matrix of size N. However, compared to Newton's method or LM,
the size of the Hessian is much smaller. Updating weights using
Newton's method or LM, requires a Hessian with Nw rows and
columns, where Nw ¼M Á Nu þ(Nþ1)Nh and Nu ¼NþNh þ1. In com-
parison, the Hessian used in the proposed MOLF has only Nh rows
and columns. The number of multiplies required to solve for
output weights using OLS is given by
Mols ¼ Nu Nu þ1ð Þ Mþ
1
6
Nu 2Nu þ1ð Þþ
3
2
!
ð42Þ
The numbers of multiplies per training iteration for OWO-BP, LM
and MOLF are given below
Mowo Àbp ¼ Nv½ 2Nh Nþ2ð ÞþMðNu þ1Þ þðNu Nu þ1ð ÞÞ=2þM Nþ6Nh þ4ð ÞŠ
þMols þNh Nþ1ð Þ ð43Þ
Mlm ¼ Nv½MNu þ2Nh Nþ1ð ÞþM Nþ6Nh þ4ð ÞþMNu Nu þ3Nh Nþ1ð Þð Þ
þ4N2
h Nþ1ð Þ2
ŠþN3
w þN2
w ð44Þ
Mscg ¼ 4Nv Nh Nþ1ð ÞþMNu½ Šþ10 Nh Nþ1ð ÞþMNu½ Š ð45Þ
Mmolf ¼ MowoÀbp þNv Nh Nþ4ð ÞÀM Nþ6Nh þ4ð Þ½ ŠþN3
h ð46Þ
Note that Mmolf consists of Mowo-bp plus the required multiplies for
calculating optimal learning factors. Similarly, Mlm consists of Mbp
plus the required multiplies for calculating and inverting the
Hessian matrix. We will see later in the plots that despite the
additional complexity of Hessian inversion, the number of cumu-
lative multiplies to train a network using MOLF is almost the same
as OWO-BP or SCG algorithms, which do not involve any matrix
inversion operation. We will also see that LM on the other hand
has a far greater overhead due the inversion of a much larger
Hessian matrix.
6.4. Training data sets
Table 1 lists seven data sets used for simulations and perfor-
mance comparisons.
All the data sets are publicly available. In all data sets, the
inputs have been normalized to be zero-mean and unit variance.
This way, it becomes clear that MOLF is not equivalent to a mere
normalization of the data.
The number of hidden units for each data set is selected by
first training a multi-layer perceptron with a large number of
hidden units followed by a step-wise pruning with validation [29].
The least useful hidden unit is removed at each step till there is only
one hidden unit left. The number of hidden units corresponding to the
smallest validation error is used for training on that data set.
6.5. Results
Numerical results obtained from our training, validation and
testing methodology for the proposed MOLF-WT algorithm are
compared with those of OWO-BP, MOLF-BP, LM and SCG, using the
data files listed in Table 1. Two plots are generated for each data
file: (i) the average mean square error (MSE) for training from
10-fold training is plotted versus the number of iterations (shown
on a log10 scale) and (ii) the average training MSE from 10-fold
training is plotted versus the cumulative number of multiplies
(also shown on a log10 scale) for each algorithm. Together, the two
plots will not only show the final average MSE achieved by
training but also the computations it consumed for getting there.
6.5.1. Prognostics data set
This data file is available at the Image Processing and Neural
Networks Lab repository [30]. It consists of parameters that are
available in the Bell Helicopter health and usage monitoring
system (HUMS), which performs flight load synthesis, which is a
form of prognostics [31]. The data set containing 17 input features
and 9 output parameters.
We trained an MLP having 13 hidden units. In Fig. 3-a, the
average mean square error (MSE) for 10-fold training is plotted
versus the number of iterations for each algorithm. In Fig. 3-b, the
average 10-fold training MSE is plotted versus the required number
of multiplies (shown on a log10 scale). From Fig. 3-a, LM and MOLF
based algorithms have almost identical performance (LM is margin-
ally better than MOLF-WT which in turn is marginally better than
MOLF-BP). However, the performance of LM comes with a signifi-
cantly higher computational demand, as shown in Fig. 3-b. Also,
MOLF based algorithms are better than OWO-BP and SCG.
6.5.2. Federal reserve economic data set
This file contains some economic data for the USA from 01/04/
1980 to 02/04/2000 on a weekly basis. From the given features, the
goal is to predict the 1-Month CD Rate [32]. For this data file, called
TR on its webpage [33], we trained an MLP having 34 hidden units.
From Fig. 4-a, LM has the best overall performance, followed
closely by MOLF-WT. The MOLF-WT algorithm is better than
MOLF-BP. Fig. 4-b shows the computational cost of achieving this
Table 1
Data set description.
Data set name No. of
inputs
No. of
outputs
No. of
patterns
No. of Hidden
Units
Prognostics 17 9 4745 13
Federal reserve 15 1 1049 34
Housing 16 1 22,784 30
Concrete 8 1 1030 13
Remote sensing 16 3 5992 25
White wine
quality
11 1 4898 24
Parkinson's 16 2 5875 17
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011496
performance. The proposed MOLF algorithms consume slightly
more computation than OWO-BP and SCG. However, all the first
order algorithms utilized about two orders of magnitude fewer
computations than LM.
6.5.3. Housing data set
This data file is available on the DELVE data set repository [34].
The data was collected as part of the 1990 US census. For the
purpose of this data set a level State-Place was used. Data from all
states were obtained. Most of the counts were changed into
appropriate proportions. These are all concerned with predicting
the median price of houses in a region based on demographic
composition and the state of the housing market in the region.
House-16H data was used in our simulation, with ‘H’ standing for
high difficulty. Tasks with high difficulty have had their attributes
chosen to make the modeling more difficult due to higher variance
or lower correlation of the inputs to the target. We trained an MLP
having 30 hidden units. From Figs. 5-a and -b, we see that the LM
had the best performance, followed closely by SCG. The proposed
MOLF-WT algorithm has a slight edge over MOLF-BP.
6.5.4. Concrete compressive strength data set
This data file is available on the UCI Machine Learning Repo-
sitory [35]. It contains the actual concrete compressive strength
(MPa) for a given mixture under a specific age (days) determined
from laboratory. The concrete compressive strength is a highly
nonlinear function of age and ingredients. We trained an MLP
having 13 hidden units. For this data set, Fig. 6 shows that LM has
the best overall average training error, followed very closely by the
improved MOLF-WT, which is marginally better than MOLF-BP.
SCG attains a performance almost identical to MOLF-WT.
6.5.5. Remote sensing data set
This data file is available on the Image Processing and Neural
Networks Lab repository [30]. It consists of 16 inputs and 3 outputs
and represents the training set for inversion of surface permittiv-
ity, the normalized surface rms roughness, and the surface
correlation length found in back scattering models from randomly
rough dielectric surfaces [36]. We trained an MLP having 25
hidden units. From Fig. 7-a, the MOLF-WT has the best overall
average training MSE.
Fig. 4. Federal reserve data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 3. Prognostics data set: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1497
6.5.6. White wine quality data set
The Wine Quality-White data set [37] was created using white
wine samples. The inputs include objective tests (e.g. PH values)
and the output is based on sensory data (median of at least
3 evaluations made by wine experts). Each expert graded the wine
quality between 0 (very bad) and 10 (very excellent). This data set
is available on the UCI data repository [35]. For this data file, the
number of hidden units was chosen to be 24. For this data set,
Fig. 8 shows that LM and the improved MOLF-WT have almost
identical performance and are better than the rest.
6.5.7. Parkinson's disease data set
The Parkinson's data set [38] is available on the UCI data repository
[35]. It is composed of a range of biomedical voice measurements
from 42 people with early-stage Parkinson's disease recruited to a
six-month trial of a telemonitoring device for remote symptom
progression monitoring. The main aim of the data is to predict the
motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’)
from the 16 voice measures. For this data set the number of hidden
units was chosen to be 17. From Fig. 9, LM attains the best overall
training error, followed by the improved MOLF-WT. We can see again
that the improved MOLF-WT performs better than MOLF-BP.
Table 2 is a compilation of the metrics used for comparing the
performance of all algorithms on all data sets. It comprises of the
average minimum testing and validation errors. Based on the results
obtained on the selected data sets, we can make a few observations.
From the training plots and the data from Table 2, we can say that
1. LM is the top performer in 4 out of the 7 data sets. However, LM
being a second order method, this performance comes at a
significant cost of computation – almost two orders of magni-
tude greater than the rest;
2. Both SCG and OWO-BP consistently appear in the last two
places in terms of average training error on 5 out of 7 data sets
(Housing and Concrete data sets being the exception). How-
ever, being first order methods, they set the bar for the lowest
cost of computation;
3. Both MOLF-BP and MOLF-WT always perform better than their
common predecessor, OWO-BP, on all data sets and they are
Fig. 5. Housing data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 6. Concrete data: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011498
better than SCG except for the housing data set. This perfor-
mance is achieved with minimal computational overhead
compared to both SCG and OWO-BP, as evident in the plots of
error vs. number of cumulative multiplications;
4. MOLF-BP is almost never better than LM, except for the remote
sensing data set where it is marginally better;
5. MOLF-WT always performs better than its immediate prede-
cessor MOLF-BP (although MOLF-BP comes very close on 3 of
the 7 data sets);
6. MOLF-WT achieves a better performance than LM on one data
set (remote sensing) and has almost identical performance on
two other data sets (wine and prognostics). It is also a close
second on 3 of the remaining 4 data sets. The same cannot be
said of MOLF-BP;
7. LM is again consistently the top performer with the best
average testing error on 6 out 7 data sets, while MOLF based
algorithms figure consistently in the top two performers (note
that the network with the smallest validation error is used to
obtain the testing error).
In general, we can see that
1. Both MOLF-BP and the improved MOLF-WT algorithms con-
sistently outperform OWO-BP in all three phases of learning
(i.e. MOLF inserted into OWO-BP does help improve its perfor-
mance in training, validation and testing);
2. Both MOLF-BP and the improved MOLF-WT algorithms con-
sistently outperform SCG in terms of the average minimum
testing error;
3. The MOLF-WT very often has a performance comparable to LM
and it attains that with minimal computational overhead;
4. While MOLF-BP is an improvement over OWO-BP, it is not as
good as MOLF-WT.
7. Conclusions
Starting with the concept of equivalent networks and its
implications, a class of hybrid learning algorithms is presented
Fig. 7. Remote sensing data: average error vs. (a) iterations and (b) multiplies per iteration.
Fig. 8. Wine quality data: average error vs. (a) iterations and (b) multiplies per iteration.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1499
that uses first and second order information to simultaneously
calculate optimal learning factors for all hidden units and train the
input weights. Detailed analysis of the structure of the Hessian
matrix for the proposed algorithm in relation to the full Newton
Hessian matrix is presented, showing how the smaller MOLF
Hessian could potentially be less ill-conditioned. Elements of the
reduced size Hessian are shown to be weighted sums of the
elements of the entire network's Hessian. Limitations of the basic
MOLF procedure in the presence of linearly dependent inputs is
shown mathematically. The limitations are used as a launch pad to
develop a version of MOLF that incorporates a whitening trans-
formation type operation to mitigate the effect of linearly depen-
dent inputs. This version is called MOLF-WT.
A detailed experimental procedure is presented and the perfor-
mance of MOLF-BP and MOLF-WT are compared to 3 other existing
feed-forward batch learning algorithms on seven publicly available
data sets. For the data sets investigated, both MOLF-BP and MOLF-WT
are about as computationally efficient per iteration as OWO-BP and
SCG. MOLF-BP consistently performs better than OWO-BP. In effect
the MOLF procedure improves OWO-BP by using an Nh by Nh Hessian
in place of the normal 1 by 1 Hessian used to calculate a scalar OLF.
MOLF-WT has a performance comparable and in some cases better
than that of LM. From the results, MOLF-WT stays close to the
computational cost of OWO-BP and SCG to achieve a performance
comparable with LM and may be a viable alternative to LM in terms of
efficiency of training, minimum validation error and testing errors.
Since they use smaller Hessian matrices, the MOLF methods
presented in this paper require fewer multiplies per iteration than
traditional second order methods based on Newton's method.
Finally, it is clear from the figures that MOLF-BP and MOLF-WT
decrease the training error much faster than the other algorithms.
In future work, the MOLF approach will be applied to other
training algorithms and to networks having additional hidden layers.
An alternate approach, in which each network layer has its own
optimal learning factor, will also be tried. In addition, the MOLF
algorithm will be extended to handle classification type applications.
References
[1] R.C. Odom, P. Pavlakos, S.S. Diocee, S.M. Bailey, D.M. Zander, J.J. Gillespie, Shaly
sand analysis using density-neutron porosities from a cased-hole pulsed
neutron system, SPE Rocky Mountain Regional Meeting Proceedings: Society
of Petroleum Engineers, 1999, pp. 467–476.
[2] A. Khotanzad, M.H. Davis, A. Abaye, D.J. Maratukulam, An artificial neural
network hourly temperature forecaster with applications in load forecasting,
IEEE Trans. Power Syst. 11 (2) (1996) 870–876.
[3] S. Marinai, M. Gori, G. Soda, Artificial neural networks for document analysis
and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 23–35.
[4] J. Kamruzzaman, R.A. Sarker, R. Begg, Artificial Neural Networks: Applications
in Finance and Manufacturing, Idea Group Inc (IGI), Hershey, PA, 2006.
Fig. 9. Parkinson's disease data: average error vs. (a) iterations and (b) multiplies per iteration.
Table 2
Statistics of testing and validation errors.
Comparison metric Data file name OWO-BP SCG MOLF-BP MOLF-WT LM
Average minimum
testing error
Prognostics 2.4187E7 2.6845E7 1.5985E7 1.4973E7 1.4509E7
Federal reserve 0.108636974 0.33502068 0.105951897 0.120668845 0.103284696
Housing 1.7329E9 1.7441E9 1.2758E9 1.3599E9 1.1627E9
Concrete 33.48613532 83.78832466 32.57090477 33.72180127 29.73279111
Remote sensing 1.568023633 1.51843944 2.189078998 0.136484007 0.637828985
White wine 0.513739373 1.25867336 0.507925435 0.514729807 0.499294581
Parkinson's 139.0323995 125.5411318 129.4456136 127.3415428 122.2485104
Average minimum
validation error
Prognostics 2.4764E7 2.6675E7 1.5999E7 1.5155E7 1.4329E7
Federal reserve 0.108442923 0.17653648 0.09675038 0.10759092 0.09371866
Housing 1.3512E9 1.1683E9 1.2414E9 1.2109E9 1.0662E9
Concrete 33.43842214 48.63268454 29.52707558 28.72108048 26.4212992
Remote sensing 1.534430917 0.71842796 0.57845174 0.15130356 0.37738708
White wine 0.503644249 0.50386288 0.49644024 0.49550544 0.49058676
Parkinson's 131.2226476 124.3035235 124.0881855 122.3187088 117.3367678
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011500
[5] L. Wang, X. Fu, Data Mining With Computational Intelligence, Springer-Verlag,
New York, Inc, Seacaucus, NJ, USA, 2005.
[6] G. Cybenko, Approximations by superposition of a sigmoidal function, Math.
Control. Signals Syst. (MCSS) 2 (1989) 303–314.
[7] D.W. Ruck, et al., The multi-layer perceptron as an approximation to a Bayes
optimal discriminant function, IEEE Trans. Neural Netw. 1 (4) (1990) 296–298.
[8] Q. Yu, S.J. Apollo, M.T. Manry, MAP estimation and the multi-layer perceptron,
in: Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal
Processing. Linthicum Heights, Maryland, September 6–9, 1993, pp. 30–39.
[9] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by
error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distrib-
uted Processing, vol. I, The MIT Press, Cambridge, Massachusetts, 1986.
[10] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, D.M. Goodman, Ship
wake-detection procedure using conjugate gradient trained artificial neural
networks, IEEE Trans. Geosci. Remote Sens. 29 (5) (1991) 718–726.
[11] S. McLoone, G. Irwin, A variable memory quasi-Newton training algorithm,
Neural Process. Lett. 9 (1999) 77–89.
[12] A.J. Shepherd, Second-Order Methods for Neural Networks, Springer-Verlag,
New York, Inc., 1997.
[13] K. Levenberg, A method for the solution of certain problems in least squares,
Q. Appl. Math. 2 (1944) 164–168.
[14] D. Marquardt, An algorithm for least-squares estimation of nonlinear para-
meters, SIAM J. Appl. Math. 11 (1963) 431–441.
[15] Changhua Yu, Michael T. Manry, Jiang Li, Effects of nonsingular pre-processing
on feed-forward network training, Int. J. Pattern Recognit. Artif. Intell. 19 (2)
(2005) 217–247.
[16] S. Ergezinger, E. Thomsen, An accelerated learning algorithm for multilayer
perceptrons: optimization layer by layer, IEEE Trans. Neural Netw. 6 (1) (1995)
31–42.
[17] M.T. Manry, S.J. Apollo, L.S. Allen, W.D. Lyle, W. Gong, M.S. Dawson, A.K. Fung,
Fast training of neural networks for remote sensing, Remote Sens. Rev. 9
(1994) 77–96.
[18] S.A. Barton, A matrix method for optimizing a neural network, Neural Comput.
3 (3) (1991) 450–459.
[19] J. Olvera, X. Guan, M.T. Manry, Theory of monomial networks, in: Proceedings
of SINS'92, Automation and Robotics Research Institute, Fort Worth, Texas,
1992, pp. 96–101.
[20] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the convergence of
the backpropagation algorithm using learning rate adaptation methods,
Neural Comput. 11 (7) (1999) 1769–1796 (October 1).
[21] Amit Bhaya, Eugenius Kaszkurewicz, Steepest descent with momentum for
quadratic functions is a version of the conjugate gradient method, Neural
Netw. 17 (1) (2004) 65–71.
[22] S.S. Malalur, M.T. Manry, Multiple optimal learning factors for feed-forward
networks, in: Proceedings of SPIE: Independent Component Analyses, Wave-
lets, Neural Networks, Biosystems, and Nanoengineering VIII, vol. 7703,
Orlando Florida, April 7–9, 2010, 77030F.
[23] P. Jesudhas, M.T. Manry, R. Rawat, S. Malalur, Analysis and improvement of
multiple optimal learning factors for feed-forward networks, in: Proceedings
of International Joint Conference on Neural Networks (IJCNN), San Jose,
California, USA, July 31–August 5, 2011, pp. 2593–2600.
[24] W. Kaminski, P. Strumillo, Kernel orthonormalization in radial basis function
neural networks, IEEE Trans. Neural Netw. 8 (5) (1997) 1177–1183.
[25] R.P. Lippman, An introduction to computing with neural nets, IEEE ASSP
Magazine (1987) (April).
[26] R. Battiti, First- and second-order methods for learning: between steepest
descent and Newton's method, Neural Comput. 4 (1992) 141–166.
[27] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design,
Springer-Verlag, London, 2001.
[28] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised
learning, Neural Netw. 6 (1993) 525–533.
[29] P.L. Narasimha, W.H. Delashmit, M.T. Manry, Jiang Li, and F. Maldonado, An
integrated growing-pruning method for feed-forward network training,
NeuroComputing 71 (2008) 2831–2847.
[30] University of Texas at Arlington, Training Data Files – 〈http://www.uta.edu/
faculty/manry/new_mapping.html〉.
[31] M.T. Manry, H. Chandrasekaran, C.-H. Hsieh, Signal Processing Applications of
the Multilayer Perceptron, book chapter in Handbook of Neural Network
Signal Processing, in: Yu Hen Hu, Jenq-Nenq Hwang (Eds.), CRC Press, Boca
Raton, 2001.
[32] US Census Bureau [〈http://www.census.gov〉] (under Lookup Access [〈http://
www.census.gov/cdrom/lookup〉]: Summary Tape File 1).
[33] Bilkent University, Function Approximation Repository – 〈http://funapp.cs.
bilkent.edu.tr/DataSets/〉.
[34] University of Toronto, Delve Data Sets – 〈http://www.cs.toronto.edu/~delve/
data/datasets.html〉.
[35] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of
California, School of Information and Computer Science, Irvine, CA, 2010
http://archive.ics.uci.edu/ml.
[36] A.K. Fung, Z. Li, K.S. Chen, Back scattering from a randomly rough dielectric
surface, IEEE Trans. Geosci. Remote. Sens. 30 (2) (1992).
[37] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis., Modeling wine preferences
by data mining from physicochemical properties, Decision Support Systems,
vol. 47, Elsevier (2009) 547–553 (ISSN: 0167-9236).
[38] A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of
Parkinson's disease progression by non-invasive speech tests, IEEE Trans.
Biomed. Eng. 57 (4) (2009) 884–893.
Sanjeev S. Malalur received his M.S. and Ph.D. in
Electrical Engineering from The University of Texas at
Arlington, mentored by Prof. Michael T. Manry. For his
doctoral dissertation, he developed a family of fast and
robust training algorithms for the multi-layer percep-
tron feed-forward neural networks. He developed a
fingerprint verification system as a part of his Master's
thesis. He is currently employed by Hunter Well
Science where he continues to extend his research by
developing techniques for modeling, analyzing and
interpreting well log data. His research interests
include machine learning algorithms, with an emphasis
on neural networks, signal and image processing tech-
niques, statistical pattern recognition and biometrics. He has authored several
peer-reviewed publications in his field of research.
Michael T. Manry is a professor at the University of
Texas at Arlington (UTA) and his experience includes
work in neural networks for the past 25 years. He is
currently the director of the Image Processing and
Neural Networks Lab (IPNNL) at UTA. He has published
more than 150 papers and he is a senior member of the
IEEE. He is an associate editor of Neurocomputing, and
also a referee for several journals. His recent work,
sponsored by the Advanced Technology Program of the
state of Texas, Bell Helicopter Textron, E-Systems, Mobil
Research, and NASA has involved the development of
techniques for the analysis and fast design of neural
networks for image processing, parameter estimation,
and pattern classification. He has served as a consultant for the Office of Missile
Electronic Warfare at White Sands Missile Range, MICOM (Missile Command) at
Redstone Arsenal, NSF, Texas Instruments, Geophysics International, Halliburton
Logging Services, Mobil Research, Williams-Pyro and Verity Instruments. Dr. Manry
is President of Neural Decision Lab LLC, which is a member of the Arlington
Technology Incubator. The company commercializes software developed by
the IPNNL.
Praveen Jesudhas received the B.E in Electrical Engineer-
ing from Anna University, India in 2008 and the M.S.
degree in Electrical Engineering from the University of
Texas at Arlington, USA in 2010. In his university research,
Praveen worked on problems in iris detection, iris feature
extraction, and second order training algorithms for neural
nets. As a Pattern Recognition Intern at FastVDO LLC in
Maryland, he developed particle filter-based vehicle track-
ing algorithms and Adaboost-based vehicle detection soft-
ware. Currently he is employed at Egrabber, India as a
Research Engineer, where he uses machine learning to
solve problems related to automated online information
retrieval. His current interests include neural networks,
machine learning, natural language processing and big data analytics. He is a member of
Tau Beta Pi.
S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1501

More Related Content

What's hot

A Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformA Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformCSCJournals
 
An optimal general type-2 fuzzy controller for Urban Traffic Network
An optimal general type-2 fuzzy controller for Urban Traffic NetworkAn optimal general type-2 fuzzy controller for Urban Traffic Network
An optimal general type-2 fuzzy controller for Urban Traffic NetworkISA Interchange
 
Local Binary Fitting Segmentation by Cooperative Quantum Particle Optimization
Local Binary Fitting Segmentation by Cooperative Quantum Particle OptimizationLocal Binary Fitting Segmentation by Cooperative Quantum Particle Optimization
Local Binary Fitting Segmentation by Cooperative Quantum Particle OptimizationTELKOMNIKA JOURNAL
 
Efficient design of feedforward network for pattern classification
Efficient design of feedforward network for pattern classificationEfficient design of feedforward network for pattern classification
Efficient design of feedforward network for pattern classificationIOSR Journals
 
Mlp mixer image_process_210613 deeplearning paper review!
Mlp mixer image_process_210613 deeplearning paper review!Mlp mixer image_process_210613 deeplearning paper review!
Mlp mixer image_process_210613 deeplearning paper review!taeseon ryu
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingSubhajit Sahu
 
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqCrude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqKiogyf
 
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...cseij
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copyGuy Lebanon
 
Problems in Task Scheduling in Multiprocessor System
Problems in Task Scheduling in Multiprocessor SystemProblems in Task Scheduling in Multiprocessor System
Problems in Task Scheduling in Multiprocessor Systemijtsrd
 
Representations for large-scale (Big) Sequence Data Mining
Representations for large-scale (Big) Sequence Data MiningRepresentations for large-scale (Big) Sequence Data Mining
Representations for large-scale (Big) Sequence Data MiningVijay Raghavan
 
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTOR
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTORARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTOR
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTORijac123
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...grssieee
 
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...Efficient Block Classification of Computer Screen Images for Desktop Sharing ...
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...DR.P.S.JAGADEESH KUMAR
 
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...IDES Editor
 
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATIONA STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATIONADEIJ Journal
 

What's hot (20)

A Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak TransformA Text-Independent Speaker Identification System based on The Zak Transform
A Text-Independent Speaker Identification System based on The Zak Transform
 
An optimal general type-2 fuzzy controller for Urban Traffic Network
An optimal general type-2 fuzzy controller for Urban Traffic NetworkAn optimal general type-2 fuzzy controller for Urban Traffic Network
An optimal general type-2 fuzzy controller for Urban Traffic Network
 
Local Binary Fitting Segmentation by Cooperative Quantum Particle Optimization
Local Binary Fitting Segmentation by Cooperative Quantum Particle OptimizationLocal Binary Fitting Segmentation by Cooperative Quantum Particle Optimization
Local Binary Fitting Segmentation by Cooperative Quantum Particle Optimization
 
Efficient design of feedforward network for pattern classification
Efficient design of feedforward network for pattern classificationEfficient design of feedforward network for pattern classification
Efficient design of feedforward network for pattern classification
 
Mlp mixer image_process_210613 deeplearning paper review!
Mlp mixer image_process_210613 deeplearning paper review!Mlp mixer image_process_210613 deeplearning paper review!
Mlp mixer image_process_210613 deeplearning paper review!
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
 
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqCrude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
 
ssc_icml13
ssc_icml13ssc_icml13
ssc_icml13
 
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR...
 
llorma_jmlr copy
llorma_jmlr copyllorma_jmlr copy
llorma_jmlr copy
 
lcr
lcrlcr
lcr
 
Problems in Task Scheduling in Multiprocessor System
Problems in Task Scheduling in Multiprocessor SystemProblems in Task Scheduling in Multiprocessor System
Problems in Task Scheduling in Multiprocessor System
 
Representations for large-scale (Big) Sequence Data Mining
Representations for large-scale (Big) Sequence Data MiningRepresentations for large-scale (Big) Sequence Data Mining
Representations for large-scale (Big) Sequence Data Mining
 
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTOR
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTORARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTOR
ARTIFICIAL NEURAL NETWORK APPROACH TO MODELING OF POLYPROPYLENE REACTOR
 
Radial Basis Function
Radial Basis FunctionRadial Basis Function
Radial Basis Function
 
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
 
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...Efficient Block Classification of Computer Screen Images for Desktop Sharing ...
Efficient Block Classification of Computer Screen Images for Desktop Sharing ...
 
A0270107
A0270107A0270107
A0270107
 
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...
Optimal Synthesis of Array Pattern for Concentric Circular Antenna Array Usin...
 
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATIONA STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION
 

Similar to 1-s2.0-S092523121401087X-main

A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...Cemal Ardil
 
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdfNEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdfSowmyaJyothi3
 
High performance extreme learning machines a complete toolbox for big data a...
High performance extreme learning machines  a complete toolbox for big data a...High performance extreme learning machines  a complete toolbox for big data a...
High performance extreme learning machines a complete toolbox for big data a...redpel dot com
 
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...ijdmtaiir
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...ijistjournal
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...ijistjournal
 
Complexity analysis of multilayer perceptron neural network embedded into a w...
Complexity analysis of multilayer perceptron neural network embedded into a w...Complexity analysis of multilayer perceptron neural network embedded into a w...
Complexity analysis of multilayer perceptron neural network embedded into a w...Amir Shokri
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_ASrimatre K
 
An Alternative Genetic Algorithm to Optimize OSPF Weights
An Alternative Genetic Algorithm to Optimize OSPF WeightsAn Alternative Genetic Algorithm to Optimize OSPF Weights
An Alternative Genetic Algorithm to Optimize OSPF WeightsEM Legacy
 
machine learning for engineering students
machine learning for engineering studentsmachine learning for engineering students
machine learning for engineering studentsKavitabani1
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelIJECEIAES
 
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...cscpconf
 
Design of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachDesign of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachEditor Jacotech
 
Design of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachDesign of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachEditor Jacotech
 

Similar to 1-s2.0-S092523121401087X-main (20)

Analysis_molf
Analysis_molfAnalysis_molf
Analysis_molf
 
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...
 
H046014853
H046014853H046014853
H046014853
 
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdfNEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
NEURALNETWORKS_DM_SOWMYAJYOTHI.pdf
 
High performance extreme learning machines a complete toolbox for big data a...
High performance extreme learning machines  a complete toolbox for big data a...High performance extreme learning machines  a complete toolbox for big data a...
High performance extreme learning machines a complete toolbox for big data a...
 
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
 
Ffnn
FfnnFfnn
Ffnn
 
Complexity analysis of multilayer perceptron neural network embedded into a w...
Complexity analysis of multilayer perceptron neural network embedded into a w...Complexity analysis of multilayer perceptron neural network embedded into a w...
Complexity analysis of multilayer perceptron neural network embedded into a w...
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Jf3515881595
Jf3515881595Jf3515881595
Jf3515881595
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_A
 
An Alternative Genetic Algorithm to Optimize OSPF Weights
An Alternative Genetic Algorithm to Optimize OSPF WeightsAn Alternative Genetic Algorithm to Optimize OSPF Weights
An Alternative Genetic Algorithm to Optimize OSPF Weights
 
machine learning for engineering students
machine learning for engineering studentsmachine learning for engineering students
machine learning for engineering students
 
Black-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX modelBlack-box modeling of nonlinear system using evolutionary neural NARX model
Black-box modeling of nonlinear system using evolutionary neural NARX model
 
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
 
tsopze2011
tsopze2011tsopze2011
tsopze2011
 
Design of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachDesign of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approach
 
Design of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approachDesign of airfoil using backpropagation training with mixed approach
Design of airfoil using backpropagation training with mixed approach
 

1-s2.0-S092523121401087X-main

  • 1. Multiple optimal learning factors for the multi-layer perceptron Sanjeev S. Malalur, Michael T. Manry n , Praveen Jesudhas University of Texas at Arlington, Department of Electrical Engineering, Nedderman Hall, Room 517, Arlington, TX 76019, USA a r t i c l e i n f o Article history: Received 10 April 2010 Received in revised form 14 June 2014 Accepted 21 August 2014 Communicated by: G. Thimm Available online 11 September 2014 Keywords: Multilayer perceptron Newton's method Hessian Orthogonal least squares Multiple optimal learning factor Whitening transform a b s t r a c t A batch training algorithm is developed for a fully connected multi-layer perceptron, with a single hidden layer, which uses two-stages per iteration. In the first stage, Newton's method is used to find a vector of optimal learning factors (OLFs), one for each hidden unit, which is used to update the input weights. Linear equations are solved for output weights in the second stage. Elements of the new method's Hessian matrix are shown to be weighted sums of elements from the Hessian of the whole network. The effects of linearly dependent inputs and hidden units on training are analyzed and an improved version of the batch training algorithm is developed. In several examples, the improved method performs better than first order training methods like backpropagation and scaled conjugate gradient, with minimal computational overhead and performs almost as well as Levenberg–Marquardt, a second order training method, with several orders of magnitude fewer multiplications. & 2014 Elsevier B.V. All rights reserved. 1. Introduction Multi-layer perceptron (MLP) neural networks are widely used for regression and classification applications in the areas of parameter estimation [1,2], document analysis and recognition [3], finance and manufacturing [4] and data mining [5]. Due to its layered, parallel architecture, the MLP has several favorable properties such as universal approximation [6] and the ability to mimic Bayes discriminant [7] and maximum a-posteriori (MAP) estimates [8]. However, actual MLP performance is adversely affected by the limitations of the available training algorithms. Common batch training algorithms for the MLP include first order methods such as backpropagation (BP) [9] and conjugate gradient [10] and second order learning methods related to New- ton's method. First order methods generally use fewer operations per iteration but require more iterations than second order methods. Newton's method for training the MLP often has non-positive definite [11,12] or even singular Hessians. Hence Levenberg–Mar- quardt (LM) [13,14] and other quasi-Newton methods are used instead. Unfortunately, second order methods do not scale well. Although first order methods scale better, they are sensitive to input means and gains [15], since they lack affine invariance. Layer-by- layer approaches [16] also exist for optimizing the MLP. Such approaches (i) divide the network weights into disjoint subsets, and (ii) train the subsets separately. Some network weight sub- sets have nonsingular Hessians, allowing them to be trained via Newton's method. For example, solving linear equations for output weights [15,17–19] is actually Newton's algorithm for the output weights. The optimal learning factor (OLF) [20] for BP training is found using a one by one Hessian. If the learning factor and momentum term gain in BP are found using a two by two Hessian [21], the resulting algorithm is very similar to conjugate gradient. The primary purpose of this paper is to present a family of learning algorithms targeted towards training a fixed architecture, fully connected multi-layer perceptron with a single hidden layer, capable of learning from regression/approximation type application and data. In [22] a two-stage training method was introduced, which uses Newton's method to obtain a vector of optimal learning factors, one for each MLP hidden unit. These learning factors are optimal hence it was named multiple optimal learning factors (MOLF). A variation to MOLF called the variable optimal learning factors (VOLF) was presented in [23]. As a learning method, VOLF looks promising. However the original MOLF was still better in terms of training performance. In this paper, we explain in detail the motiva- tion behind the original MOLF algorithm using the concept of equivalent networks. We analyze the structure of the MOLF Hessian matrix and demonstrate how linear dependency among inputs and hidden units can impact the structure, ultimately affecting training performance. This leads to the development of an improved version of the MOLF whose performance is not impacted by the presence of linear dependencies and has an overall performance better than the Contents lists available at ScienceDirect journal homepage: www.elsevier.com/locate/neucom Neurocomputing http://dx.doi.org/10.1016/j.neucom.2014.08.043 0925-2312/& 2014 Elsevier B.V. All rights reserved. n Corresponding author. Present address: Department of Electrical Engineering, The University of Texas at Arlington, Arlington, TX 76019, USA. Tel.: þ1 817 272 3483. E-mail addresses: Sanjeev.malalur@gmail.com (S.S. Malalur), manry@uta.edu (M.T. Manry). Neurocomputing 149 (2015) 1490–1501
  • 2. original MOLF. The rest of the paper is organized as follows. Section 2 reviews MLP notation, a simple first order training method, and an expression for the OLF. The MOLF method is presented in Section 3 motivated by a discussion of equivalent networks. Section 4 analyzes the effect of linearly dependent inputs and hidden units on MOLF learning. An improvement to the MOLF algorithm is presented in Section 5 that is immune to the presence of linearly dependent inputs and hidden units. In Section 6, our methods are compared to various existing first and second order algorithms. Section 7 summarizes our conclusions. 2. Review of multi-layer perceptron First, we introduce the notation used in the rest of the paper and review a simple two-stage algorithm that makes limited use of Newton's method. Observations from the two-stage method are used to motivate the MOLF approach. 2.1. MLP notation A fully connected feed-forward MLP network with a single hidden layer is shown in Fig. 1. It consists of a layer for inputs, a single layer of hidden units and a layer of outputs. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights woh(m, k) connect the kth hidden unit's activation op(k) to the mth output yp(m), which has a linear activation. The bypass weight woi(m, n) connects the nth input to the mth output. The training data, described by the set {xp, tp} consists of N-dimensional input vectors xp and M-dimensional desired output vectors, tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the data set. In order to handle thresholds in the hidden and output layers, the input vectors are augmented by an extra element xp(Nþ1) where, xp(Nþ1)¼1, so xp¼[xp(1), xp(2),…, xp(Nþ1)]T . Let Nh denote the number of hidden units. The vector of hidden layer net functions, np and the output vector of the network, yp can be written as np ¼ W Uxp; yp ¼ Wio U xp þ Woh Uop ð1Þ where the kth element of the hidden unit activation vector op is calculated as op(k)¼f(np(k)) and f(∙) denotes the hidden layer activation function such as the sigmoid activation [9], which is the one used throughout this paper. In this paper, the cost function used in MLP training is the classic mean squared error between the desired and the actual network outputs, defined as E ¼ 1 Nv ∑ Nv p ¼ 1 ∑ M m ¼ 1 ½tp mð ÞÀyp mð ÞŠ2 ð2Þ 2.2. Training using output weight optimization-backpropagation Here, we introduce output weight optimization-backpropagation [17] (OWO-BP), a first order algorithm that groups the weights in the network into two subsets and trains them separately. In a given iteration of OWO-BP, the training proceeds from (i) finding the output weight matrices, Woh and Woi connected to the network outputs to (ii) separately training the input weights W using BP. Output weight optimization (OWO) is a technique for calculating the network's output weights [18]. From (1) it is clear that our MLP's output layer activations are linear. Therefore, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the actual outputs given in (1) can be re-written as yp ¼ Wo Uxp ð3Þ where xp ¼[xT p, oT p]T is the augmented input column vector or basis vector of size Nu where Nu ¼NþNhþ1. The M by Nu output weight matrix Wo, formed as [Woi: Woh], contains all the output weights. The output weights can be found by setting ∂E/∂Wo¼0 which leads to a set of linear equations given by Cx ¼ Rx UWT o ð4Þ where Cx ¼ 1 Nv ∑ Nv p ¼ 1 xp UtT p Rx ¼ 1 Nv ∑ Nv p ¼ 1 xp UxT p Eq. (4) comprises of M sets of Nu linear equations in Nu unknowns and is most easily solved using orthogonal least squares (OLS) [24]. In the second half of an OWO-BP iteration, the input weight matrix W is updated as W’W þzUG ð5Þ where G is the generic representation of a direction matrix that contains information about the direction of learning and the learning factor, z, contains information about the step length to be taken in the direction G. For backpropagation, the direction matrix is nothing but the Nh by (Nþ1) negative input weight Jacobian matrix computed as G ¼ ∂E ∂W G ¼ 1 Nv ∑ Nv p ¼ 1 δp UxT p ð6Þ Here δp¼[δp(1), δp(2),…, δp(Nh)]T is the Nh by 1 column vector of hidden unit delta functions [9]. A brief outline of training using OWO-BP is given below. For every training iteration, 1. In the first stage, solve the system of linear equations in (4) using OLS and update the output weights, Wo; 2. In the second stage, begin by finding the negative Jacobian matrix G described in Eq. (6); 3. Then, update the input weights, W, using Eq. (5). This method is attractive for a couple of reasons. First, the training is faster than BP, since training weights connected to theFig. 1. A fully connected multi-layer perceptron. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1491
  • 3. outputs is equivalent to solving linear equations [16] and second, it helps us avoid some local minima. 2.3. Optimal learning factor The choice of learning factor z in Eq. (5) has a direct effect on the convergence rate of OWO-BP. Early steepest descent methods used a fixed constant, with slow convergence, while later methods used a heuristic scaling approach to modify the learning factor between iterations and thus speed up the rate of convergence. However, using Taylor's series for the error E, in Eq. (2), a non- heuristic optimal learning factor (OLF) for OWO-BP can be derived [20] as, z ¼ À∂E=∂z ∂2 E=∂z2 z ¼ 0 ð7Þ where the numerator and denominator derivatives are evaluated at z¼0. The expression for the second derivative of the error with respect to the OLF is found using (2) as, ∂2 E ∂z2 ¼ ∑ Nh k ¼ 1 ∑ Nh j ¼ 1 ∑ N þ 1 n ¼ 1 gðk; nÞU ∑ N þ 1 i ¼ 1 ∂2 E ∂w k; nð Þ∂w j; ið Þ g j; ið Þ ¼ ∑ Nh k ¼ 1 ∑ Nh j ¼ 1 gT k Hk;j R gj ð8Þ where column vector gk contains elements g(k, n) of G, for all values of n. HR is the input weight Hessian with Niw rows and columns, where Niw ¼(Nþ1) Nh is the number of input weights. Clearly, HR is reduced in size compared to H, the Hessian for all weights in the entire network. Hk;j R contains elements of HR for all input weights connected to the jth and kth hidden units and has size (Nþ1) by (Nþ1). When Gauss–Newton [12] updates are used, elements of HR are computed as ∂2 E ∂wðj; iÞ∂w k; nð Þ ¼ 2 Nv u j; kð Þ ∑ Nv p ¼ 1 xp ið Þxp nð Þop' jð Þop' kð Þ u j; kð Þ ¼ ∑ M m ¼ 1 woh m; jð Þwohðm; kÞ ð9Þ o0 p(k) denotes the first partial derivative of op(k) with respect to its net function. Because (9) represents the Gauss–Newton approx- imation to the Hessian, it is positive semi-definite. Eq. (8) shows that (i) the OLF can be obtained from elements of the Hessian HR, (ii) HR contains useful information even when it is singular, and (iii) a smaller non-singular Hessian ∂2 E/∂z2 can be constructed using HR. 2.4. Motivation for further work Even though the OWO stage of OWO-BP is equivalent to Newton's algorithm for the output weights, the overall algorithm is only marginally better than standard BP, and so it is rarely used. This occurs because the BP stage is first order. The one by one Hessian used in the OLF calculations is the only one associated with the input weight matrix W. Therefore, improving the perfor- mance of OWO-BP may require that we develop a larger Hessian matrix associated with W. 3. Multiple optimal learning factor algorithm In this section, we attempt to improve input weight training by deriving a vector of optimal learning factors, z, which contains one element for each hidden unit. The resulting improvement to OWO-BP is called the multiple optimal learning factors (MOLF) training method. We begin by first introducing the concept of equivalent networks and identify the conditions under which two networks can generate the same outputs. 3.1. Discussion of equivalent networks Let MLP 1 have a net function vector np and output vector yp as discussed previously in Section 2.1. In a second network called MLP 2, which is trained similarly, the net function vector and the output vector are respectively denoted by np' and yp'. The bypass and output weight matrices along with the hidden unit activation functions are considered to be equal for both the MLP 1 and MLP 2. Based on this information the output vectors for MLP 1 and MLP 2 are respectively, yp ið Þ ¼ ∑ N þ1 n ¼ 1 woi i; nð Þxp nð Þþ ∑ Nh k ¼ 1 woh i; kð Þf ðnp kð ÞÞ ð10Þ yp' ið Þ ¼ ∑ N þ 1 n ¼ 1 woi i; nð Þxp nð Þþ ∑ Nh k ¼ 1 woh i; kð Þf np' kð Þ À Á ð11Þ In order to make MLP 2 strongly equivalent to MLP 1 their respective output vectors yp and yp' have to be equal. Based on Eqs. (10) and (11) this can be achieved only when their respective net function vectors are equal [25]. MLP 2 can be made strongly equivalent to MLP 1 by linearly transforming its net function vector np' before passing it as an argument to the activation function f(). This process is equivalent to directly using the net function np in MLP 2, denoted as, np ¼ C Unp' ð12Þ The linear transformation matrix C used to transform np' in Eq. (12) is of size Nh-by-Nh. This process of transforming the net function vector of MLP 2 is represented in Fig. 2. On applying the linear transformation to the net function vector n0 p in Eq. (12), MLP 2 is made strongly equivalent to MLP1. We then get, yp' ið Þ ¼ yp ið Þ ¼ ∑ N þ1 n ¼ 1 woi i; nð Þxp nð Þþ ∑ Nh k ¼ 1 woh i; kð Þf ∑ Nh m ¼ 1 c k; mð Þnp' mð Þ ! ð13Þ After making MLP 2 strongly equivalent to MLP 1, the net function vector np can be related to its input weights W and the net function vector np' as, np kð Þ ¼ ∑ N þ1 n ¼ 1 w k; nð Þxp nð Þ ¼ ∑ Nh m ¼ 1 c k; mð Þnp' ðmÞ ð14Þ Similarly the MLP 2 net function vector np' can be related to its weights W0 as np' mð Þ ¼ ∑ N þ 1 n ¼ 1 w' m; nð Þxp nð Þ ð15Þ Fig. 2. Graphical representation of net function vectors. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011492
  • 4. On substituting (15) into (14) we get, npðkÞ ¼ ∑ N þ 1 n ¼ 1 wðk; nÞxpðnÞ ¼ ∑ N þ 1 n ¼ 1 ∑ Nh m ¼ 1 cðk; mÞUw'ðm; nÞxpðnÞ ð16Þ In (16) we see that the input weight matrices of MLPs 1 and 2 are linearly related as W ¼ C UW' ð17Þ If the elements of the C matrix are found through an optimality criterion, then optimal input weights W0 can be computed as W' ¼ C À 1 UW ð18Þ 3.2. Transformation of net function vectors In Section 3.1 we showed a method for generating a network, MLP 2, which is strongly equivalent to the original network, MLP 1. In this section, we show that equivalent networks train differently. The output of the transformed network is given in Eq. (14). The elements of the negative Jacobian matrix G' of MLP 2 found from Eqs. (2) and (14) are defined as, g' u; vð Þ ¼ À ∂E ∂w' u; vð Þ ¼ 2 Nv ∑ Nv p ¼ 1 ∑ M i ¼ 1 tp ið ÞÀyp ið Þ h i U ∂yp ið Þ ∂w' u; vð Þ ∂yp ið Þ ∂w' u; vð Þ ¼ ∑ Nh k ¼ 1 woh i; kð Þop' kð Þc k; uð ÞxpðvÞ ð19Þ o0 p(k) denotes the first partial derivative of op(k). Rearranging terms in (19) results in, g'ðu; vÞ ¼ ∑ Nh k ¼ 1 cðk; uÞ 2 Nv ∑ Nv p ¼ 1 ∑ M i ¼ 1 ½tpðiÞÀypðiÞŠU wohði; kÞo'pðkÞxpðvÞ ¼ ∑ Nh k ¼ 1 cðk; uÞ À∂E ∂wðk; vÞ ð20Þ This is abbreviated as G' ¼ CT UG ð21Þ The input weight update equation for MLP 2 based on its Jacobian vector G0 is given by, W' ¼ W' þzUG' On pre-multiplying this equation by C and using Eqs. (18) and (21) we obtain W ¼ W þzUG'' ð22Þ where, G'' ¼ RUG R ¼ C UCT ð23Þ Lemma 1. For a given R matrix in Eq. (23), the number of C matrices is infinite. Continuing, the transformed Jacobian G'' of MLP 1 is given in terms of the original Jacobian G as, g''ðu; vÞ ¼ ∑ Nh k ¼ 1 rðu; kÞgðk; vÞ gðk; vÞ ¼ À∂E ∂wðk; vÞ ¼ 2 Nv ∑ Nv p ¼ 1 ∑ M i ¼ 1 ½tpðiÞÀypðiÞŠUwohði; kÞo0 pðkÞxpðvÞ ð24Þ where o'p(k) denotes the derivative of op(k) with respect to its net function. Eqs. (22) and (23) show that valid input weight update matrices G'' can be generated by simply transforming the weight update matrix G from MLP 1, using an autocorrelation matrix R. Since uncountably many matrices R exist, it is natural to wonder which one is optimal. 3.3. Multiple optimal learning factors for input weights In this subsection, we explore the idea of finding an optimal transform matrix R for the negative Jacobian G. However, finding a dense R matrix would result in an algorithm with almost the computational complexity of Levenberg–Marquardt (LM). We avoid this by letting the R matrix in Eq. (23) be diagonal. Under this case Eq. (22) becomes, w k; nð Þ ¼ w k; nð ÞþzUr kð Þg k; nð Þ ð25Þ On comparing Eqs. (22) and (25), the expression for the different optimal learning factors zk could be given as, zk ¼ zUr kð Þ ð26Þ Eq. (26) shows that using the MOLF approach for training a MLP is equivalent to transforming the net function vector using a diag- onal transform matrix. Next, we use Newton's method develop our approach for finding z. Assume that an MLP is being trained using OWO-BP. Also assume that a separate OLF zk is being used to update each hidden unit's input weights, w(k, n), where 1rnr(Nþ1). The error function to be minimized is given by (2). The predicted output yp(m) is given by, ypðmÞ ¼ ∑ N þ 1 n ¼ 1 woiðm; nÞxpðnÞ þ ∑ Nh k ¼ 1 wohðm; kÞf ∑ N þ 1 n ¼ 1 ðwðk; nÞþzk Á gðk; nÞÞxpðnÞ where, g(k, n) is an element of the negative Jacobian matrix G and f() denotes the hidden layer activation function. The negative first partial of E with respect to zj is gmolf jð Þ ¼ À ∂E ∂zj ¼ 2 Nv ∑ Nv p ¼ 1 ∑ M m ¼ 1 tp mð ÞÀ ∑ Nh k ¼ 1 woh m; kð Þop zkð Þ # Uwoh m; jð Þop' jð ÞΔnpðjÞ ð27Þ where, tp mð Þ ¼ tp mð ÞÀ ∑ N þ 1 n ¼ 1 woi m; nð Þxp nð Þ; Δnp jð Þ ¼ ∑ N þ 1 n ¼ 1 xp nð ÞUg j; nð Þ op zkð Þ ¼ f ∑ N þ 1 n ¼ 1 ðw k; nð Þþzk Ug k; nð ÞÞxp nð Þ The elements of the Gauss–Newton Hessian Hmolf are hmolf l; jð Þ % ∂2 E ∂zl∂zj ¼ 2 Nv ∑ M m ¼ 1 woh m; lð Þwoh m; jð Þ ∑ Nv p ¼ 1 op' lð Þop' jð ÞΔnp lð ÞΔnp jð Þ ¼ ∑ N þ 1 i ¼ 1 ∑ N þ 1 n ¼ 1 2 Nv u l; jð Þ ∑ Nv p ¼ 1 xp ið Þxp nð Þop' lð Þop' jð Þ # g l; ið ÞUg j; nð Þ ð28Þ where u(l, j) is defined in (9). The Gauss–Newton update guaran- tees that Hmolf is non-negative definite. Given Hmolf and the negative gradient vector gmolf, defined as gmolf ¼ ½À∂E=∂z1; À∂E=∂z2; …; À∂E=∂zNh ŠT We minimize E with respect to the vector z using Newton's method. If Hmolf and gmolf are the Hessian and negative gradient, respectively, of the error with respect to z, then the multiple optimal learning factors are computed by solving Hmolf Uz ¼ gmolf ð29Þ S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1493
  • 5. A brief outline of training using the MOLF algorithm is given below. In each iteration of the training algorithm: 1. Solve linear equations for all output weights using (4) 2. Calculate the negative input weight Jacobian G using BP as given by (6) 3. Calculate z using Newton's method as shown in (29) and update the input weights as w k; nð Þ’w k; nð Þþzk Ug k; nð Þ ð30Þ Here, the MOLF procedure has been inserted into the OWO-BP algorithm, resulting in an algorithm we denote as MOLF-BP. The MOLF procedure can be inserted into other algorithms as well. 3.4. Analysis of the MOLF Hessian This subsection presents an analysis of the structure of the MOLF Hessian and its relation to the reduced Hessian HR of the input weights. Comparing (28) to (9), we see that the term within the square brackets is an element of HR. Hence, hmolf l; jð Þ ¼ ∂2 E ∂zl∂zj ¼ ∑ N þ 1 i ¼ 1 ∑ N þ1 n ¼ 1 ∂2 E ∂w l; ið Þ∂w j; nð Þ g l; ið Þg j; nð Þ ð31Þ For fixed (l, j), hmolf(l, j) can be expressed as, ∂2 E ∂zl∂zj ¼ ∑ N þ1 i ¼ 1 gl ið Þ ∑ N þ 1 n ¼ 1 h l;j R i; nð ÞUgj nð Þ ð32Þ In matrix notation, Hmolf ¼ gT l Hl;j R gj ð33Þ where column vector gl contains G elements g(l, n) for all values of n, where the (Nþ1) by (Nþ1) matrix HR l, j contains elements of HR for weights connected to the lth and jth hidden units. The reduced size Hessian HR l, j in (33) uses four indices (l, i, j, n) and can be viewed as a 4-dimensional array, represented by H4 R whose elements satisfy h l;j R ði; nÞARðN þ 1ÞxNhxNhxðN þ 1Þ Now, (33) can be rewritten as H4 molf ¼ GH4 RGT ð34Þ From (32), we see that hmolf(l, j)¼h4 molf(l, j, l, j), i.e., the 4-dimensional matrix H4 molf is transformed into the 2-dimensional matrix Hmolf, by setting m¼l and u¼j. Each element of the MOLF Hessian combines the information from (Nþ1) rows and columns of the reduced Hessian, HR l, j . This makes MOLF less sensitive to input conditions. Note the similarities between (8) and (33). 4. Effect of dependence on the MOLF algorithm In this section, we analyze the performance of MOLF applied to OWO-BP, in the presence of linear dependence. 4.1. Dependence in the input layer A linearly dependent input can be modeled as a linear combi- nation of any or all inputs as xp Nþ2ð Þ ¼ ∑ N þ 1 n ¼ 1 b nð Þxp nð Þ ð35Þ During OWO, the weights from the dependent input, feeding the outputs will be set to zero and the output weight adaptation will not be affected. During the input weight adaptation, the expression for gradient given by (27) can be re-written as, gmolf jð Þ ¼ À ∂E ∂zj ¼ 2 Nv ∑ Nv p ¼ 1 ∑ M m ¼ 1 tp mð ÞÀ ∑ Nh k ¼ 1 woh m; kð Þop zkð Þ # Uwoh m; jð Þop' jð Þ Δnp jð Þþg k; Nþ2ð Þ ∑ N þ 1 n ¼ 1 b nð Þxp nð Þ ! ð36Þ and the expression for an element of the Hessian in (28) can be re-written as ∂2 E ∂zl∂zj ¼ 2 Nv ∑ M m ¼ 1 woh m; lð Þwoh m; jð Þ ∑ Nv p ¼ 1 op' lð Þop' jð Þ Â Δnp lð ÞΔnp jð Þþ Δnp lð Þg j; Nþ2ð Þ Â ∑ N þ 1 n ¼ 1 b nð Þxp nð ÞþΔnp jð Þg l; Nþ2ð Þ ∑ N þ 1 i ¼ 1 b ið Þxp ið Þ þg j; Nþ2ð Þg l; Nþ2ð Þ ∑ N þ1 n ¼ 1 ∑ N þ1 i ¼ 1 b ið Þxp ið Þb nð Þxp nð Þ # ð37Þ Let H0 molf be the Hessian, when the extra dependent input xp(Nþ2) is included. Then, hmolf' l; jð Þ ¼ hmolf l; jð Þþ 2 Nv u l; jð Þ ∑ Nv p ¼ 1 xp Nþ2ð Þop' lð Þop' jð Þ Â Ã U g l; Nþ2ð Þ ∑ N þ 1 n ¼ 1 xp nð Þg j; nð Þþg j; Nþ2ð Þ ∑ N þ 1 i ¼ 1 xp ið Þg l; ið Þ þxp Nþ2ð Þg l; Nþ2ð Þg j; Nþ2ð Þ # ð38Þ Comparing (27) with (36) and (28) with (37) and (38), we see some additional terms that appear within the square brackets in the expressions for gradient and Hessian in the presence of linearly dependent input. Clearly, these parasitic cross-terms will cause the training using MOLF to be different for the case of linearly dependent inputs. This leads to the following lemma. Lemma 2. Linearly dependent inputs, when added to the network, do not force H'molf to be singular. As seen in (37) and (38), each element h0 molf (m, j) simply gains some first and second degree terms in the variables b(n). 4.2. Dependence in the hidden layer Here we look at how a dependent hidden unit affects the performance of MOLF. Assume that some hidden unit activations are linearly dependent upon others, as opðNh þ1Þ ¼ ∑ Nh k ¼ 1 c kð Þop kð Þ ð39Þ Further assume that OLS is used to solve for output weights. The weights in the network are updated during every training iteration and it is quite possible that this could cause some hidden units to be linearly dependent upon each other. The dependence could manifest in hidden units being identical or a linear combi- nation of other hidden unit outputs. Consider one of the hidden units to be dependent. The autocorre- lation matrix, Rx, will be singular and since we are using OLS to solve for output weights, all the weights connecting the dependent hidden unit to the outputs will be forced to zero. This will ensure that the dependent hidden unit does not contribute to the output. To see if this has any impact on learning in MOLF we can look at the expression for the gradient and Hessian, given by (27) and (28) respectively. Both equations have a sum of product terms containing output weights. Since OWO sets the output weights for the dependent hidden unit to be zero, this will also set the corresponding gradient and Hessian S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011494
  • 6. elements to be zero. In general, any dependence in the hidden layer will cause the corresponding learning factor to be zero and will not affect the performance of MOLF. Lemma 3. For each dependent hidden unit, the corresponding row and column of Hmolf is zero-valued. This follows from (28). The zero-valued rows and columns in Hmolf cause no difficulties if (16) is solved using OLS [26]. 5. Improving the MOLF algorithm Section 4.1 showed that the presence of linearly dependent inputs affect MOLF training, primarily the input weight adaptation. Depen- dent inputs manifest themselves as additional rows and columns in the input weight Jacobian matrix G. If we can find a way to set these additional rows and columns to zero, then we can ensure that the weights connected to the dependent inputs will not get updated and the training will be unaffected by the presence of dependent inputs. In this section, we present a way to mitigate the effect of linearly dependent inputs on training input weights. Without loss of generality, let a linearly dependent input xp(Nþ2) be modeled as in (35), where the first (Nþ1) inputs are linearly independent. Using the singular value decomposition (SVD), the input autocorrelation matrix R of dimension (Nþ2) by (Nþ2) is written as UΣUT . This can be rewritten as R¼AT A, where the transformation matrix A is defined as A ¼ ΣÀ1=2 UT Both the last row and the last column of A are zero. Now, when the input vector with the dependent inputs is transformed as, x0 p ¼ A Á xp ð40Þ the last element, x0 p(Nþ2) will be zero. This type of transformation is termed as whitening transformation [27] in linear algebra and statistics literature. Now, let us investigate the effect of dependent inputs on adjusting input weights using MOLF, when whitening transform is incorporated into training. We need to verify that the transformed input vector x0 p, with zero-values for the dependent inputs, causes no problems. The Nh by (Nþ2) gradient matrix G0 for the transformed data is G' ¼ G Á AT ð41Þ where G is the negative Jacobian matrix before the inputs are transformed. Although the last column of G is dependent upon the other columns, the whitening transform causes the last column of G0 to be zero. The fact that x'p(Nþ2) and g(k, Nþ2) are zero causes (36) to revert to (27) and H0 molf in (38) to equal Hmolf. When OLS is used to find z and then the output weights, xp(Nþ2) has no effect, and weights connected to it are not trained. Eqs. (40) and (41) now provide us with a way to remove line- arly dependent inputs during training, so their presence has no effect. This leads us to develop an improved version of MOLF. The transformation applied to the Jacobian matrix is a whitening transformation (WT). Whitening transformation has two desirable properties: (i) decorrelation, which causes the transformed matrix to have a diagonal covariance matrix and (ii) whitening, which forces the transformed matrix to be a unit covariance (identity) matrix. We call the improved algorithm MOLF-WT. The steps to training an MLP with MOLF-WT are listed below: 1. Compute the input autocorrelation matrix R as given by (4) 2. Use the SVD to compute the whitening transformation matrix A as described above 3. In each iteration of training a. Calculate the negative input weight Jacobian G using BP as given by (6) b. Calculate the whitening transformed Jacobian G0 as described in (41) to eliminate any linear dependency in the inputs c. Calculate z using Newton's method as before, using (29) and update the input weights as described in (30), except that G is replaced by G0 d. Solve linear equations for all output weights as before using (4) 6. Performance comparison and results This section compares the performance of MOLF-WT, with several popular existing feed-forward batch learning algorithms. We begin by listing the various algorithms being compared. We then identify the metrics used for comparing the performances of the various algorithms, provide details of the experimental procedure along with a description of the data files used for obtaining the numerical results. 6.1. List of competing algorithms The performance of MOLF-WT is compared with four other existing methods, namely: 1. Output weight optimization-backpropagation (OWO-BP) 2. Multiple Learning Factor applied to OWO-BP (MOLF-BP) 3. Levenberg–Marquardt (LM) 4. Scaled conjugate gradient (SCG) [28] All algorithms employ the fully connected feed-forward net- work architecture described in Fig. 1. The hidden units have sigmoid activation functions, while the output units all have linear activations. SCG and LM employ one-stage learning where all weights in the network are updated simultaneously in each iteration. OWO-BP, MOLF-BP, and MOLF-WT employ two-stage learning where we first update the input weights and subse- quently solve linear equations for the output weights. A single OLF was used for training in the OWO-BP and LM algorithms, while multiple OLFs were used for MOLF-BP and MOLF-WT. 6.2. Experimental procedure Primarily, we make use of the k-fold methodology for our experi- ments (k¼10 for our simulations). Each data set is split into k non- overlapping parts of equal size. Of these, k-2 parts (roughly 80%) are combined and used for training, while each of the remaining two parts is used for validation and testing respectively (roughly 10% each). Validation is performed per training iteration (to prevent over train- ing) and the network with the minimum validation error is saved. After the training is completed, the network with the minimum validation error is used with the testing data to obtain the final testing error, which measures the networks ability to generalize. The max- imum number of iteration per training session is fixed at 2500 and is the same for all algorithms. There is also an early stopping criterion. Training is repeated till all k combinations have been exhausted (i.e.,10 times). It is known that the initial network plays a role in the networks ability to learn. In order to make our performance comparisons independent of the initial network, we carry out the k-fold methodology 5 times, each time with a different initial network. This leads to a total of 50 different training sessions per algorithm, with 50 different initial networks being used. We call this the 5k-fold method. For a given data set, even though the initial weight S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1495
  • 7. is different in each training session, we use the same 50 initial weight sets across all algorithms. This is done to allow for a fair comparison between the various training methods. We use net control [19] to adjust the weights before training begins. In net control, the randomly initialized input weights and hidden unit thresholds are scaled and biased so that each hidden unit's net function has the same mean and standard deviation. Specifically, the net function means are equal to 0.5 and the corre- sponding standard deviations are equal to 1. In addition, OWO is used to find the initial output weights. This means that all the algorithms listed above have the same starting point (a different one for each session) for all data files. This will be evident in the plots presented later in this section, where we can see that the MSE for the first iteration for all algorithms is the same. At the end of the 5k-fold procedure, we compute (i) the average training error, (ii) the average minimum validation error, (iii) the average testing error, and (iv) the average number of cumulative multiplies required for training, for each algorithm and for each data file. The average is computed over 50 sessions. These quantities form the metrics for comparison and are subsequently used to generate the plots and tables to compare the performance of different learning algorithms. 6.3. Computational cost The proposed MOLF algorithm involves inversion a Hessian matrix, which has a general complexity of O(N3 ), for a square matrix of size N. However, compared to Newton's method or LM, the size of the Hessian is much smaller. Updating weights using Newton's method or LM, requires a Hessian with Nw rows and columns, where Nw ¼M Á Nu þ(Nþ1)Nh and Nu ¼NþNh þ1. In com- parison, the Hessian used in the proposed MOLF has only Nh rows and columns. The number of multiplies required to solve for output weights using OLS is given by Mols ¼ Nu Nu þ1ð Þ Mþ 1 6 Nu 2Nu þ1ð Þþ 3 2 ! ð42Þ The numbers of multiplies per training iteration for OWO-BP, LM and MOLF are given below Mowo Àbp ¼ Nv½ 2Nh Nþ2ð ÞþMðNu þ1Þ þðNu Nu þ1ð ÞÞ=2þM Nþ6Nh þ4ð ÞŠ þMols þNh Nþ1ð Þ ð43Þ Mlm ¼ Nv½MNu þ2Nh Nþ1ð ÞþM Nþ6Nh þ4ð ÞþMNu Nu þ3Nh Nþ1ð Þð Þ þ4N2 h Nþ1ð Þ2 ŠþN3 w þN2 w ð44Þ Mscg ¼ 4Nv Nh Nþ1ð ÞþMNu½ Šþ10 Nh Nþ1ð ÞþMNu½ Š ð45Þ Mmolf ¼ MowoÀbp þNv Nh Nþ4ð ÞÀM Nþ6Nh þ4ð Þ½ ŠþN3 h ð46Þ Note that Mmolf consists of Mowo-bp plus the required multiplies for calculating optimal learning factors. Similarly, Mlm consists of Mbp plus the required multiplies for calculating and inverting the Hessian matrix. We will see later in the plots that despite the additional complexity of Hessian inversion, the number of cumu- lative multiplies to train a network using MOLF is almost the same as OWO-BP or SCG algorithms, which do not involve any matrix inversion operation. We will also see that LM on the other hand has a far greater overhead due the inversion of a much larger Hessian matrix. 6.4. Training data sets Table 1 lists seven data sets used for simulations and perfor- mance comparisons. All the data sets are publicly available. In all data sets, the inputs have been normalized to be zero-mean and unit variance. This way, it becomes clear that MOLF is not equivalent to a mere normalization of the data. The number of hidden units for each data set is selected by first training a multi-layer perceptron with a large number of hidden units followed by a step-wise pruning with validation [29]. The least useful hidden unit is removed at each step till there is only one hidden unit left. The number of hidden units corresponding to the smallest validation error is used for training on that data set. 6.5. Results Numerical results obtained from our training, validation and testing methodology for the proposed MOLF-WT algorithm are compared with those of OWO-BP, MOLF-BP, LM and SCG, using the data files listed in Table 1. Two plots are generated for each data file: (i) the average mean square error (MSE) for training from 10-fold training is plotted versus the number of iterations (shown on a log10 scale) and (ii) the average training MSE from 10-fold training is plotted versus the cumulative number of multiplies (also shown on a log10 scale) for each algorithm. Together, the two plots will not only show the final average MSE achieved by training but also the computations it consumed for getting there. 6.5.1. Prognostics data set This data file is available at the Image Processing and Neural Networks Lab repository [30]. It consists of parameters that are available in the Bell Helicopter health and usage monitoring system (HUMS), which performs flight load synthesis, which is a form of prognostics [31]. The data set containing 17 input features and 9 output parameters. We trained an MLP having 13 hidden units. In Fig. 3-a, the average mean square error (MSE) for 10-fold training is plotted versus the number of iterations for each algorithm. In Fig. 3-b, the average 10-fold training MSE is plotted versus the required number of multiplies (shown on a log10 scale). From Fig. 3-a, LM and MOLF based algorithms have almost identical performance (LM is margin- ally better than MOLF-WT which in turn is marginally better than MOLF-BP). However, the performance of LM comes with a signifi- cantly higher computational demand, as shown in Fig. 3-b. Also, MOLF based algorithms are better than OWO-BP and SCG. 6.5.2. Federal reserve economic data set This file contains some economic data for the USA from 01/04/ 1980 to 02/04/2000 on a weekly basis. From the given features, the goal is to predict the 1-Month CD Rate [32]. For this data file, called TR on its webpage [33], we trained an MLP having 34 hidden units. From Fig. 4-a, LM has the best overall performance, followed closely by MOLF-WT. The MOLF-WT algorithm is better than MOLF-BP. Fig. 4-b shows the computational cost of achieving this Table 1 Data set description. Data set name No. of inputs No. of outputs No. of patterns No. of Hidden Units Prognostics 17 9 4745 13 Federal reserve 15 1 1049 34 Housing 16 1 22,784 30 Concrete 8 1 1030 13 Remote sensing 16 3 5992 25 White wine quality 11 1 4898 24 Parkinson's 16 2 5875 17 S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011496
  • 8. performance. The proposed MOLF algorithms consume slightly more computation than OWO-BP and SCG. However, all the first order algorithms utilized about two orders of magnitude fewer computations than LM. 6.5.3. Housing data set This data file is available on the DELVE data set repository [34]. The data was collected as part of the 1990 US census. For the purpose of this data set a level State-Place was used. Data from all states were obtained. Most of the counts were changed into appropriate proportions. These are all concerned with predicting the median price of houses in a region based on demographic composition and the state of the housing market in the region. House-16H data was used in our simulation, with ‘H’ standing for high difficulty. Tasks with high difficulty have had their attributes chosen to make the modeling more difficult due to higher variance or lower correlation of the inputs to the target. We trained an MLP having 30 hidden units. From Figs. 5-a and -b, we see that the LM had the best performance, followed closely by SCG. The proposed MOLF-WT algorithm has a slight edge over MOLF-BP. 6.5.4. Concrete compressive strength data set This data file is available on the UCI Machine Learning Repo- sitory [35]. It contains the actual concrete compressive strength (MPa) for a given mixture under a specific age (days) determined from laboratory. The concrete compressive strength is a highly nonlinear function of age and ingredients. We trained an MLP having 13 hidden units. For this data set, Fig. 6 shows that LM has the best overall average training error, followed very closely by the improved MOLF-WT, which is marginally better than MOLF-BP. SCG attains a performance almost identical to MOLF-WT. 6.5.5. Remote sensing data set This data file is available on the Image Processing and Neural Networks Lab repository [30]. It consists of 16 inputs and 3 outputs and represents the training set for inversion of surface permittiv- ity, the normalized surface rms roughness, and the surface correlation length found in back scattering models from randomly rough dielectric surfaces [36]. We trained an MLP having 25 hidden units. From Fig. 7-a, the MOLF-WT has the best overall average training MSE. Fig. 4. Federal reserve data: average error vs. (a) iterations and (b) multiplies per iteration. Fig. 3. Prognostics data set: average error vs. (a) iterations and (b) multiplies per iteration. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1497
  • 9. 6.5.6. White wine quality data set The Wine Quality-White data set [37] was created using white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This data set is available on the UCI data repository [35]. For this data file, the number of hidden units was chosen to be 24. For this data set, Fig. 8 shows that LM and the improved MOLF-WT have almost identical performance and are better than the rest. 6.5.7. Parkinson's disease data set The Parkinson's data set [38] is available on the UCI data repository [35]. It is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The main aim of the data is to predict the motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’) from the 16 voice measures. For this data set the number of hidden units was chosen to be 17. From Fig. 9, LM attains the best overall training error, followed by the improved MOLF-WT. We can see again that the improved MOLF-WT performs better than MOLF-BP. Table 2 is a compilation of the metrics used for comparing the performance of all algorithms on all data sets. It comprises of the average minimum testing and validation errors. Based on the results obtained on the selected data sets, we can make a few observations. From the training plots and the data from Table 2, we can say that 1. LM is the top performer in 4 out of the 7 data sets. However, LM being a second order method, this performance comes at a significant cost of computation – almost two orders of magni- tude greater than the rest; 2. Both SCG and OWO-BP consistently appear in the last two places in terms of average training error on 5 out of 7 data sets (Housing and Concrete data sets being the exception). How- ever, being first order methods, they set the bar for the lowest cost of computation; 3. Both MOLF-BP and MOLF-WT always perform better than their common predecessor, OWO-BP, on all data sets and they are Fig. 5. Housing data: average error vs. (a) iterations and (b) multiplies per iteration. Fig. 6. Concrete data: average error vs. (a) iterations and (b) multiplies per iteration. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011498
  • 10. better than SCG except for the housing data set. This perfor- mance is achieved with minimal computational overhead compared to both SCG and OWO-BP, as evident in the plots of error vs. number of cumulative multiplications; 4. MOLF-BP is almost never better than LM, except for the remote sensing data set where it is marginally better; 5. MOLF-WT always performs better than its immediate prede- cessor MOLF-BP (although MOLF-BP comes very close on 3 of the 7 data sets); 6. MOLF-WT achieves a better performance than LM on one data set (remote sensing) and has almost identical performance on two other data sets (wine and prognostics). It is also a close second on 3 of the remaining 4 data sets. The same cannot be said of MOLF-BP; 7. LM is again consistently the top performer with the best average testing error on 6 out 7 data sets, while MOLF based algorithms figure consistently in the top two performers (note that the network with the smallest validation error is used to obtain the testing error). In general, we can see that 1. Both MOLF-BP and the improved MOLF-WT algorithms con- sistently outperform OWO-BP in all three phases of learning (i.e. MOLF inserted into OWO-BP does help improve its perfor- mance in training, validation and testing); 2. Both MOLF-BP and the improved MOLF-WT algorithms con- sistently outperform SCG in terms of the average minimum testing error; 3. The MOLF-WT very often has a performance comparable to LM and it attains that with minimal computational overhead; 4. While MOLF-BP is an improvement over OWO-BP, it is not as good as MOLF-WT. 7. Conclusions Starting with the concept of equivalent networks and its implications, a class of hybrid learning algorithms is presented Fig. 7. Remote sensing data: average error vs. (a) iterations and (b) multiplies per iteration. Fig. 8. Wine quality data: average error vs. (a) iterations and (b) multiplies per iteration. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1499
  • 11. that uses first and second order information to simultaneously calculate optimal learning factors for all hidden units and train the input weights. Detailed analysis of the structure of the Hessian matrix for the proposed algorithm in relation to the full Newton Hessian matrix is presented, showing how the smaller MOLF Hessian could potentially be less ill-conditioned. Elements of the reduced size Hessian are shown to be weighted sums of the elements of the entire network's Hessian. Limitations of the basic MOLF procedure in the presence of linearly dependent inputs is shown mathematically. The limitations are used as a launch pad to develop a version of MOLF that incorporates a whitening trans- formation type operation to mitigate the effect of linearly depen- dent inputs. This version is called MOLF-WT. A detailed experimental procedure is presented and the perfor- mance of MOLF-BP and MOLF-WT are compared to 3 other existing feed-forward batch learning algorithms on seven publicly available data sets. For the data sets investigated, both MOLF-BP and MOLF-WT are about as computationally efficient per iteration as OWO-BP and SCG. MOLF-BP consistently performs better than OWO-BP. In effect the MOLF procedure improves OWO-BP by using an Nh by Nh Hessian in place of the normal 1 by 1 Hessian used to calculate a scalar OLF. MOLF-WT has a performance comparable and in some cases better than that of LM. From the results, MOLF-WT stays close to the computational cost of OWO-BP and SCG to achieve a performance comparable with LM and may be a viable alternative to LM in terms of efficiency of training, minimum validation error and testing errors. Since they use smaller Hessian matrices, the MOLF methods presented in this paper require fewer multiplies per iteration than traditional second order methods based on Newton's method. Finally, it is clear from the figures that MOLF-BP and MOLF-WT decrease the training error much faster than the other algorithms. In future work, the MOLF approach will be applied to other training algorithms and to networks having additional hidden layers. An alternate approach, in which each network layer has its own optimal learning factor, will also be tried. In addition, the MOLF algorithm will be extended to handle classification type applications. References [1] R.C. Odom, P. Pavlakos, S.S. Diocee, S.M. Bailey, D.M. Zander, J.J. Gillespie, Shaly sand analysis using density-neutron porosities from a cased-hole pulsed neutron system, SPE Rocky Mountain Regional Meeting Proceedings: Society of Petroleum Engineers, 1999, pp. 467–476. [2] A. Khotanzad, M.H. Davis, A. Abaye, D.J. Maratukulam, An artificial neural network hourly temperature forecaster with applications in load forecasting, IEEE Trans. Power Syst. 11 (2) (1996) 870–876. [3] S. Marinai, M. Gori, G. Soda, Artificial neural networks for document analysis and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 23–35. [4] J. Kamruzzaman, R.A. Sarker, R. Begg, Artificial Neural Networks: Applications in Finance and Manufacturing, Idea Group Inc (IGI), Hershey, PA, 2006. Fig. 9. Parkinson's disease data: average error vs. (a) iterations and (b) multiplies per iteration. Table 2 Statistics of testing and validation errors. Comparison metric Data file name OWO-BP SCG MOLF-BP MOLF-WT LM Average minimum testing error Prognostics 2.4187E7 2.6845E7 1.5985E7 1.4973E7 1.4509E7 Federal reserve 0.108636974 0.33502068 0.105951897 0.120668845 0.103284696 Housing 1.7329E9 1.7441E9 1.2758E9 1.3599E9 1.1627E9 Concrete 33.48613532 83.78832466 32.57090477 33.72180127 29.73279111 Remote sensing 1.568023633 1.51843944 2.189078998 0.136484007 0.637828985 White wine 0.513739373 1.25867336 0.507925435 0.514729807 0.499294581 Parkinson's 139.0323995 125.5411318 129.4456136 127.3415428 122.2485104 Average minimum validation error Prognostics 2.4764E7 2.6675E7 1.5999E7 1.5155E7 1.4329E7 Federal reserve 0.108442923 0.17653648 0.09675038 0.10759092 0.09371866 Housing 1.3512E9 1.1683E9 1.2414E9 1.2109E9 1.0662E9 Concrete 33.43842214 48.63268454 29.52707558 28.72108048 26.4212992 Remote sensing 1.534430917 0.71842796 0.57845174 0.15130356 0.37738708 White wine 0.503644249 0.50386288 0.49644024 0.49550544 0.49058676 Parkinson's 131.2226476 124.3035235 124.0881855 122.3187088 117.3367678 S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–15011500
  • 12. [5] L. Wang, X. Fu, Data Mining With Computational Intelligence, Springer-Verlag, New York, Inc, Seacaucus, NJ, USA, 2005. [6] G. Cybenko, Approximations by superposition of a sigmoidal function, Math. Control. Signals Syst. (MCSS) 2 (1989) 303–314. [7] D.W. Ruck, et al., The multi-layer perceptron as an approximation to a Bayes optimal discriminant function, IEEE Trans. Neural Netw. 1 (4) (1990) 296–298. [8] Q. Yu, S.J. Apollo, M.T. Manry, MAP estimation and the multi-layer perceptron, in: Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing. Linthicum Heights, Maryland, September 6–9, 1993, pp. 30–39. [9] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distrib- uted Processing, vol. I, The MIT Press, Cambridge, Massachusetts, 1986. [10] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, D.M. Goodman, Ship wake-detection procedure using conjugate gradient trained artificial neural networks, IEEE Trans. Geosci. Remote Sens. 29 (5) (1991) 718–726. [11] S. McLoone, G. Irwin, A variable memory quasi-Newton training algorithm, Neural Process. Lett. 9 (1999) 77–89. [12] A.J. Shepherd, Second-Order Methods for Neural Networks, Springer-Verlag, New York, Inc., 1997. [13] K. Levenberg, A method for the solution of certain problems in least squares, Q. Appl. Math. 2 (1944) 164–168. [14] D. Marquardt, An algorithm for least-squares estimation of nonlinear para- meters, SIAM J. Appl. Math. 11 (1963) 431–441. [15] Changhua Yu, Michael T. Manry, Jiang Li, Effects of nonsingular pre-processing on feed-forward network training, Int. J. Pattern Recognit. Artif. Intell. 19 (2) (2005) 217–247. [16] S. Ergezinger, E. Thomsen, An accelerated learning algorithm for multilayer perceptrons: optimization layer by layer, IEEE Trans. Neural Netw. 6 (1) (1995) 31–42. [17] M.T. Manry, S.J. Apollo, L.S. Allen, W.D. Lyle, W. Gong, M.S. Dawson, A.K. Fung, Fast training of neural networks for remote sensing, Remote Sens. Rev. 9 (1994) 77–96. [18] S.A. Barton, A matrix method for optimizing a neural network, Neural Comput. 3 (3) (1991) 450–459. [19] J. Olvera, X. Guan, M.T. Manry, Theory of monomial networks, in: Proceedings of SINS'92, Automation and Robotics Research Institute, Fort Worth, Texas, 1992, pp. 96–101. [20] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the convergence of the backpropagation algorithm using learning rate adaptation methods, Neural Comput. 11 (7) (1999) 1769–1796 (October 1). [21] Amit Bhaya, Eugenius Kaszkurewicz, Steepest descent with momentum for quadratic functions is a version of the conjugate gradient method, Neural Netw. 17 (1) (2004) 65–71. [22] S.S. Malalur, M.T. Manry, Multiple optimal learning factors for feed-forward networks, in: Proceedings of SPIE: Independent Component Analyses, Wave- lets, Neural Networks, Biosystems, and Nanoengineering VIII, vol. 7703, Orlando Florida, April 7–9, 2010, 77030F. [23] P. Jesudhas, M.T. Manry, R. Rawat, S. Malalur, Analysis and improvement of multiple optimal learning factors for feed-forward networks, in: Proceedings of International Joint Conference on Neural Networks (IJCNN), San Jose, California, USA, July 31–August 5, 2011, pp. 2593–2600. [24] W. Kaminski, P. Strumillo, Kernel orthonormalization in radial basis function neural networks, IEEE Trans. Neural Netw. 8 (5) (1997) 1177–1183. [25] R.P. Lippman, An introduction to computing with neural nets, IEEE ASSP Magazine (1987) (April). [26] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's method, Neural Comput. 4 (1992) 141–166. [27] S. Raudys, Statistical and Neural Classifiers: An Integrated Approach to Design, Springer-Verlag, London, 2001. [28] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Netw. 6 (1993) 525–533. [29] P.L. Narasimha, W.H. Delashmit, M.T. Manry, Jiang Li, and F. Maldonado, An integrated growing-pruning method for feed-forward network training, NeuroComputing 71 (2008) 2831–2847. [30] University of Texas at Arlington, Training Data Files – 〈http://www.uta.edu/ faculty/manry/new_mapping.html〉. [31] M.T. Manry, H. Chandrasekaran, C.-H. Hsieh, Signal Processing Applications of the Multilayer Perceptron, book chapter in Handbook of Neural Network Signal Processing, in: Yu Hen Hu, Jenq-Nenq Hwang (Eds.), CRC Press, Boca Raton, 2001. [32] US Census Bureau [〈http://www.census.gov〉] (under Lookup Access [〈http:// www.census.gov/cdrom/lookup〉]: Summary Tape File 1). [33] Bilkent University, Function Approximation Repository – 〈http://funapp.cs. bilkent.edu.tr/DataSets/〉. [34] University of Toronto, Delve Data Sets – 〈http://www.cs.toronto.edu/~delve/ data/datasets.html〉. [35] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2010 http://archive.ics.uci.edu/ml. [36] A.K. Fung, Z. Li, K.S. Chen, Back scattering from a randomly rough dielectric surface, IEEE Trans. Geosci. Remote. Sens. 30 (2) (1992). [37] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis., Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, vol. 47, Elsevier (2009) 547–553 (ISSN: 0167-9236). [38] A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests, IEEE Trans. Biomed. Eng. 57 (4) (2009) 884–893. Sanjeev S. Malalur received his M.S. and Ph.D. in Electrical Engineering from The University of Texas at Arlington, mentored by Prof. Michael T. Manry. For his doctoral dissertation, he developed a family of fast and robust training algorithms for the multi-layer percep- tron feed-forward neural networks. He developed a fingerprint verification system as a part of his Master's thesis. He is currently employed by Hunter Well Science where he continues to extend his research by developing techniques for modeling, analyzing and interpreting well log data. His research interests include machine learning algorithms, with an emphasis on neural networks, signal and image processing tech- niques, statistical pattern recognition and biometrics. He has authored several peer-reviewed publications in his field of research. Michael T. Manry is a professor at the University of Texas at Arlington (UTA) and his experience includes work in neural networks for the past 25 years. He is currently the director of the Image Processing and Neural Networks Lab (IPNNL) at UTA. He has published more than 150 papers and he is a senior member of the IEEE. He is an associate editor of Neurocomputing, and also a referee for several journals. His recent work, sponsored by the Advanced Technology Program of the state of Texas, Bell Helicopter Textron, E-Systems, Mobil Research, and NASA has involved the development of techniques for the analysis and fast design of neural networks for image processing, parameter estimation, and pattern classification. He has served as a consultant for the Office of Missile Electronic Warfare at White Sands Missile Range, MICOM (Missile Command) at Redstone Arsenal, NSF, Texas Instruments, Geophysics International, Halliburton Logging Services, Mobil Research, Williams-Pyro and Verity Instruments. Dr. Manry is President of Neural Decision Lab LLC, which is a member of the Arlington Technology Incubator. The company commercializes software developed by the IPNNL. Praveen Jesudhas received the B.E in Electrical Engineer- ing from Anna University, India in 2008 and the M.S. degree in Electrical Engineering from the University of Texas at Arlington, USA in 2010. In his university research, Praveen worked on problems in iris detection, iris feature extraction, and second order training algorithms for neural nets. As a Pattern Recognition Intern at FastVDO LLC in Maryland, he developed particle filter-based vehicle track- ing algorithms and Adaboost-based vehicle detection soft- ware. Currently he is employed at Egrabber, India as a Research Engineer, where he uses machine learning to solve problems related to automated online information retrieval. His current interests include neural networks, machine learning, natural language processing and big data analytics. He is a member of Tau Beta Pi. S.S. Malalur et al. / Neurocomputing 149 (2015) 1490–1501 1501