TFFN: Two Hidden Layer Feed Forward Network
using the randomness of Extreme Learning Machine
Nimai Chand Das Adhikari∗, Arpana Alka∗, Dr. Raju K George†
∗ Masters in Machine Learning and Computing, Indian Institute of Space Science and Technology, Trivandrum
†Dean, Indian Institute of Space Science and Technology, Trivandrum
Abstract—The slow learning speed of feed forward neural networks has been a major drawback in their applications for decades. The key reasons are the slow gradient-based learning algorithms extensively used to train these networks and the fact that all the network parameters are tuned iteratively by such algorithms. In order to overcome these pitfalls, a new learning algorithm known as the Extreme Learning Machine (ELM) was proposed. This algorithm computes a hidden-layer output matrix from randomly assigned input-to-hidden weights and randomly assigned biases. Unlike other feedforward networks, ELM has access to the whole training dataset before the output weights are computed. Here, we devise a new two-hidden-layer feedforward network (TFFN) for ELM, in which the weights and biases of both hidden layers are randomly assigned and the hidden-to-output weights are then calculated using the Moore-Penrose generalized inverse. TFFN does not restrict the algorithm to a fixed number of hidden neurons; rather, it searches the space of neuron combinations in the two hidden layers for an optimized result. This algorithm provides better generalization capability than its parent Extreme Learning Machine at an extremely fast learning speed. We have experimented with the algorithm on various types of datasets and against various popular algorithms, and we report a performance comparison.
Index Terms—Artificial Neural Networks, Extreme Learning
Machines, Generalized Inverse, Pseudo Inverse, Least Squares
Solution, Back propagation, Hidden Neurons, Randomness
I. INTRODUCTION
Back Propagation (BP) and its variants have played a dominant role in the training of feed forward neural networks. However, this algorithm faces several issues, such as convergence to local optima, non-trivial manual intervention, and time-consuming tuning of the parameters. Many researchers are therefore working to find a more efficient learning algorithm for the feed-forward neural network that consumes less training time. SVM, as an alternative to the feed-forward neural network, became popular when researchers believed there was no other scheme that could compensate for BP in the training of feed forward neural networks.
ELM was originally inspired by biological learning and was proposed to overcome the challenges and issues faced by the BP algorithms [1] [2]. From these biological learning features, it has been inferred that some parts of the brain should contain random neurons whose parameters are independent of the environment, and the resulting technique is known as ELM [3]. Its computer-based learning efficiency was verified as early as 2004, its universal approximation capability was rigorously proved in theory in 2006-2008, and evidence of the corresponding biological behaviour has appeared subsequently in the early twenty-first century [4]. Unlike other so-called randomness (or semi-randomness) based learning methods and networks, the hidden nodes in ELM are not only independent of the training data but also independent of each other. Although the hidden nodes are important and critical, they are not tuned as in other algorithms: they are randomly generated beforehand. Unlike conventional learning methods, which must see the training examples before the hidden nodes are generated, ELM can generate its hidden-node weights and biases before seeing the training examples.
In the subsequent sections we discuss the concepts behind ELM and propose a new network called TFFN (Two Hidden Layer Feed Forward Network) using the concepts and theorems behind its parent network, the Extreme Learning Machine. We show how the proposed architecture performs in comparison to some of the best known algorithms in this field.
II. LEARNING PRINCIPLES AND CONCEPTS
A. Concepts for ELM
Fig. 1. ELM-SLFN
This algorithm was first proposed for single-hidden-layer feed forward neural networks (SLFNs) and was then extended to generalized single-hidden-layer feed-forward networks in which the hidden nodes need not be neuron-like.
From the architecture point of view, the output function of ELM for the generalized SLFN can be written as:
f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta

Here, \beta = [\beta_1, \beta_2, ..., \beta_L]^T is the vector of output weights between the hidden layer of L nodes and the m \geq 1 output nodes, and h(x) = [h_1(x), h_2(x), ..., h_L(x)] is the output (row) vector of the hidden layer with respect to the input x [5]. h_i(x) is the output of the i-th hidden node, and the output functions of the hidden nodes need not be unique; different output functions may be used in different hidden neurons. In general, h_i(x) can be written as

h_i(x) = G(a_i, b_i, x), \quad a_i \in \mathbb{R}^d, \; b_i \in \mathbb{R}
Here G(a, b, x) is a nonlinear piece-wise continuous function which satisfies the ELM universal approximation capability theorem, discussed in the following sections [4]. Several nonlinear piece-wise continuous functions already defined in the literature are listed below:
1. Sigmoid function:
G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))}
2. Fourier function:
G(a, b, x) = \sin(a \cdot x + b)
3. Hard-limit function:
G(a, b, x) = \begin{cases} 1, & a \cdot x - b \geq 0 \\ 0, & \text{otherwise} \end{cases}
4. Gaussian function:
G(a, b, x) = \exp(-b\,||x - a||^2)
5. Multiquadric function:
G(a, b, x) = (||x - a||^2 + b^2)^{1/2}
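For concreteness, the hidden-node output functions G(a, b, x) listed above can be written as short NumPy routines. This is a minimal illustrative sketch with our own function names, not code from the original ELM implementations:

import numpy as np

# Illustrative NumPy versions of the hidden-node output functions G(a, b, x):
# a is a weight vector, b is a scalar bias, x is an input vector.

def sigmoid(a, b, x):
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + b)))

def fourier(a, b, x):
    return np.sin(np.dot(a, x) + b)

def hardlimit(a, b, x):
    return 1.0 if np.dot(a, x) - b >= 0 else 0.0

def gaussian(a, b, x):
    return np.exp(-b * np.linalg.norm(x - a) ** 2)

def multiquadric(a, b, x):
    return np.sqrt(np.linalg.norm(x - a) ** 2 + b ** 2)
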
The definitions and learning principles behind our proposed architecture are given below [5]:

Definition 1: A neuron (or node) [3] is called a random neuron (node) if all the parameters (e.g. a, b) in its output function G(a, b, x) are randomly generated based on a continuous sampling probability distribution.

Definition 2: A hidden layer output mapping h(x) [6] is said to be an ELM random feature mapping if all its hidden node parameters are randomly generated according to any continuous sampling probability distribution and such h(x) has universal approximation capability, that is,

||h(x)\beta - f(x)|| = \lim_{L \to \infty} \Big|\Big| \sum_{i=1}^{L} \beta_i h_i(x) - f(x) \Big|\Big| = 0

holds with probability 1 with appropriate output weights \beta.
According to Bartlett's neural network generalization theory [7], for feed-forward networks that reach a smaller training error, the smaller the norm of the weights, the better the generalization performance the network tends to have. We infer that the same may hold for generalized SLFNs in which the hidden neurons need not be neuron-like. Hence, from the learning point of view, ELM theory aims to reach both the smallest training error and the smallest norm of the output weights between the hidden nodes and the output nodes [4][8]:

\text{Minimize:} \quad ||\beta||_p^{\sigma_1} + C\,||H\beta - T||_q^{\sigma_2}

Here \sigma_1 > 0, \sigma_2 > 0, p, q = 0, 1/2, 1, 2, ..., +\infty, H is the hidden layer output matrix (i.e. the randomized matrix) and C is the regularization parameter [3]:
H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}

and T is the training data target matrix:

T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix} = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & \ddots & \vdots \\ t_{N1} & \cdots & t_{Nm} \end{bmatrix}
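As a concrete illustration, a minimal single-hidden-layer ELM fit can be written in a few lines of NumPy: the hidden layer output matrix H is built from random weights and biases, and the output weights are obtained with the Moore-Penrose pseudo-inverse. This is a sketch under our own naming conventions, not the authors' implementation:

import numpy as np

def elm_fit(X, T, L, rng=np.random.default_rng(0)):
    """Minimal single-hidden-layer ELM: X is (N, d) inputs, T is (N, m) targets."""
    d = X.shape[1]
    W = rng.standard_normal((d, L))          # random input-to-hidden weights a_i
    b = rng.standard_normal(L)               # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden layer output matrix H (N x L)
    beta = np.linalg.pinv(H) @ T             # beta = H^dagger T (minimum-norm least squares)
    # Regularized alternative in the spirit of [4]: beta = inv(I/C + H.T @ H) @ H.T @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
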
Let us now present some of the learning principles of ELM.
Learning Principle 1: Hidden neurons of SLFNs with almost any nonlinear piece-wise continuous activation function, or their linear combinations, can be generated randomly according to any continuous sampling probability distribution, and such hidden neurons can be independent of the training samples and of the learning environment.
According to this theory, feature mappings h(x) that can approximate any continuous target function can be used in ELM. Activation functions such as the sigmoid used in artificial neural networks are an oversimplified model of brain neurons and may be very different from the real ones. The actual activation function of a real brain neuron is unknown and may be impossible to determine exactly, but it can be assumed to be nonlinear piece-wise continuous [3].
Hence, Learning Principle 1 of ELM may be widely adopted in some brain learning mechanisms without the need of knowing the actual activation function of living brain neurons.
B. Pseudo-inverse: Moore-Penrose generalized inverse
Let us consider an n \times n system of linear equations as given by [9][10]:

Ax = b, \quad A \in M_{n,n}, \; b \in \mathbb{R}^n

The above system has a unique solution if and only if the matrix A has full rank [10]. In that case the (unique) value of x is given by

x = A^{-1} b

Now let us consider an m \times n system of linear equations

Ax = b, \quad A \in M_{m,n}, \; b \in \mathbb{R}^m
Then two cases arise:
Case 1: If m > n, the system is over-determined. In general such a system has no exact solution, i.e. there is no x \in \mathbb{R}^n such that

Ax = b, \quad \text{or} \quad b - Ax = 0.

When there is no exact solution, the residual is written as

r(x) = b - Ax, \quad x \in \mathbb{R}^n,

and we look for a vector x \in \mathbb{R}^n that minimizes

||r(x)|| = ||b - Ax||.

Let us state some definitions and theorems related to this topic.
Definition 1. A vector x that minimizes ||r(x)||_2 is called a least-squares solution to the system defined above [11]. The least-squares solution x which has the minimum 2-norm is called the minimum norm least-squares solution [12]. In other words, if z is any other least-squares solution to the system Ax = b, then

||x||_2 < ||z||_2.

We will see later that the minimum norm least-squares solution to the over-determined system is given by

x = A^{\dagger} b,

where A^{\dagger} is the pseudo-inverse of A.
C. Generalized Inverse
If A is any matrix, there exists a generalized inverse A^{-} such that [10]

A A^{-} A = A.

This is extrapolated from the notion that a matrix has at least a one-sided inverse. Let A^{-} be equal to either L or R (the left-sided and right-sided inverse, respectively). Then

A L A = A(LA) = AI = A, \quad A R A = (AR)A = IA = A.

If A is an n \times m matrix, A^{-} is an m \times n matrix, and the resulting identity matrix has rank equal to the number of columns or rows. It is known that when m = n and rank(A) = n, then A^{-} = A^{-1}. Generalized inverses have many properties, but the most important one here is that the generalized inverse A^{-} is not unique.
1) Moore-Penrose Inverse: In this section we define the pseudo-inverse A^{\dagger} of an m \times n matrix A and illustrate how it can be computed.
Definition 2. Let A be any real m \times n matrix. Then the pseudo-inverse of A is the n \times m matrix X (written A^{\dagger}) satisfying the following Moore-Penrose conditions:
(MP1) AXA = A
(MP2) XAX = X
(MP3) (AX)^T = AX
(MP4) (XA)^T = XA
A useful property is (A^T)^{\dagger} = (A^{\dagger})^T.
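The four Moore-Penrose conditions are easy to check numerically. The sketch below is our own illustration, using NumPy's np.linalg.pinv on a random matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # an m x n matrix with m > n
X = np.linalg.pinv(A)                    # candidate pseudo-inverse A^dagger (n x m)

# Verify the four Moore-Penrose conditions.
print(np.allclose(A @ X @ A, A))         # (MP1) AXA = A
print(np.allclose(X @ A @ X, X))         # (MP2) XAX = X
print(np.allclose((A @ X).T, A @ X))     # (MP3) AX is symmetric
print(np.allclose((X @ A).T, X @ A))     # (MP4) XA is symmetric
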
D. Minimum Norm Least-Squares Solution
A vector x which minimizes ||r(x)||_2 is called a least-squares solution to the system described above. The least-squares solution x which has the minimum 2-norm is called the minimum norm least-squares solution, i.e. if z is any other least-squares solution to the system Ax = b, then

||x||_2 < ||z||_2.

This way of finding least-squares solutions for a linear system Ax = b is called the linear least-squares problem [10][1][2][4]. If the system Ax = b is over-determined and A has full rank, it has a unique least-squares solution x obtained by solving the normal system

A^T A x = A^T b.

However, the matrix A^T A is frequently ill-conditioned and is influenced by rounding errors: when A is over-determined and has full rank,

\kappa_2(A^T A) = \frac{\sigma_1^2}{\sigma_n^2} = [\kappa_2(A)]^2,

where \sigma_1 and \sigma_n are the largest and smallest singular values of A.
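A quick numerical illustration of this point (our own example, not from the paper): forming A^T A squares the condition number, whereas an SVD-based least-squares routine or the pseudo-inverse avoids this.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A[:, 4] = A[:, 0] + 1e-4 * rng.standard_normal(100)   # make A nearly rank-deficient
b = rng.standard_normal(100)

print(np.linalg.cond(A))             # kappa_2(A)
print(np.linalg.cond(A.T @ A))       # roughly kappa_2(A)^2: much worse conditioning

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations (ill-conditioned)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # SVD-based least-squares solve
print(np.linalg.norm(x_normal - x_lstsq))         # discrepancy due to conditioning
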
E. Basic Least-Squares Theorem
Let us now state the theorem most important for our proposed work [10].
Theorem 1: Consider the linear system

Ax = b,

where A is a real m \times n matrix with m \geq n and b \in \mathbb{R}^m. Then:
(a) The linear system has a unique least-squares solution x if and only if A has full rank.
(b) The linear system has infinitely many least-squares solutions if and only if A is rank-deficient.
(c) The minimum norm least-squares solution to the system is given by

x = A^{\dagger} b.
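Part (c) of the theorem is what ELM relies on when computing the output weights. The following sketch (our own example, not from the paper) shows that, for a rank-deficient system with infinitely many least-squares solutions, the pseudo-inverse selects the one with minimum 2-norm:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]             # rank-deficient: column 3 depends on columns 0 and 1
b = rng.standard_normal(6)

x_min = np.linalg.pinv(A) @ b           # minimum-norm least-squares solution x = A^dagger b

# Any null-space direction of A can be added without changing the residual
# (up to rounding), but it strictly increases the 2-norm of the solution.
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]                        # direction spanning the null space of A
x_other = x_min + 0.5 * null_dir

print(np.linalg.norm(A @ x_min - b), np.linalg.norm(A @ x_other - b))  # same residual
print(np.linalg.norm(x_min), np.linalg.norm(x_other))                  # x_min has smaller norm
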
III. CONCEPT OF TFFN: TWO HIDDEN LAYERS
FEEDFORWARD NETWORK
We apply the Extreme Learning Machine (ELM) scheme to two-hidden-layer feed-forward neural networks (TFFNs): the hidden node weights and biases are chosen randomly, and the output weights \beta are then determined analytically. In theory, this algorithm also tends to provide good generalization performance at an extremely fast learning speed, like SLFNs, in comparison to multilayer Back Propagation. Experimental results on a few artificial and real benchmark function-approximation and classification problems, including large and complex applications, show that the new algorithm produces good generalization performance in most cases and learns faster than the conventional popular learning algorithms for feed-forward neural networks.
Fig. 2. Two Hidden Layer Feedforward ELM

Similarly, if two hidden layers are used instead of one, with the weights and biases generated randomly according to the theories and theorems discussed in the previous sections, the results are promising and sometimes even better than those of SLFNs. Here we present a new concept that takes ELM beyond its single hidden layer. Similar to the SLFN ELM, after the input weights and the hidden layer biases are chosen arbitrarily, a TFFN can simply be considered as a linear system, and the output weights of the TFFN can be determined analytically through a generalized inverse operation on the hidden layer output matrices [4].
A. Proposal
As described above, the concepts behind a two-hidden-layer feed forward network trained with ELM, instead of an SLFN, are presented here. The theory is the same as for the SLFN: the weight matrices and bias vectors of the hidden layers are randomly generated. Consider a structure having two hidden layers with \hat{N}_1 and \hat{N}_2 nodes respectively, and let the training data available to the algorithm be N arbitrary distinct samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T \in \mathbb{R}^n and t_i = [t_{i1}, t_{i2}, ..., t_{im}]^T \in \mathbb{R}^m. Then we can model the system mathematically as:

\sum_{i=1}^{\hat{N}_2} \beta_i\, g_2\Big( W_i \cdot \Big( \sum_{j=1}^{\hat{N}_1} g_1(w_j \cdot x_k + b_j) \Big) + b_i \Big) = o_k, \quad \forall k = 1, ..., N

Here, N is the number of training examples, and w_j = [w_{j1}, w_{j2}, ..., w_{jn}]^T and W_i = [W_{i1}, W_{i2}, ..., W_{i\hat{N}_1}]^T are the randomly generated weight vectors of the first and second hidden layers. Also, \beta_i = [\beta_{i1}, \beta_{i2}, ..., \beta_{im}] are the output weights of the system. The above system can then be written compactly as

H\beta = T

where H is the hidden layer output matrix of the system, generated as

H = \begin{bmatrix} g_2(w_{j1} \cdot x_1 + b_1) & \cdots & g_2(w_{j\hat{N}_1} \cdot x_1 + b_{\hat{N}_2}) \\ \vdots & \ddots & \vdots \\ g_2(w_{j1} \cdot x_{\hat{N}_1} + b_1) & \cdots & g_2(w_{j\hat{N}_1} \cdot x_{\hat{N}_1} + b_{\hat{N}_2}) \end{bmatrix}_{\hat{N}_1 \times \hat{N}_2}

Here the hidden layers are obtained from the equivalent hidden layers of the SLFN, except that there are two of them instead of one. The computation time is therefore somewhat larger than for the SLFN, since the number of hidden layers is greater.
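A minimal NumPy sketch of this idea follows. It is our own illustration (sigmoid activations for g1 and g2, and our own variable names), not the authors' code: both hidden layers use random weights and biases, and only the output weights beta are computed, via the pseudo-inverse of the second hidden layer's output matrix.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tffn_fit(X, T, n1, n2, rng=np.random.default_rng(0)):
    """Two-hidden-layer feed forward network trained ELM-style.
    X: (N, n) inputs, T: (N, m) targets, n1/n2: hidden neurons per layer."""
    n = X.shape[1]
    W1 = rng.standard_normal((n, n1));  b1 = rng.standard_normal(n1)   # random first hidden layer
    W2 = rng.standard_normal((n1, n2)); b2 = rng.standard_normal(n2)   # random second hidden layer
    H1 = sigmoid(X @ W1 + b1)          # first hidden layer output (N x n1)
    H2 = sigmoid(H1 @ W2 + b2)         # second hidden layer output matrix H (N x n2)
    beta = np.linalg.pinv(H2) @ T      # beta = H^dagger T
    return (W1, b1, W2, b2, beta)

def tffn_predict(X, params):
    W1, b1, W2, b2, beta = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) @ beta
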
IV. SINGLE-HIDDEN-LAYER FEED FORWARD NETWORKS
VERSUS MULTI-HIDDEN-LAYER FEED FORWARD
NETWORKS
It is difficult to deal with multiple hidden layers in ELM directly without first understanding the single-hidden-layer case. Thus, over the past ten years, most ELM work has focused on generalized single-hidden-layer feed forward networks (SLFNs). The concepts behind TFFNs are the same as those behind SLFNs. In a TFFN, two ELM mappings can be regarded as running together: one from the input layer through the first hidden layer to the second hidden layer, and another from the first hidden layer through the second hidden layer to the output layer. Hence the concepts remain the same as for the SLFN ELM.
The theorems and theories behind the TFFN are as given below [1][2][3][4][12][13]:
Theorem 2 (Universal approximation capability): Given any bounded non-constant piece-wise continuous function as the activation function of the hidden neurons, if by tuning the parameters of the hidden neuron activation function SLFNs can approximate any continuous target function, then for any continuous target function f(x) and any randomly generated function sequence \{h_i(x)\}_{i=1}^{L},

\lim_{L \to \infty} \Big|\Big| \sum_{i=1}^{L} \beta_i h_i(x) - f(x) \Big|\Big| = 0

holds with probability one with appropriate output weights \beta.
Classification capability: Similar to the approximation capability theorem for single-hidden-layer feed forward networks, it can be proved that SLFNs with a hidden-layer mapping h(x) satisfying the universal approximation condition also have universal classification capability.
Definition 1: A closed set is called a region, regardless of whether it is bounded or not.
Lemma 1: Given disjoint regions K_1, K_2, ..., K_m in \mathbb{R}^d, the corresponding m arbitrary real values c_1, c_2, ..., c_m, and an arbitrary region X disjoint from every K_i, there exists a continuous function f(x) such that f(x) = c_i if x \in K_i and f(x) = c_0 if x \in X, where c_0 is any arbitrary real value different from c_1, c_2, ..., c_m.
Now we can state the Classification Capability theorem.
Theorem 3 (Classification Capability Theorem): Given a feature mapping h(x), if h(x)\beta is dense in C(\mathbb{R}^d), or in C(M) where M is a compact set of \mathbb{R}^d, then a generalized SLFN with such a hidden-layer mapping h(x) can separate arbitrary disjoint regions of any shape in \mathbb{R}^d or M.
Thus, according to the above theorems, it is a necessary and sufficient condition that the feature mapping h(x) be chosen so that h(x)\beta has the capability of approximating any continuous target function. If h(x)\beta cannot approximate every continuous target function, then there may exist some regions which cannot be separated by any classifier with such a feature mapping h(x). Also, when the dimensionality of the feature mapping is large, the output of the classifier h(x)\beta will be as close to the class labels of the corresponding regions as possible.
In the binary classification case, ELM uses only a single output node, and the class label closer to the output value of the ELM is the predicted class label of the input data.
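For instance, with class labels encoded as -1 and +1 and a single output node, the decision rule described above amounts to choosing the label nearer to the real-valued network output, i.e. thresholding at zero. A small sketch with our own function name, assuming the outputs have already been computed:

import numpy as np

def predict_label(y):
    """y: real-valued outputs of the single output node, one per sample (labels -1/+1)."""
    return np.where(np.asarray(y) >= 0.0, 1, -1)   # +1 if the output is closer to +1, else -1
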
A. Algorithm
Given a training set, activation functions g_1(x) and g_2(x), and hidden neuron numbers \hat{N}_1 and \hat{N}_2:
1) Input the data into the model.
2) Divide the data into training and validation samples. The algorithm uses randomized search to obtain the optimized hyperparameters (learning rate, regularization parameter of the cost function, numbers of hidden neurons and hidden biases), as sketched after this list.
• For the training set, assign arbitrary input weights w_1 and w_2 and arbitrary biases B_1 and B_2.
• Calculate the output weight \beta as:
\beta = H^{\dagger} T
3) Pass the network with the optimized hyperparameter setting over the validation samples.
4) Calculate the output:
H\beta = T
Here H, \beta and T are as described in the previous sections.
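A compact sketch of this training-plus-search loop is given below. It is our own illustration: the search ranges, the one-hot target encoding and the accuracy metric are assumptions rather than the authors' exact settings.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_and_score(Xtr, Ttr, Xva, yva, n1, n2, rng):
    """Train a TFFN with (n1, n2) hidden neurons and return validation accuracy.
    Ttr holds one-hot targets; yva holds integer class labels."""
    W1 = rng.standard_normal((Xtr.shape[1], n1)); b1 = rng.standard_normal(n1)
    W2 = rng.standard_normal((n1, n2));           b2 = rng.standard_normal(n2)
    H2 = sigmoid(sigmoid(Xtr @ W1 + b1) @ W2 + b2)
    beta = np.linalg.pinv(H2) @ Ttr                        # beta = H^dagger T
    Hva = sigmoid(sigmoid(Xva @ W1 + b1) @ W2 + b2)
    pred = np.argmax(Hva @ beta, axis=1)                   # one-hot outputs -> class index
    return np.mean(pred == yva)

def random_search(Xtr, Ttr, Xva, yva, n_trials=50, seed=0):
    """Randomized search over the numbers of hidden neurons in both layers."""
    rng = np.random.default_rng(seed)
    best = (-1.0, None)
    for _ in range(n_trials):
        n1, n2 = rng.integers(10, 200), rng.integers(10, 200)
        acc = fit_and_score(Xtr, Ttr, Xva, yva, n1, n2, rng)
        best = max(best, (acc, (n1, n2)))
    return best     # (best validation accuracy, (n1, n2))
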
V. RESULTS AND DISCUSSION
TABLE I
PERFORMANCE COMPARISON FOR GLASS DATASET
Algorithm Testing Accuracy (%) Training Time (secs)
SVC-Sigmoid 66.154 0.058
SVC-rbf 63.077 0.0623
Logistic Regression L1 67.6923 0.0235
Logistic Regression L2 66.154 0.0354
Decision Trees 61.54 0.25
Random Forest 78.462 3.98
ELM-SLFN 74.2115 0.044677
TFFN 75.46929 0.16709
Bagging 80 1.35
MLP 71.24 3.87
A comparison is made between the TFFN, an ELM with two hidden layers in which the weight matrices from the input layer to the first hidden layer and from the first to the second hidden layer, as well as the biases of both hidden layers, are generated randomly, and the ELM with an SLFN. The dataset taken for this comparison is the Glass dataset. For all algorithms, 70% of the data is used for training and 30% for testing (validation). The testing accuracy of the ELM-SLFN is found to be 74.21%, whereas that of the TFFN is 75.47%. From the above results, Bagging with decision trees and optimized search gives a better accuracy of 80%, but its training time is much larger than that of the ELM-based methods, whereas our proposed algorithm reaches about 75% validation accuracy in roughly 0.17 seconds of training. These comparisons were made after extensive testing: each reading was taken 20 times and the average is listed above. If training time alone is considered, the ELM-SLFN has the shortest training time among the neural approaches.
In Table II, for the Pima Indians Diabetes dataset, we can see that our proposed algorithm performs well both in terms of validation accuracy and training time. Its parent algorithm, the ELM-SLFN, achieves an accuracy of 76.78% in about 0.4 seconds, compared with 77.66% for the TFFN. For all the algorithms, random search and grid search were used to arrive at the optimized hyperparameter settings before running them on the validation part.
TABLE II
PERFORMANCE COMPARISON FOR PIMA-INDIANS-DIABETES DATASET
Algorithm Testing Accuracy (%) Training Time (secs)
SVC-Sigmoid 65.4 0.92
SVC-rbf 73.6 0.786
SVC-Linear 77.6518 0.6563
Logistic Regression L1 73.6 0.0678
Logistic Regression L2 73.16 0.0987
Decision Trees 64.07 0.78
Random Forest 72.73 9.98
ELM-SLFN 76.78 0.4052
TFFN 77.66 0.86709
Bagging 75.32 10.5
MLP 73.16 15.23
Fig. 3. ELM vs TFFN results for different datasets

In Fig. 3 we compare the testing (validation) performance of ELM-SLFN and ELM-TFFN on the Hepatitis, Diabetes, Haberman, Dermatology and Fertility datasets. For some of the datasets the performance of the TFFN is markedly better, while for the others it is comparable. The main demerit is the training time, which is about 5 times that of the ELM-SLFN. Thus, along with its merits, the approach also has some demerits.
VI. CONCLUSION
The randomness in the TFFN avoids the iterations needed to optimize the parameters in a multilayer perceptron. Apart from avoiding these iterations, the TFFN achieves better accuracy on most of the standard datasets in the literature. Thus, besides being very fast to learn, the network optimizes its output parameters to obtain better accuracy. The demerit of this architecture is that it cannot give priority to particular training data; in addition, the randomness can constrain the algorithm to move in only one direction of learning.
VII. REFERENCES
[1] Huang, Gao, et al. ”Trends in extreme learning ma-
chines: A review.” Neural Networks 61 (2015): 32-48.
[2] Cambria, Erik, et al. ”Extreme learning machines [trends
& controversies].” IEEE Intelligent Systems 28.6 (2013): 30-
59.
[3] Huang, Guang-Bin. ”An insight into extreme learning
machines: random neurons, random features and kernels.”
Cognitive Computation 6.3 (2014): 376-390.
[4] Huang, Guang-Bin, et al. ”Extreme learning machine for
regression and multiclass classification.” IEEE Transactions
on Systems, Man, and Cybernetics, Part B (Cybernetics) 42.2
(2012): 513-529.
[5] Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew.
”Extreme learning machine: theory and applications.” Neuro-
computing 70.1 (2006): 489-501.
[6] Funahashi, Ken-Ichi. ”On the approximate realization of
continuous mappings by neural networks.” Neural networks
2.3 (1989): 183-192.
[7] Anthony, Martin, and Peter L. Bartlett. Neural network
learning: Theoretical foundations. cambridge university press,
2009.
[8] Huang, Guang-Bin, and Lei Chen. ”Convex incremental
extreme learning machine.” Neurocomputing 70.16 (2007):
3056-3062.
[9] Albert, Arthur. Regression and the Moore-Penrose pseu-
doinverse. Elsevier, 1972.
[10] Penrose, Roger. ”A generalized inverse for matrices.”
Mathematical proceedings of the Cambridge philosophical
society. Vol. 51. No. 3. Cambridge University Press, 1955.
[11] Golub, Gene H., and Charles F. Van Loan. ”An analysis of
the total least squares problem.” SIAM Journal on Numerical
Analysis 17.6 (1980): 883-893.
[12] Lawson, Charles L., and Richard J. Hanson. Solving
least squares problems. Society for Industrial and Applied
Mathematics, 1995.
[13] Huang, Guang-Bin, et al. ”Extreme learning machine for
regression and multiclass classification.” IEEE Transactions
on Systems, Man, and Cybernetics, Part B (Cybernetics) 42.2
(2012): 513-529.

More Related Content

What's hot

ARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORKSARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORKSAIMS Education
 
Artificial neural networks (2)
Artificial neural networks (2)Artificial neural networks (2)
Artificial neural networks (2)sai anjaneya
 
A neuro fuzzy decision support system
A neuro fuzzy decision support systemA neuro fuzzy decision support system
A neuro fuzzy decision support systemR A Akerkar
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptronomaraldabash
 
Ppt on artifishail intelligence
Ppt on artifishail intelligencePpt on artifishail intelligence
Ppt on artifishail intelligencesnehal_gongle
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognitionHưng Đặng
 
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logic
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logicApproximate bounded-knowledge-extractionusing-type-i-fuzzy-logic
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logicCemal Ardil
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_ASrimatre K
 

What's hot (17)

Perceptron
PerceptronPerceptron
Perceptron
 
ARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORKSARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORKS
 
Artificial neural networks (2)
Artificial neural networks (2)Artificial neural networks (2)
Artificial neural networks (2)
 
A neuro fuzzy decision support system
A neuro fuzzy decision support systemA neuro fuzzy decision support system
A neuro fuzzy decision support system
 
02 Fundamental Concepts of ANN
02 Fundamental Concepts of ANN02 Fundamental Concepts of ANN
02 Fundamental Concepts of ANN
 
Neural network
Neural networkNeural network
Neural network
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptron
 
03 Single layer Perception Classifier
03 Single layer Perception Classifier03 Single layer Perception Classifier
03 Single layer Perception Classifier
 
Ppt on artifishail intelligence
Ppt on artifishail intelligencePpt on artifishail intelligence
Ppt on artifishail intelligence
 
Lecture artificial neural networks and pattern recognition
Lecture   artificial neural networks and pattern recognitionLecture   artificial neural networks and pattern recognition
Lecture artificial neural networks and pattern recognition
 
B021106013
B021106013B021106013
B021106013
 
Ffnn
FfnnFfnn
Ffnn
 
Ann
Ann Ann
Ann
 
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logic
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logicApproximate bounded-knowledge-extractionusing-type-i-fuzzy-logic
Approximate bounded-knowledge-extractionusing-type-i-fuzzy-logic
 
06 neurolab python
06 neurolab python06 neurolab python
06 neurolab python
 
ML_Unit_2_Part_A
ML_Unit_2_Part_AML_Unit_2_Part_A
ML_Unit_2_Part_A
 

Similar to TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme Learning Machine

High performance extreme learning machines a complete toolbox for big data a...
High performance extreme learning machines  a complete toolbox for big data a...High performance extreme learning machines  a complete toolbox for big data a...
High performance extreme learning machines a complete toolbox for big data a...redpel dot com
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
Extreme learning machine:Theory and applications
Extreme learning machine:Theory and applicationsExtreme learning machine:Theory and applications
Extreme learning machine:Theory and applicationsJames Chou
 
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RUnderstanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RManish Saraswat
 
Cognitive Science Unit 4
Cognitive Science Unit 4Cognitive Science Unit 4
Cognitive Science Unit 4CSITSansar
 
14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer PerceptronAndres Mendez-Vazquez
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Akash Goel
 
Soft Computing-173101
Soft Computing-173101Soft Computing-173101
Soft Computing-173101AMIT KUMAR
 
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...cscpconf
 
Neural Networks-introduction_with_prodecure.pptx
Neural Networks-introduction_with_prodecure.pptxNeural Networks-introduction_with_prodecure.pptx
Neural Networks-introduction_with_prodecure.pptxRatuRumana3
 
M7 - Neural Networks in machine learning.pdf
M7 - Neural Networks in machine learning.pdfM7 - Neural Networks in machine learning.pdf
M7 - Neural Networks in machine learning.pdfArushiKansal3
 
lecture07.ppt
lecture07.pptlecture07.ppt
lecture07.pptbutest
 
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchPPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchJisang Yoon
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementIOSR Journals
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeESCOM
 

Similar to TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme Learning Machine (20)

High performance extreme learning machines a complete toolbox for big data a...
High performance extreme learning machines  a complete toolbox for big data a...High performance extreme learning machines  a complete toolbox for big data a...
High performance extreme learning machines a complete toolbox for big data a...
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Extreme learning machine:Theory and applications
Extreme learning machine:Theory and applicationsExtreme learning machine:Theory and applications
Extreme learning machine:Theory and applications
 
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RUnderstanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
 
Cognitive Science Unit 4
Cognitive Science Unit 4Cognitive Science Unit 4
Cognitive Science Unit 4
 
14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron
 
Nueral fuzzy system.pptx
Nueral fuzzy system.pptxNueral fuzzy system.pptx
Nueral fuzzy system.pptx
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders
 
N ns 1
N ns 1N ns 1
N ns 1
 
Soft Computing-173101
Soft Computing-173101Soft Computing-173101
Soft Computing-173101
 
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
DESIGN AND IMPLEMENTATION OF BINARY NEURAL NETWORK LEARNING WITH FUZZY CLUSTE...
 
19_Learning.ppt
19_Learning.ppt19_Learning.ppt
19_Learning.ppt
 
Neural Networks-introduction_with_prodecure.pptx
Neural Networks-introduction_with_prodecure.pptxNeural Networks-introduction_with_prodecure.pptx
Neural Networks-introduction_with_prodecure.pptx
 
M7 - Neural Networks in machine learning.pdf
M7 - Neural Networks in machine learning.pdfM7 - Neural Networks in machine learning.pdf
M7 - Neural Networks in machine learning.pdf
 
lecture07.ppt
lecture07.pptlecture07.ppt
lecture07.ppt
 
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchPPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
 
Artificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In ManagementArtificial Neural Networks: Applications In Management
Artificial Neural Networks: Applications In Management
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
 
Nn devs
Nn devsNn devs
Nn devs
 
B42010712
B42010712B42010712
B42010712
 

More from Nimai Chand Das Adhikari

HPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningHPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningNimai Chand Das Adhikari
 
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...Nimai Chand Das Adhikari
 
HPPS : Heart Problem Prediction System using Machine Learning
HPPS : Heart Problem Prediction System using Machine LearningHPPS : Heart Problem Prediction System using Machine Learning
HPPS : Heart Problem Prediction System using Machine LearningNimai Chand Das Adhikari
 
An Intelligent Approach to Demand Forecasting
An Intelligent Approach to Demand ForecastingAn Intelligent Approach to Demand Forecasting
An Intelligent Approach to Demand ForecastingNimai Chand Das Adhikari
 
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Nimai Chand Das Adhikari
 
An intelligent approach to demand forecasting
An intelligent approach to demand forecastingAn intelligent approach to demand forecasting
An intelligent approach to demand forecastingNimai Chand Das Adhikari
 

More from Nimai Chand Das Adhikari (10)

HPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningHPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine Learning
 
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
 
Block Chain understanding
Block Chain understandingBlock Chain understanding
Block Chain understanding
 
Face detection Using Computer Vision
Face detection Using Computer VisionFace detection Using Computer Vision
Face detection Using Computer Vision
 
HPPS : Heart Problem Prediction System using Machine Learning
HPPS : Heart Problem Prediction System using Machine LearningHPPS : Heart Problem Prediction System using Machine Learning
HPPS : Heart Problem Prediction System using Machine Learning
 
An Intelligent Approach to Demand Forecasting
An Intelligent Approach to Demand ForecastingAn Intelligent Approach to Demand Forecasting
An Intelligent Approach to Demand Forecasting
 
Credit defaulter analysis
Credit defaulter analysisCredit defaulter analysis
Credit defaulter analysis
 
Image Stitching for Panorama View
Image Stitching for Panorama ViewImage Stitching for Panorama View
Image Stitching for Panorama View
 
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
 
An intelligent approach to demand forecasting
An intelligent approach to demand forecastingAn intelligent approach to demand forecasting
An intelligent approach to demand forecasting
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme Learning Machine

  • 1. 1 TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme Learning Machine Nimai Chand Das Adhikari∗, Arpana Alka∗, Dr. Raju K George† ∗ Masters in Machine Learning and Computing, Indian Institute of Space Science and Technology, Trivandrum †Dean, Indian Institute of Space Science and Technology, Trivandrum Abstract—The learning speed of the feed forward neural network takes a lot of time to be trained which is a major drawback in their applications since the past decades. The key reasons behind may be due to the slow gradient-based learning algorithms which are extensively used to train the neural networks or due to the parameters in the networks which are tuned iteratively using some learning algorithms. Thus, in order to eradicate the above pitfalls, a new learning algorithm was proposed known as Extreme Learning Machines (ELM). This algorithm tries to compute Hidden-layer-output matrix that is made of randomly assigned input layer and hidden layer weights and randomly assigned biases. Unlike the other feedforward networks, ELM has the access of the whole training dataset before going into the computation part. Here, we have devised a new two-layer-feedforward network (TFFN) for ELM in a new manner with randomly assigning the weights and biases in both the hidden layers, which then calculates the output-hidden layer weights using the Moore-Penrose generalized inverse. TFFN doesn’t restricts the algorithm to fix the number of hidden neurons that the algorithm should have. Rather it searches the space which gives an optimized result in the neurons combination in both the hidden layers. This algorithm provides a good generalization capability than the parent Extreme Learning Machines at an extremely fast learning speed. Here, we have experimented the algorithm on various types of datasets and various popular algorithm to find the performances and report a comparison. Index Terms—Artificial Neural Networks, Extreme Learning Machines, Generalized Inverse, Pseudo Inverse, Least Squares Solution, Back propagation, Hidden Neurons, Randomness I. INTRODUCTION Back Propagation and its variants have played a dominant role in the training of the Feed Forward Neural Networks. But there are several issues such as local optimum, trivial manual intervention and time consuming in training the parameters which this algorithm faces. Although many researchers are working to find out a more efficient learning algorithm for the feed-forward neural network which even consumes less time in training. SVM as an alternative solution to the FF- NN, became somewhat popular when researchers thought that there wasn’t any other neural network to compensate for the BP in the training of the Feed Forward Neural Networks. ELM was originally inspired from the biological learning and was proposed to overcome the challenges and the issues that is faced by the BP algorithms [1] [2]. By taking the background of the biological learning features, it has been inferred that some part of the brain systems should have the random neurons with all the parameters independent of the en- vironments and the resultant technique known to be ELMs [3]. Its computer-based learning efficiency was verified as early as in 2004, its universal approximation capability was rigorously proved in theory in 2006−2008, and its concrete biological behavior is seemed to be subsequently appear in early twenty first century [4]. 
Unlike the other so-called randomness (semi randomness) based learning methods/ networks, the hidden nodes in ELM are not only independent of the training data but also are independent of each other. Although the hidden nodes are important and critical in this case these are not tuned as in the case of other algorithms. These hidden nodes are randomly generated beforehand. Unlike all the others conventional learning methods, this learning method must see the training examples even before the hidden nodes are generated. ELM also generates the weight matrix before seeing the training examples. In the subsequent sections we will discuss upon the concepts behind the ELM and propose a new network called TFFN (Two Hidden Layer Feed Forward Network) using the concepts and theorems behind its parent network Extreme Learning Machines. We will show how this proposed archi- tecture performs in comparison to some of the best known algorithms in this field. II. LEARNING PRINCIPLES AND CONCEPTS A. Concepts for ELM Fig. 1. ELM-SLFN This algorithm was first proposed for the single-layer feed forward neural networks (SLFNs) and was then extended to the generalized single- hidden layer feed-forward networks in which the hidden layer need not be a neuron like. Considering the architecture point of view, the output function
  • 2. 2 of the ELM for the generalized SLFN can be as deduced below: fL(x) = L i=1 βihi(x) = h(x)β Here, the β = [β1, β2, ..., βL]T is a vector for the output weights between the hidden layer of L nodes to the m ≥ 1 output nodes, and also the h(x) = [h1(x), h2(x), ..., hL(x)] is the output vector of the hidden layers with respect to the input x [5]. Also we remember that the above hidden matrix hL(x) is the row vectors. hi(x) is the output of the i−th hidden node output, and the output functions of hidden nodes may not be unique. We might be using different output functions in many different hidden neurons. In general, hi(x) can mostly be: hi(x) = G(ai, bi, x), ai ∈ d , bi ∈ This G(a, b, x) is a non-piece-wise continuous function which fulfills the ELM universal approximation capability theorem which will be discussed thoroughly in the upcoming sections [4]. We will give a brief note about the different non linear piece-wise continuous functions that are already defined in the literature: 1. Sigmoid Function: G(a, b, x) = 1 1 + exp(−(a.x + b)) 2. Fourier function: G(a, b, x) = sin(a.x + b) 3. Hardlimit function G(a, b, x) = 1 a.x − b ≥ 0 0 otherwise 4. Gaussian function G(a, b, x) = exp(−b||x − a||2 ) 5. Multiquadrics function G(a, b, x) = (||x − a||2 b2 )1/2 Below are the definitions and the learning principles behind out proposing architecture [5]: Definition 1:A neuron(or node)[3] is called a random neuron(node) if all its parameters(e.g, a,b) in its output function G(a,b,x) are randomly generated based on a continuous sampling Distribution probability. Definition 2:A hidden layer output mapping h(x) [6] is said to be an ELM random feature mapping if all its hidden node parameters are randomly generated according to any continuous sampling distribution probability and such h(x) has universal approximation capability, that is, ||h(x)β − f(x)|| = limL⇒∞|| L i=1 βihi(x) − f(x)|| = 0 holds with the probability 1 with appropriate output weights β. If we take into the account the Barlett’s neural network generalization theory [7], for the feed-forward neural networks for reaching the smaller training error, then the smaller the norms of the weights are, the better generalization performance the network tend s to have. Thus we infer that it might be true with the generalized SLFNs where the hidden neurons may not be neuron alike. Hence if we consider the learning point of view of the ELM, then ELM’s theory aims to reach the smallest error in the training part as well as the smallest norm of the output weights between the hidden node and the output node [4][8]: Minimize : ||β||σ1 p + C||Hβ − T||σ2 q Here σ1 > 0, σ2 > 0, p, q = 0, 1/2, 1, 2, ..., +∞, H is the hidden layer output matrix(i.e randomized matrix) and C is the regularized parameter [3]: H =       h(x1) . . . h(xN )       =       h1(x1) . . . hL(x1) . . . . . . . . . . . . . . . h1(xN ) . . . hL(xN )       and T is the training data target matrix: T =       tT 1 . . . tT N       =       t11 . . . t(1m) . . . . . . . . . . . . . . . t1(N1) . . . t(Nm)       Now let us present the some of the learning rules of ELM: Learning Principle 1 Hidden neurons of SLFNs with almost any nonlinear piece-wise continuous activation functions or their linear combinations can be generated randomly in accordance to any continuous sampling probability distribution, and such hidden neurons can be independent of training samples and also its learning environment. 
According to the theory, the use of feature mappings h(x) can be used in ELM for which it can approximate any of the continuous target functions. Activation functions like sigmoid function which are used in the artificial neural networks are an oversimplified modeling of brain neurons and may be very much different from what they might be. But it is true that
  • 3. 3 the actual activation function of a real brain is unknown. The exact activation function of a live brain neuron may be impossible to know. But it can be assumed that the original function (activation) might be nonlinear piece-wise continuous [3]. Hence, Learning Principle I of ELM may be widely adopted in some brain learning mechanism without the need of knowing the actual activation function of living brain neurons. B. Pseudo-inverse: Moore-Penrose generalized inverse Let us consider a system nXn of linear system as given by [9][10]: Ax = b, A ∈ Mnn, b ∈ n The above system will have a unique solution iff the matrix A has a full rank [10]. In that case the value of x (in case it is unique) will be given by x = A−1 b Now let us consider a system of m X n of linear equations Ax = b, (A ∈ Mmn, b ∈ m ) Then there are two cases arising: Case 1: If m > n, the system is over-determined. In these kinds of systems, there is no solution. In other words, there is no such x ∈ n such that Ax = b, or b − Ax = 0 When there is no exact solution, the residual can be written as r(x) = b − Ax, x ∈ n and try to find a vector x ∈ n for which ||r(x)|| = ||b − Ax|| Let us define some definitions and theorems related to describe this topic. Definition 1. A vector x that minimizes ||r(x)||2 is called a least-squares solution to the system defined above [11]. The least squares solution x which has the minimum 2-norm is called the minimum norm least squares solution [12]. Let us now give an example on the understanding of this concept. Let ’y’ is any other least square solution to the system Ax = b, then to satisfy the above definition ||x||2 < ||z||2 Now we will be proving later that the minimum norm least- squares solution to the over-determined system is given by x = A† b Here A† is the pseudo-inverse of A. C. Generalized Inverse If A is any matrix, there is a generalized inverse, A− such that [10], AA− A = A Now, this equation is an extrapolated from the conjuncture that any matrix has at-least one-sided inverse. Let A− is equal to either L or R (i.e Left and Right sided inverse respectively). Then, ALA = A(LA) = AI = A, ARA = (AR)A = IA = A If A is a n X m matrix , A− is then a m X n matrix and the resultant identity matrix either has the rank equal to columns or rows. It is known that when m = n and when rank(A) = n then A− = A−1 . There are many properties of this but the most important of all those is that the generalized inverses A− are not unique. 1) Moore-Penrose Inverse: In this section we will be defining the pseudo-inverse A† of an m x n matrix A, and illustrate how we can compute it using the various methods. Definition 2.Let A be any real m x n matrix. Then the pseudo-inverse of A is an n x m matrix X (instead of calling it A† satisfying the following Moore-Penrose conditions: (MP1) AXA = A (MP2) XAX = X (MP3) (AX)’ = AX (MP4) (XA)’ = XA (AT )† = (A† )T D. Minimum Norm Least-Squares Solution A vector x which minimizes ||r(x)||2 is called a least- squares solution to the system described above. Also the least- squares solution x which has the minimum 2-norm is called the minimum norm least squares solution, i.e. if we say that z us any other least squares solution to the above system Ax = b, then we must have ||x||2 < ||z||2 Hence this way of finding the least-squares solutions for the linear system like Ax = b is called the linear least-squares problem [10][1][2][4]. 
We now establish a result for the case in which the system $Ax = b$ is over-determined and of full rank: it then has a unique least-squares solution $x$, obtained by solving the normal system
$$A^{T}Ax = A^{T}b.$$
However, the matrix $A^{T}A$ is very frequently ill-conditioned and sensitive to rounding errors. Indeed, when $A$ is over-determined and of full rank,
$$\kappa_2(A^{T}A) = \frac{\alpha_1^{2}}{\alpha_n^{2}} = [\kappa_2(A)]^{2},$$
where $\alpha_1$ and $\alpha_n$ denote the largest and smallest singular values of $A$.
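The squaring of the condition number is easy to verify numerically. The short sketch below (an illustration we add here; the construction of the test matrix via its SVD is our own choice) shows that $\kappa_2(A^{T}A) \approx [\kappa_2(A)]^{2}$, which is why the pseudo-inverse route is preferred over forming the normal equations explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a moderately ill-conditioned over-determined matrix from a chosen SVD.
m, n = 50, 5
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.logspace(0, -4, n)            # singular values from 1 down to 1e-4
A = U @ np.diag(sigma) @ V.T

kappa_A = np.linalg.cond(A)              # kappa_2(A)     = alpha_1 / alpha_n
kappa_AtA = np.linalg.cond(A.T @ A)      # kappa_2(A^T A) = alpha_1^2 / alpha_n^2

print(f"kappa(A)      = {kappa_A:.3e}")
print(f"kappa(A^T A)  = {kappa_AtA:.3e}")
print(f"kappa(A)**2   = {kappa_A**2:.3e}")   # matches kappa(A^T A) up to rounding
```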
E. Basic Least-Squares Theorem

We now state the theorem most important for the proposed work [10].

Theorem 1: Consider a linear system
$$Ax = b,$$
where $A$ is a real $m \times n$ matrix with $m \ge n$ and $b \in \mathbb{R}^m$. Then:
(a) the system has a unique least-squares solution $x$ iff $A$ has full rank;
(b) the system has infinitely many least-squares solutions iff $A$ is rank-deficient;
(c) the minimum norm least-squares solution is given by $x = A^{\dagger}b$.

III. CONCEPT OF TFFN: TWO HIDDEN LAYERS FEEDFORWARD NETWORK

We propose Extreme Learning Machines (ELM) for two-hidden-layer feed-forward neural networks (TFFNs), which randomly choose the hidden-node weights and biases and then analytically determine the output weights $\beta$. In theory, this algorithm also tends to provide good generalization performance at an extremely fast learning speed, like SLFNs, in comparison with multilayer Back Propagation. Experimental results on a few artificial and real benchmark function-approximation and classification problems, including very large and complex applications, show that the new algorithm can produce good generalization performance in most cases and can learn faster than the conventional popular learning algorithms for feed-forward neural networks.

Fig. 2. Two Hidden Layer Feedforward ELM

Similarly, if two hidden layers are used instead of one, with the weights and biases randomly generated according to the theories and theorems discussed in the previous sections, the results are promising and sometimes even better than those of SLFNs. Here we present a new concept that takes ELM beyond its single hidden layer. As in SLFN ELMs, after the input weights and hidden-layer biases are chosen arbitrarily, a TFFN can be treated as a linear system, and its output weights can be analytically determined through a simple generalized-inverse operation on the hidden-layer output matrices [4].

A. Proposal

As described above, the idea of a two-hidden-layer feed-forward network using ELM, instead of an SLFN, is developed here. The theory is the same as for the SLFN: the weight and bias matrices of both hidden layers are randomly generated. Consider a structure with two hidden layers containing $\hat{N}_1$ and $\hat{N}_2$ nodes, and let the training set consist of $N$ arbitrary distinct samples $(x_k, t_k)$, where $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T \in \mathbb{R}^n$ and $t_k = [t_{k1}, t_{k2}, \ldots, t_{km}]^T \in \mathbb{R}^m$. The system can then be modelled mathematically as
$$\sum_{i=1}^{\hat{N}_2} \beta_i\, g_2\!\big(W_i \cdot h^{(1)}(x_k) + b_i\big) = o_k, \qquad k = 1, \ldots, N,$$
where $h^{(1)}(x_k) = [g_1(w_1 \cdot x_k + b_1), \ldots, g_1(w_{\hat{N}_1} \cdot x_k + b_{\hat{N}_1})]^T$ is the output of the first hidden layer and $N$ is the number of training examples. Here $w_j = [w_{j1}, w_{j2}, \ldots, w_{jn}]^T \in \mathbb{R}^n$ and $W_i = [W_{i1}, W_{i2}, \ldots, W_{i\hat{N}_1}]^T \in \mathbb{R}^{\hat{N}_1}$ are the randomly generated weight vectors of the two hidden layers, and $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]$ are the output weights. The system can then be written as
$$H\beta = T,$$
where $H$ is the (second) hidden-layer output matrix,
$$H = \begin{bmatrix} g_2\big(W_1 \cdot h^{(1)}(x_1) + b_1\big) & \cdots & g_2\big(W_{\hat{N}_2} \cdot h^{(1)}(x_1) + b_{\hat{N}_2}\big) \\ \vdots & \ddots & \vdots \\ g_2\big(W_1 \cdot h^{(1)}(x_N) + b_1\big) & \cdots & g_2\big(W_{\hat{N}_2} \cdot h^{(1)}(x_N) + b_{\hat{N}_2}\big) \end{bmatrix}_{N \times \hat{N}_2}.$$
The hidden layers are built exactly like the equivalent hidden layer of an SLFN, except that there are two of them instead of one; consequently, the computation time is somewhat larger than that of the SLFN.
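The following NumPy sketch illustrates the forward model of the Proposal: random first- and second-layer weights and biases produce the second-hidden-layer output matrix $H$, and the output weights follow as $\beta = H^{\dagger}T$. The layer sizes, the sigmoid activations for $g_1$ and $g_2$, the Gaussian initialization, and the placeholder data are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)

N, n, m = 200, 8, 3                      # samples, input dim, output dim
N1_hat, N2_hat = 40, 30                  # neurons in hidden layers 1 and 2

X = rng.standard_normal((N, n))          # training inputs  x_k
T = rng.standard_normal((N, m))          # training targets t_k (placeholder data)

# Randomly assigned parameters -- never tuned iteratively, as in ELM.
W1 = rng.standard_normal((n, N1_hat));      b1 = rng.standard_normal(N1_hat)
W2 = rng.standard_normal((N1_hat, N2_hat)); b2 = rng.standard_normal(N2_hat)

H1 = sigmoid(X @ W1 + b1)                # first hidden-layer output, N x N1_hat
H  = sigmoid(H1 @ W2 + b2)               # second hidden-layer output matrix H, N x N2_hat

beta = np.linalg.pinv(H) @ T             # output weights: beta = H† T
print("training error:", np.linalg.norm(H @ beta - T))
```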
IV. SINGLE-HIDDEN-LAYER FEED FORWARD NETWORKS VERSUS MULTI-HIDDEN-LAYER FEED FORWARD NETWORKS

It is difficult to deal with multi-hidden-layer ELM directly without first understanding single-hidden-layer ELM. Thus, over the past ten years most ELM work has focused on generalized single-hidden-layer feed-forward networks (SLFNs). The concepts behind TFFNs are the same as those behind SLFNs. In a TFFN, two ELM mappings operate in sequence: one from the input layer through the first hidden layer to the second hidden layer, and another from the first hidden layer through the second hidden layer to the output layer. Hence the concepts are the same as for SLFN ELM. The theorems and theory behind the TFFN are given below [1][2][3][4][12][13]:
Theorem 2 (Universal approximation capability): Given any bounded non-constant piece-wise continuous function as the activation function of the hidden neurons, if SLFNs can approximate any continuous target function by tuning the parameters of the hidden-neuron activation function, then for any continuous target function $f(x)$ and any randomly generated function sequence $\{h_i(x)\}_{i=1}^{L}$,
$$\lim_{L \to \infty} \Big\| \sum_{i=1}^{L} \beta_i h_i(x) - f(x) \Big\| = 0$$
holds with probability one for appropriate output weights $\beta$.

Classification capability: Analogously to the approximation-capability theorem for single-hidden-layer feed-forward neural networks, classification capability can be proved for SLFNs whose hidden-layer mapping $h(x)$ satisfies the universal approximation condition.

Definition 1: A closed set is called a region, regardless of whether it is bounded or not.

Lemma 1: Given disjoint regions $K_1, K_2, \ldots, K_m$ in $\mathbb{R}^d$, corresponding arbitrary real values $c_1, c_2, \ldots, c_m$, and an arbitrary region $X$ disjoint from every $K_i$, there exists a continuous function $f(x)$ such that $f(x) = c_i$ if $x \in K_i$ and $f(x) = c_0$ if $x \in X$, where $c_0$ is any real value different from $c_1, c_2, \ldots, c_m$.

We can now state the Classification Capability theorem.

Theorem 3 (Classification Capability Theorem): Given a feature mapping $h(x)$, if $h(x)\beta$ is dense in $C(\mathbb{R}^d)$ or in $C(M)$, where $M$ is a compact set in $\mathbb{R}^d$, then a generalized SLFN with such a hidden-layer mapping $h(x)$ can separate arbitrary disjoint regions of any shape in $\mathbb{R}^d$ or $M$.

Thus, according to the above theorems, it is a necessary and sufficient condition that the feature mapping $h(x)$ be chosen so that $h(x)\beta$ can approximate any continuous target function. Conversely, if $h(x)\beta$ cannot approximate every continuous target function, there may exist regions of certain shapes that cannot be separated by any classifier with such a feature mapping $h(x)$. Moreover, when the dimensionality of the feature mapping is large, the classifier output $h(x)\beta$ will be as close to the class labels of the corresponding regions as possible. In the binary classification case, ELM uses only a single output node, and the class label closer to the output value of the ELM is the predicted class label of that input.

A. Algorithm

Given a training set, activation functions $g_1(x)$ and $g_2(x)$, and hidden-neuron numbers $\hat{N}_1$ and $\hat{N}_2$:
1) Input the data into the model.
2) Divide the data into training and validation samples. The algorithm uses randomized search to obtain the optimized hyperparameters (learning rate, regularization parameter of the cost function, hidden neurons and hidden biases).
   • For the training set, assign arbitrary input weights $w_1$ and $w_2$ and biases $B_1$ and $B_2$.
   • Calculate the output weights as $\beta = H^{\dagger}T$.
3) Pass the network with the optimized hyperparameter setting to the validation samples.
4) Calculate the output from $H\beta = T$, where $H$, $\beta$ and $T$ are as described in the previous sections. A sketch of these steps is given below.
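The sketch below is a hedged, simplified reading of the algorithm above: split the data, draw random candidate hidden-neuron counts (a stand-in for the full randomized hyperparameter search described in step 2), train each candidate analytically by $\beta = H^{\dagger}T$, and keep the setting with the best validation accuracy. The helper names (`tffn_fit`, `tffn_predict`), the toy data, and the search ranges are ours, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tffn_fit(X, T, n1, n2, rng):
    """Randomly assign both hidden layers, then solve beta = H† T analytically."""
    W1 = rng.standard_normal((X.shape[1], n1)); b1 = rng.standard_normal(n1)
    W2 = rng.standard_normal((n1, n2));         b2 = rng.standard_normal(n2)
    H = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    beta = np.linalg.pinv(H) @ T
    return (W1, b1, W2, b2, beta)

def tffn_predict(X, model):
    W1, b1, W2, b2, beta = model
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) @ beta

rng = np.random.default_rng(3)

# Toy two-class data with one-hot targets (stand-in for a real dataset).
X = rng.standard_normal((300, 6))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)
T = np.eye(2)[y]

# 70% / 30% train-validation split, as used in the experiments.
idx = rng.permutation(len(X)); cut = int(0.7 * len(X))
tr, va = idx[:cut], idx[cut:]

best = None
for _ in range(20):                                  # randomized search over (N1_hat, N2_hat)
    n1, n2 = rng.integers(10, 200), rng.integers(10, 200)
    model = tffn_fit(X[tr], T[tr], n1, n2, rng)
    acc = np.mean(tffn_predict(X[va], model).argmax(axis=1) == y[va])
    if best is None or acc > best[0]:
        best = (acc, n1, n2, model)

print(f"best validation accuracy {best[0]:.3f} with N1={best[1]}, N2={best[2]}")
```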
V. RESULTS AND DISCUSSION

TABLE I
PERFORMANCE COMPARISON FOR THE GLASS DATASET

Algorithm                 Testing Accuracy (%)   Training Time (s)
SVC-Sigmoid               66.154                 0.058
SVC-rbf                   63.077                 0.0623
Logistic Regression L1    67.6923                0.0235
Logistic Regression L2    66.154                 0.0354
Decision Trees            61.54                  0.25
Random Forest             78.462                 3.98
ELM-SLFN                  74.2115                0.044677
TFFN                      75.46929               0.16709
Bagging                   80                     1.35
MLP                       71.24                  3.87

We compare an ELM with two hidden layers, in which the weight matrices from the input layer to the first hidden layer and from the first to the second hidden layer, as well as the biases of both hidden layers, are generated randomly, against the SLFN ELM. The dataset used for the comparison is the Glass dataset. For all algorithms, 70% of the data is used for training and 30% for testing (validation). The testing accuracy of the SLFN is 74.21%, whereas that of the TFFN is 75.47%. From the results above, Bagging with decision trees and optimized search gives a better accuracy of 80%, but its training time is much larger, whereas our proposed algorithm reaches about 75% validation accuracy in 0.167 seconds. Both sets of comparisons were made after extensive testing: each reading was taken 20 times and the averages are listed above. If training time is the criterion, ELM-SLFN still has the better training time of the two ELM variants.

In Table II, for the Pima Indians Diabetes dataset, our proposed algorithm performs well both in validation accuracy and in training time. Its parent algorithm, ELM-SLFN, reaches an accuracy of 76.78% in 0.4 seconds, compared with 77.66% for TFFN. For all algorithms we used randomized search and grid search to arrive at the optimized hyperparameter settings used in the validation part.
TABLE II
PERFORMANCE COMPARISON FOR THE PIMA INDIANS DIABETES DATASET

Algorithm                 Testing Accuracy (%)   Training Time (s)
SVC-Sigmoid               65.4                   0.92
SVC-rbf                   73.6                   0.786
SVC-Linear                77.6518                0.6563
Logistic Regression L1    73.6                   0.0678
Logistic Regression L2    73.16                  0.0987
Decision Trees            64.07                  0.78
Random Forest             72.73                  9.98
ELM-SLFN                  76.78                  0.4052
TFFN                      77.66                  0.86709
Bagging                   75.32                  10.5
MLP                       73.16                  15.23

Fig. 3. ELM vs TFFN results for different datasets

Figure 3 compares the testing (validation) performance of ELM-SLFN and ELM-TFFN on the Hepatitis, Diabetes, Haberman, Dermatology and Fertility datasets. For some of the datasets the performance of TFFN is markedly better, while for the others it is comparable. Its only demerit is the training time, which is about five times that of ELM-SLFN. Thus, along with its merits, the method also has some demerits.

VI. CONCLUSION

The randomness in the TFFN avoids the iterative parameter optimization required by the multilayer perceptron. Besides avoiding these iterations, the TFFN achieves better accuracy on most of the standard datasets in the literature. Thus, apart from being very fast in learning, this network determines its parameters so as to obtain better accuracy. Its demerits are that it cannot give priority to particular parts of the training data, and that the randomness can restrict the algorithm to a single direction of learning.

VII. REFERENCES

[1] Huang, Gao, et al. "Trends in extreme learning machines: A review." Neural Networks 61 (2015): 32-48.
[2] Cambria, Erik, et al. "Extreme learning machines [trends & controversies]." IEEE Intelligent Systems 28.6 (2013): 30-59.
[3] Huang, Guang-Bin. "An insight into extreme learning machines: random neurons, random features and kernels." Cognitive Computation 6.3 (2014): 376-390.
[4] Huang, Guang-Bin, et al. "Extreme learning machine for regression and multiclass classification." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42.2 (2012): 513-529.
[5] Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew. "Extreme learning machine: theory and applications." Neurocomputing 70.1 (2006): 489-501.
[6] Funahashi, Ken-Ichi. "On the approximate realization of continuous mappings by neural networks." Neural Networks 2.3 (1989): 183-192.
[7] Anthony, Martin, and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[8] Huang, Guang-Bin, and Lei Chen. "Convex incremental extreme learning machine." Neurocomputing 70.16 (2007): 3056-3062.
[9] Albert, Arthur. Regression and the Moore-Penrose Pseudoinverse. Elsevier, 1972.
[10] Penrose, Roger. "A generalized inverse for matrices." Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 51, No. 3. Cambridge University Press, 1955.
[11] Golub, Gene H., and Charles F. Van Loan. "An analysis of the total least squares problem." SIAM Journal on Numerical Analysis 17.6 (1980): 883-893.
[12] Lawson, Charles L., and Richard J. Hanson. Solving Least Squares Problems. Society for Industrial and Applied Mathematics, 1995.
[13] Huang, Guang-Bin, et al. "Extreme learning machine for regression and multiclass classification." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42.2 (2012): 513-529.