The learning speed of feed-forward neural networks is far slower than desired, and this has been a major drawback in their applications for decades. There are two key reasons:
1. The gradient-based learning algorithms extensively used to train the networks are slow.
2. All the parameters of the network are tuned iteratively by these learning algorithms.
To overcome these drawbacks, a new learning algorithm called the Extreme Learning Machine (ELM) was proposed. The algorithm applies to single-hidden-layer feed-forward networks (SLFNs): it randomly initializes the input weights and biases of the hidden nodes and then analytically calculates the output weights. It provides good generalization performance at an extremely fast learning speed. In this thesis we experiment with the algorithm on various types of datasets and compare its performance against other popular algorithms.
We have devised a two-hidden-layer feed-forward network for ELM, in which the weights and biases of both hidden layers are assigned randomly. We have also studied ELM autoencoders and experimented with them thoroughly on various datasets and deep networks.
Finally, we have applied ELM to recommender systems to build a new music app that recommends songs to the user based on their listening history.
1. EXTREME LEARNING MACHINES
Nimai Chand Das Adhikari
INDIAN INSTITUTE OF SPACE SCIENCE AND TECHNOLOGY
Advisor:
Dr. Raju K George
Dean (R & D and Student Welfare)
6th June, 2016
2. Overview
1 Introduction
2 Brief Outline of the Thesis
3 Concepts Used
4 Theorems Used
5 Extreme Learning Machine
6 Extreme Learning Machines - Two Hidden Layer Feed Forward Network
7 Extreme Learning Machines - Auto Encoders
8 Hierarchical Extreme Learning Machines
9 Recommender Systems for building a music app using ELM
3. Introduction
Neural networks are mainly trained on a training set (x_i, t_i), i = 1, ..., N, where N is the number of training examples fed to the network.
1 An SLFN with at most N hidden nodes and any non-linear activation function can learn N distinct observations with zero error.
2 Gradient-based learning has persisted to a huge extent and has been the sole basis of the learning algorithms for feed-forward neural networks.
3 The drawbacks of this method, i.e. slow learning due to improper learning steps and convergence to local minima, have called for a change in methodology. Apart from this, many iterative steps are required to reach an optimized result.
4. Brief Outline of the thesis
1 A brief description of the concepts behind ELM.
2 A new approach to ELM using two hidden layers instead of the single hidden layer on which ELM is based.
3 ELM Autoencoders and Hierarchical ELM, which involve two basic stages:
1 unsupervised learning for feature extraction.
2 basic ELM for classification in the last layer.
4 Concepts and ideas behind building a new music app using recommender systems and neural networks.
5 Conclusions and Future Works.
5. Single Layer Feed Forward Neural Networks
SLFNs with arbitrarily chosen input weights can learn N distinct observations with a very small error.
Tuning the input weights and hidden biases is not required.
The system becomes linear, and the output weights can be easily calculated using the generalized inverse.
The learning speed becomes extremely fast.
The algorithm we study here is the Extreme Learning Machine (ELM).
6. Single Layer Feed Forward Neural Networks
Figure: Single Layer Feed Forward Neural Networks
7. Concepts Used
Moore Penrose generalized Inverse
Least Squares Solution
Random Features Mappings and Kernels
9. Moore Penrose Generalized Inverse
Consider an n × n linear system:
Ax = b, A ∈ M_{n×n}, b ∈ R^n
This system has a unique solution iff the matrix A is of full rank. Then
x = A^{-1} b
Again, let the system be
Ax = b, A ∈ M_{m×n}, b ∈ R^m
There are two cases:
m > n: the system is overdetermined
m < n: the system is underdetermined
10. Moore Penrose Generalized Inverse: Over Determined
Systems
In this case, there is no x ∈ R^n such that
Ax = b, i.e. b − Ax = 0.
Hence, there is no exact solution, and the residual for this system is
r(x) = b − Ax, x ∈ R^n.
Therefore we seek a vector x for which
||r(x)|| = ||b − Ax||
is minimum.
11. Moore Penrose Generalized Inverse: Definitions
A vector which minimizes ||r(x)||_2 is a least squares solution of the system defined on the previous slide. The least squares solution x which has the minimum 2-norm is called the minimum norm least squares solution.
12. Moore penrose Inverse
Let A be any m × n matrix then the pseudo inverse of A is an n × m
matrix A† which satisfies the following Moore-Penrose conditions:
(MP1) AA†A = A
(MP2) A†AA† = A†
(MP3) (AA†)T = AA†
(MP4) (A†A)T = A†A
Theorem:
Let A be any real m × n matrix. Then
the pseudoinverse of A is unique
(A†)† = A
(A^T)† = (A†)^T
13. Moore Penrose Inverse - How to compute?
If A is an n × n non-singular matrix, then
A† = A^{-1}
If A is an m × n real matrix with m ≥ n and full rank, i.e. rank(A) = n, then
A† = (A^T A)^{-1} A^T
If A is an m × n real matrix with m ≤ n and full rank, i.e. rank(A) = m, then
A† = A^T (A A^T)^{-1}
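As a quick illustration, the sketch below (assuming NumPy is available) computes the pseudoinverse of a made-up full-column-rank matrix with the formula above and checks it against numpy.linalg.pinv and the four Moore-Penrose conditions.

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 x 2, full column rank

# Full-column-rank formula: A† = (A^T A)^{-1} A^T
A_pinv = np.linalg.inv(A.T @ A) @ A.T

# Compare against NumPy's SVD-based pseudoinverse
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Check the four Moore-Penrose conditions
assert np.allclose(A @ A_pinv @ A, A)            # MP1
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)  # MP2
assert np.allclose((A @ A_pinv).T, A @ A_pinv)   # MP3
assert np.allclose((A_pinv @ A).T, A_pinv @ A)   # MP4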
14. Moore Penrose Inverse - Examples
Consider
A = [1, 2]^T
Its pseudoinverse is
A† = [1/5, 2/5]
which satisfies all four conditions above, while another inverse A^{-L} = [3, −1] does not satisfy the fourth condition.
Note:
1. A matrix that satisfies the first two conditions is called a generalized inverse.
2. Uniqueness is established by the last two conditions.
16. Least Squares Solution
Two different cases arise when solving the linear system
Ax = b, A ∈ M_{m×n}, x ∈ R^n, b ∈ R^m
When m > n, the linear system is called overdetermined.
When m < n, the linear system is called underdetermined.
A system is solvable iff
rank(A) = rank(A|b)
If there is no exact solution for the linear system, we form the residual
r(x) = b − Ax, x ∈ R^n
and then seek a vector x ∈ R^n for which ||r(x)||_2 = ||b − Ax||_2 is minimum.
17. Minimum Norm Least Squares Solution
The least-squares solution x which has the minimum 2-norm is called the minimum norm least squares solution, i.e. if z is any other least squares solution to the system Ax = b, then we must have
||x||_2 < ||z||_2
18. Basic Least Squares Theorem
Let there be a linear system
Ax = b
with A a real m × n matrix, m ≥ n, and b ∈ R^m. Then
1 The linear system has a unique least-squares solution x iff A has full rank.
2 The linear system has infinitely many least-squares solutions iff A is rank-deficient.
3 The minimum norm least-squares solution to the system is given by
x = A† b
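A minimal NumPy sketch of point 3, using a made-up underdetermined system: the pseudoinverse solution A† b coincides with the minimum-norm least-squares solution returned by numpy.linalg.lstsq.

import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])    # 2 x 3 system with infinitely many solutions
b = np.array([1.0, 2.0])

x_pinv = np.linalg.pinv(A) @ b                      # minimum norm least-squares solution x = A† b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # lstsq also returns the minimum-norm solution

assert np.allclose(x_pinv, x_lstsq)
print(np.linalg.norm(A @ x_pinv - b), np.linalg.norm(x_pinv))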
20. Random Features Mappings and Kernels
Random Feature Mappings: The hidden layer output vector h(x) = [h_1(x), ..., h_L(x)] can be computed even when all its hidden node parameters are randomly generated according to any continuous sampling probability distribution, and h(x) then has the universal approximation capability, i.e.
lim_{L→∞} ||Σ_{i=1}^{L} β_i h_i(x) − f(x)|| = 0
holds with probability 1 for appropriate output weights. Thus,
h(x) = [G(a_1, b_1, x), ..., G(a_L, b_L, x)]
where G(a, b, x) is a non-linear piecewise function that satisfies the ELM universal approximation capability theorem.
21. Random Features Mappings and Kernels
Kernels: Instead of h(x) we can apply a kernel matrix in ELM:
Ĥ = H H^T, with Ĥ_{i,j} = h(x_i) · h(x_j) = K(x_i, x_j)
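The sketch below is a small NumPy illustration of these two views (not code from the thesis): a sigmoid random feature map h(x) with randomly drawn a_i, b_i, the implied matrix Ĥ = H Hᵀ, and an RBF kernel as one possible explicit choice of K. All sizes and distributions are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
n, L, N = 5, 20, 10                      # input dim, hidden nodes, samples
X = rng.normal(size=(N, n))              # toy input data

# Random feature mapping: h(x) = [G(a_1, b_1, x), ..., G(a_L, b_L, x)]
a = rng.normal(size=(n, L))              # random input weights
b = rng.normal(size=L)                   # random biases
H = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # sigmoid activation, H is N x L

H_hat = H @ H.T                          # kernel matrix implied by the random features

# Alternatively, use an explicit kernel K(x_i, x_j), e.g. an RBF kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)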
22. Random Features Mappings and Kernels
Feature Mapping Matrix: h_i(x) denotes the output of the i-th hidden node with regard to the input x. The feature mapping matrix is formed without reference to the targets t_i; it is reasonable for the feature mapping matrix to be independent of the target values.
Figure: ELM Feature Mapping
23. Theorems
Universal Approximation Capability
Given any bounded non-constant piecewise continuous function as the activation function of the hidden neurons, if SLFNs can approximate any target continuous function by tuning the parameters of the hidden neuron activation function, then for any continuous target function f(x) and any randomly generated function sequence [h_i(x)], i = 1, ..., L,
lim_{L→∞} ||Σ_{i=1}^{L} β_i h_i(x) − f(x)|| = 0
holds with probability 1 for appropriate output weights β.
Classification Capability Theorem
Given a feature mapping h(x), if h(x)β is dense in C(R^d) or in C(M), where M is a compact set in R^d, then a generalized SLFN with such a hidden layer mapping h(x) can separate arbitrary disjoint regions of any shape in R^d or M.
24. Extreme Learning Machines-Introduction
Gradient-descent-based methods have mainly been the backbone of feed-forward neural network training.
In this case, the parameters of the feed-forward neural network need to be tuned, and the time taken for learning is very large.
In the case of SLFNs, if the input weights and hidden biases are randomly assigned, the system becomes linear and can be computed through a simple generalized inverse operation.
26. Extreme Learning Machines - SLFNs
Output of the hidden nodes
G(a_i, b_i, x) = g(a_i · x + b_i)
where a_i is the weight vector connecting the i-th hidden node and the input nodes, and b_i is the threshold of the i-th hidden node.
Output of the network
f_L(x) = Σ_{i=1}^{L} β_i G(a_i, b_i, x)
where G(·) is the activation function and L is the number of hidden nodes.
27. Extreme Learning Machine - Mathematical Model
Mathematical Model
Σ_{i=1}^{L} β_i G(a_i, b_i, x_j) = t_j, j = 1, ..., N, which is equivalent to Hβ = T, where
H(a_1, ..., a_L, b_1, ..., b_L, x_1, ..., x_N) =
[ G(a_1, b_1, x_1) ... G(a_L, b_L, x_1) ]
[ ...               ...                ]
[ G(a_1, b_1, x_N) ... G(a_L, b_L, x_N) ]   (N × L)
β = [β_1^T, ..., β_L^T]^T (L × m) and T = [t_1^T, ..., t_N^T]^T (N × m).
Here H is the hidden-layer output matrix.
28. Extreme Learning Machines - Mathematical and Learning
Models
Mathematical Model
Any continuous target function f(x) can be approximated by SLFNs. Thus, given a small positive value ε, for an SLFN with a sufficient number of hidden nodes L we have
||f_L(x) − f(x)|| < ε
Learning Model
For N arbitrary distinct samples (x_i, t_i) ∈ R^n × R^m, an SLFN with L hidden nodes and activation function g(x) is mathematically modeled as
f_L(x_j) = o_j, j = 1, ..., N
Cost function: E = Σ_{j=1}^{N} ||o_j − t_j||^2
29. Extreme Learning Machines - Learning Model
Learning Model
The target is to minimize the cost function E by adjusting the network parameters β_i, a_i, b_i.
If the error is zero, then
f_L(x) = f(x) = T, where T is the known target, and the cost function E = 0.
30. Extreme Learning Machines - Algorithm
Three-Step Learning Model
Given a training set S = {(x_i, t_i) | x_i ∈ R^n, t_i ∈ R^m, i = 1, ..., N}, an activation function G and the number of hidden nodes L:
1 Randomly assign the input weight vectors a_i and the hidden node biases b_i, i = 1, ..., L.
2 Calculate the hidden layer output matrix H.
3 Calculate the output weight β:
β = H† T
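Below is a minimal NumPy sketch of these three steps for a target matrix T (illustrative only; the sigmoid activation, the random normal initialization and the toy data are assumptions, not the exact setup used in the thesis experiments).

import numpy as np

def elm_train(X, T, L, rng=np.random.default_rng(0)):
    """Train a basic ELM: X is N x n, T is N x m, L is the number of hidden nodes."""
    n = X.shape[1]
    a = rng.normal(size=(n, L))                 # step 1: random input weights
    b = rng.normal(size=L)                      # step 1: random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))      # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                # step 3: output weights via Moore-Penrose inverse
    return a, b, beta

def elm_predict(X, a, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))
    return H @ beta

# Toy usage with made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
T = np.sin(X).sum(axis=1, keepdims=True)        # arbitrary regression target
a, b, beta = elm_train(X, T, L=40)
print(np.mean((elm_predict(X, a, b, beta) - T) ** 2))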
31. Extreme Learning Machines - Performance Comparison
Algorithm Training Rate Testing Rate Training Time (secs)
ELM 78.06 76.78 0.405290
SVM-linear 78.76 77.6518 0.6563
BP 77.88 73.04348 15.2361
Table: Performance Comparison for the Diabetes dataset
Figure: Performance on the Diabetes dataset
32. Extreme Learning Machines - Performance Comparison
DATASET Testing Rate Training Time (secs)
Glass 74.2115 0.044677
Hepatitis 70.380 0.037706
Breast Cancer 83.3333 0.024485
Table: Performance comparison for various datasets using ELM
33. Extreme Learning Machines - Performance Comparison
DATASET Testing Rate Training Time (secs)
Glass 37.4129 10.548456
Hepatitis 48.3585 7.4125
Breast Cancer 77.609 2.3564
Table: Performance comparison for various datasets using BP
34. Extreme Learning Machines - Performance Comparison
DATASET Testing Rate Training Time (secs)
Glass 34.5750 0.2354677
Hepatitis 63.0435 0.0632
Breast Cancer 66.67 0.2261
Table: Performance comparison for various datasets using SVM
36. Extreme Learning Machines - Performance Comparison:
SinC function
Figure: Performance on the SinC function
Figure: Comparison of the training time
38. Extreme Learning Machines - Two Hidden Layer Feed
Forward Neural Networks
1 The Extreme Learning Machine (ELM) for single-hidden-layer feed-forward neural networks (SLFNs) randomly chooses the hidden node weights and biases and then analytically determines the output weights.
2 This algorithm tends to provide good generalization performance with an extremely fast learning speed.
3 If, instead of one hidden layer, two hidden layers are used, and their weights and biases are randomly generated according to the theories and theorems discussed above, we obtain ELM-TLFN.
4 As in the SLFN case, after the input weights and the hidden layer biases are chosen arbitrarily, the network can be considered a linear system, and the output weights of the TLFN-ELM can be analytically determined through a simple generalized inverse operation on the hidden layer output matrix.
39. Extreme Learning Machines - Two Hidden Layer Feed
Forward Neural Networks : Proposal
The weights and the biases of the two hidden layers are generated randomly, as in the SLFN case. Then,
Σ_{i=1}^{N̂_2} β_i g_2( W_i · h_1(x_k) ) = o_k, for all k = 1, ..., N,
where h_1(x_k) = [g_1(w_1 · x_k + b_1), ..., g_1(w_{N̂_1} · x_k + b_{N̂_1})]^T is the output of the first hidden layer.
Here N is the total number of training examples. The vectors w_j = [w_{j1}, ..., w_{jn}]^T (input to first hidden layer) and W_i = [W_{i1}, ..., W_{iN̂_1}]^T (first to second hidden layer) are the weights generated randomly for the system, and β_i = [β_{i1}, ..., β_{im}]^T is the output weight vector. Then, as for the SLFN network, the above system can be written as:
Hβ = T
40. Extreme Learning Machines - Two Hidden Layer Feed
Forward Neural Networks: H matrix
Hidden Layer Output Matrix
Here H is the hidden layer output matrix of the system: its (k, i)-th entry is the output of the i-th node of the second hidden layer for the k-th training sample,
H_{k,i} = g_2( W_i · h_1(x_k) + B_i ), k = 1, ..., N, i = 1, ..., N̂_2.
41. Extreme Learning Machines - Two Hidden layer Feed
Forward Neural Networks: Algorithm
Given a training set, activation functions g_1(x) and g_2(x), and the numbers of hidden neurons N̂_1 and N̂_2 for the first and second hidden layers respectively:
1 Assign arbitrary (random) weights w_1 and w_2, and the biases B_1 and B_2.
2 Calculate the hidden-layer output matrix H of the system.
3 Calculate the output weight β:
β = H† T
Here H, β and T are as described in the previous sections.
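A compact NumPy sketch of this two-hidden-layer variant, under the same assumptions as the ELM sketch earlier (sigmoid activations, random normal weights); it is an illustration of the algorithm above, not the thesis implementation.

import numpy as np

def tlfn_elm_train(X, T, N1, N2, rng=np.random.default_rng(0)):
    """Two-hidden-layer ELM: both hidden layers use random weights and biases."""
    n = X.shape[1]
    w1, b1 = rng.normal(size=(n, N1)), rng.normal(size=N1)    # input -> hidden layer 1
    w2, b2 = rng.normal(size=(N1, N2)), rng.normal(size=N2)   # hidden layer 1 -> hidden layer 2
    H1 = 1.0 / (1.0 + np.exp(-(X @ w1 + b1)))
    H2 = 1.0 / (1.0 + np.exp(-(H1 @ w2 + b2)))                # hidden layer output matrix H
    beta = np.linalg.pinv(H2) @ T                             # output weights
    return (w1, b1, w2, b2, beta)

def tlfn_elm_predict(X, params):
    w1, b1, w2, b2, beta = params
    H1 = 1.0 / (1.0 + np.exp(-(X @ w1 + b1)))
    H2 = 1.0 / (1.0 + np.exp(-(H1 @ w2 + b2)))
    return H2 @ beta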
42. Extreme Learning Machines - Two Hidden layer Feed
Forward Neural Networks: Performance Comparison
Algorithm Testing Rate Training Time (secs) Hidden Nodes
ELM-SLFN 74.2115 0.044677 20
ELM-TLFN 75.46929 0.16709 10-20
Table: Performance comparison for the Glass dataset with ELM-SLFN and ELM-TLFN
43. Extreme Learning Machines - Two Hidden layer Feed
Forward Neural Networks: Performance Comparison
Figure: Performance comparison for various datasets
44. Sparse Autoencoders- Introduction
The sparse autoencoder learning algorithm is one approach to automatically learn features from unlabeled data. In some domains, such as computer vision, this approach is not by itself competitive with the best hand-engineered features, but the features it learns turn out to be useful for a range of problems (including ones in audio, text, etc.).
Further, there are more sophisticated versions of the sparse autoencoder that do surprisingly well and in many cases are competitive with or superior to even the best hand-engineered representations.
46. Sparse Autoencoders-A brief Idea
So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only an unlabeled training set {x_1, x_2, x_3, ...} with x_i ∈ R^n.
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs, i.e. y_i ≈ x_i.
The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it tries to learn an approximation to the identity function, so as to output x̂ that is similar to x.
The identity function seems a particularly trivial function to try to learn; but by placing constraints on the network, such as limiting the number of hidden units, we can discover interesting structure in the data.
47. Sparse Autoencoder- Example
Feature Representation
Suppose the inputs x are the pixel intensity values of a 10 × 10 image (100 pixels), i.e. x ∈ R^100, so n = 100, and there are s_2 = 50 hidden units in layer L_2; the output also satisfies y ∈ R^100.
Reconstruction
Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input, i.e. given only the vector of hidden unit activations a^(2) ∈ R^50 it must try to reconstruct the 100-pixel input x.
Different layers
When the number of hidden units is large, we can still discover interesting structure by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data.
48. Sparse Autoencoders- Sparsity Constraints
Informally, we think of a neuron as being active (or firing) if its output value is close to 1, and inactive if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. Let a_j^(2)(x) denote the activation of hidden unit j for the input x, and define
ρ̂_j = (1/m) Σ_{i=1}^{m} a_j^(2)(x_i)
as the average activation of hidden unit j (averaged over the training set). We would then like to enforce the constraint
ρ̂_j = ρ
where ρ is a sparsity parameter (typically a small value, ≈ 0.05).
49. Sparse Autoencoders- Sparsity Constraints
We would like the average activation of each hidden neuron j to be close to 0.05 (say). To satisfy this constraint, the hidden unit activations must mostly be near 0. To achieve this, an extra penalty term is added to the optimization objective which penalizes ρ̂_j for deviating from ρ:
Σ_{j=1}^{s_2} [ ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)) ]
Here s_2 is the number of neurons in the hidden layer, and the index j sums over the hidden units in our network.
50. Sparse Autoencoders- Kullback-Leibler (KL) divergence
The above penalty term
ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))
is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j. KL divergence is a standard function for measuring how different two distributions are.
Property:
KL(ρ || ρ̂_j) = 0 if ρ̂_j = ρ, and it increases monotonically as ρ̂_j diverges from ρ.
52. Sparse Autoencoders- Cost Function
Now the overall cost function can be written as:
J_sparse(W, b) = J(W, b) + β Σ_{j=1}^{s_2} KL(ρ || ρ̂_j)
Let us define the two terms:
J(W, b) is the cost function used for the BP algorithm, i.e.
J(W, b) = [ (1/m) Σ_{i=1}^{m} J(W, b; x_i, y_i) ] + (λ/2) Σ_{l=1}^{n_l − 1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_{ji}^{(l)})^2
The first term in the definition of J(W, b) is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights and helps prevent overfitting; λ controls its strength.
The parameter β controls the weight of the sparsity penalty term.
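A small NumPy sketch of the sparsity penalty above (the toy activations are made up; ρ = 0.05 and β = 3 are illustrative values, not the thesis settings).

import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), summed over hidden units."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho, beta = 0.05, 3.0
A2 = np.random.default_rng(0).uniform(0.01, 0.2, size=(1000, 50))  # toy hidden activations, m x s2
rho_hat = A2.mean(axis=0)                                          # average activation per hidden unit
penalty = beta * kl_sparsity_penalty(rho, rho_hat)                 # term added to J(W, b)
print(penalty)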
53. Representational Learning with ELM for Big Data
(ELM-AE) - Introduction
A machine learning algorithm's generalization capability depends on the dataset, which is why engineering a dataset's features to represent the data's salient structure is important. However, feature engineering requires domain knowledge and human ingenuity to generate appropriate features.
Similar to deep networks, multilayer ELM (ML-ELM) performs layer-by-layer unsupervised learning. This part of the thesis also introduces the ELM autoencoder (ELM-AE), which represents features based on singular values. Resembling deep networks, ML-ELM stacks ELM-AEs on top of one another to create a multilayer neural network. It learns significantly faster than existing deep networks, outperforming DBNs, SAEs and SDAEs and performing on par with DBMs on the MNIST dataset.
54. (ELM-AE)- A brief of ELM
The ELM theory for SLFNs shows that the hidden nodes can be randomly generated. The input data is mapped to an L-dimensional ELM random feature space, and the network output is:
f_L(x) = Σ_{i=1}^{L} β_i h_i(x) = h(x)β
where β = [β_1, β_2, β_3, ..., β_L]^T is the output weight matrix between the hidden nodes and the output nodes, h(x) = [g_1(x), g_2(x), ..., g_L(x)] are the hidden node outputs (random hidden features) for the input x, and g_i(x) is the output of the i-th hidden node.
55. (ELM-AE)- ELM
Given N training samples (x_i, t_i), i = 1, ..., N, ELM resolves the following learning problem:
Hβ = T
where T = [t_1, ..., t_N]^T are the target labels and H = [h^T(x_1), ..., h^T(x_N)]^T.
Hence the output weights β can be calculated as:
β = H† T
where H† is the Moore-Penrose generalized inverse of the matrix H.
To obtain better generalization performance and to make the solution more robust, one can add a regularization term:
β = (I/λ + H^T H)^{-1} H^T T
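The regularized solution can be written in a couple of lines of NumPy, as sketched below (H and T are assumed to come from a hidden layer and targets like those in the ELM sketch earlier; lam = 1e2 is just an example value).

import numpy as np

def elm_output_weights(H, T, lam=1e2):
    """Regularized ELM output weights: beta = (I/lambda + H^T H)^{-1} H^T T."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ T)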
56. Extreme Learning Machines as (ELM-AE)
The main objective of ELM-AE is to represent the input features
meaningfully in three different representations:
Compressed representation: represent features from a higher-dimensional input data space in a lower-dimensional feature space.
Sparse representation: represent features from a lower-dimensional input data space in a higher-dimensional feature space.
Equal-dimension representation: represent features in a feature space whose dimension equals that of the input data space.
58. (ELM-AE)- Orthogonalisation
Orthogonalisation of the randomly generated hidden parameters (weights and biases) tends to improve the generalization performance of ELM-AE.
The orthogonal random weights and biases of the hidden nodes project the input data to a different- or equal-dimension space, as shown by the Johnson-Lindenstrauss lemma.
The hidden layer outputs are calculated as:
h = g(a x + b), with a^T a = I and b^T b = 1
where a = [a_1, a_2, ..., a_L] are the orthogonal random weights and b = [b_1, b_2, ..., b_L] are the orthogonal random biases between the input nodes and the hidden nodes.
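A brief NumPy sketch of one way to obtain such orthogonal random weights and a unit-norm bias vector, via a QR decomposition of a random Gaussian matrix (this construction is an assumption for illustration, not necessarily the exact procedure used in ELM-AE).

import numpy as np

rng = np.random.default_rng(0)
n, L = 784, 100                                # e.g. MNIST input dim and hidden nodes

A = rng.normal(size=(n, L))
a, _ = np.linalg.qr(A)                         # a has orthonormal columns: a^T a = I
b = rng.normal(size=L)
b = b / np.linalg.norm(b)                      # unit-norm bias vector: b^T b = 1

X = rng.normal(size=(32, n))                   # toy batch of inputs
H = 1.0 / (1.0 + np.exp(-(X @ a + b)))         # h = g(a x + b)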
59. (ELM-AE)-Johnson-Lindenstrauss Lemma
"The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved."
Given 0 < ε < 1, a set X of m points in R^N, and a number n > ln(m)/ε^2, there is a linear map f : R^N → R^n such that
(1 − ε)||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε)||u − v||^2
for all u, v ∈ X.
One proof of the lemma takes f to be a suitable multiple of the orthogonal projection onto a random subspace of dimension n in R^N, and exploits the phenomenon of concentration of measure. An orthogonal projection will, in general, reduce the average distance between points, but the lemma can be viewed as dealing with relative distances, which do not change under scaling.
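For intuition, here is a tiny NumPy experiment (toy sizes chosen arbitrarily) that projects a few high-dimensional points through a scaled random Gaussian map, one standard way to realize the lemma, and checks how well squared pairwise distances are preserved.

import numpy as np

rng = np.random.default_rng(0)
m, N_dim, n_dim = 20, 1000, 200                         # points, original dim, reduced dim

X = rng.normal(size=(m, N_dim))
P = rng.normal(size=(N_dim, n_dim)) / np.sqrt(n_dim)    # scaled random projection map f
Y = X @ P

# Ratio of squared pairwise distances after / before projection (should stay near 1)
i, j = np.triu_indices(m, k=1)
before = np.sum((X[i] - X[j]) ** 2, axis=1)
after = np.sum((Y[i] - Y[j]) ** 2, axis=1)
print(np.min(after / before), np.max(after / before))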
60. (ELM-AE)-Singular Value Decomposition
The SVD of the solution
β = (I/λ + H^T H)^{-1} H^T T
gives:
Hβ = Σ_{i=1}^{N} u_i [ d_i^2 / (d_i^2 + λ) ] u_i^T X
where the u_i are the eigenvectors of H H^T and the d_i are the singular values of H, related to the SVD of the input data X (since H is the projected feature space of X squashed via a sigmoid function). The hypothesis is that the output weights β of the ELM-AE learn to represent the features of the input data via singular values.
61. (ELM-AE)-Multilayer Extreme Learning Machine
Multilayer neural networks perform poorly when trained with backpropagation (BP) alone, so the hidden layer weights of a deep network are initialized by layer-wise unsupervised training and the whole network is then fine-tuned with BP.
Similar to deep networks, ML-ELM hidden layer weights are initialized with ELM-AE, which performs layer-wise unsupervised training. However, in contrast to deep networks, ML-ELM does not require fine-tuning.
ML-ELM hidden layer activation functions can be either linear or nonlinear piecewise.
If the number of nodes L_k in the k-th hidden layer equals the number of nodes L_{k−1} in the (k−1)-th hidden layer, g is chosen as linear; otherwise, g is chosen as nonlinear piecewise, e.g. a sigmoidal function:
H^k = g((β^k)^T H^{k−1})
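The sketch below shows, in NumPy, how the layer-wise rule H^k = g((β^k)^T H^{k-1}) might be stacked using the ELM-AE output weights of each layer. The sigmoid g, the random initialization, the regularization value and the layer sizes are illustrative assumptions; a real ML-ELM would also train a final ELM classifier on the last H^k.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_weights(H_prev, L_k, lam=1e2, rng=np.random.default_rng(0)):
    """One ELM-AE layer: random hidden mapping, then output weights that reconstruct the layer input."""
    d = H_prev.shape[1]
    a = rng.normal(size=(d, L_k))
    b = rng.normal(size=L_k)
    H_ae = sigmoid(H_prev @ a + b)
    # beta (L_k x d): regularized least squares so that H_ae @ beta ≈ H_prev
    beta = np.linalg.solve(np.eye(L_k) / lam + H_ae.T @ H_ae, H_ae.T @ H_prev)
    return beta

# Stack layers: H^k = g((beta^k)^T H^{k-1}), written row-wise as H_k = g(H_{k-1} beta^T)
X = np.random.default_rng(1).normal(size=(200, 784))   # toy data standing in for MNIST
H = X
for L_k in (700, 700, 150):                            # illustrative layer sizes
    beta = elm_ae_weights(H, L_k)
    H = sigmoid(H @ beta.T)
# H now feeds a final (regularized) ELM classifier, as in the basic ELM sketch earlier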
62. (ELM-AE)- Adding Layers in ML-ELM
Figure: Adding layers in ML-ELM using ELM-AE
63. (ELM-AE)-Performance Comparison
Algorithm Training Rate Testing Rate Training Time (secs)
ELM 95.45 98.03 1120.365
ML-ELM 99.51 98.75 785.235
SAE 98.5645 98.7812 -
DBN 94.568 98.56 20548
DL-CNN 97 96 61872
Table: Performance Comparison for MNIST dataset
65. Heirarchical ELM (H-ELM)- ELM for Multilayer perceptron
H-ELM consists of two basic components: 1. unsupervised feature learning and 2. supervised feature classification. ELM has also been extended to semi-supervised and unsupervised tasks based on manifold regularization, where unlabeled or partially labeled samples are clustered using ELM.
The major difference between H-ELM and the original ELM is that, before the ELM-based feature classification is done, H-ELM uses unsupervised training to obtain a multilayer sparse representation of the raw input data, whereas in ELM the raw data is used directly for regression or classification. The compact features help remove the redundancy of the original inputs and thus improve efficiency.
66. (H-ELM)-Theorems
Theorem
Given any bounded nonconstant piecewise continuous function g : R → R, if span{G(a, b, x) : (a, b) ∈ R^d × R} is dense in L^2, then for any target function f and any function sequence g_L(x) = G(a_L, b_L, x) randomly generated according to any continuous sampling distribution,
lim_{n→∞} ||f − f_n|| = 0
holds with probability one if the output weights β_i are determined by ordinary least squares so as to minimize ||f(x) − Σ_{i=1}^{L} β_i g_i(x)||.
68. (H-ELM)-Framework
The H-ELM training architecture is structurally divided into two separate phases:
Unsupervised hierarchical feature representation: a new ELM-based autoencoder is developed to extract multilayer sparse features of the input data (discussed in the next section).
Supervised feature classification: the original ELM-based regression is performed for the final decision making.
70. Recommender System for a Music App using ELM
Recommender algorithms are best known for their use on e-commerce websites, where they use input about a customer's interests to generate a list of recommended items.
There are three kinds of recommender algorithms:
Collaborative Filtering: this approach builds a model from the user's past behaviour as well as similar decisions made by other users. The model is then used to predict items the user may be interested in.
Content-Based Filtering: this approach utilizes a series of discrete characteristics of an item in order to recommend additional items with similar properties.
Hybrid System: combining the above two approaches gives a new recommender system.
71. Recommender System for a Music App using ELM-
Proximity / Similarity Measure
Proximity is the distance between two customers, measured using either the correlation or the cosine measure.
Correlation: the similarity between two users a and b is measured by computing the Pearson correlation, given by:
corr_{a,b} = Σ_i (r_{ai} − r̄_a)(r_{bi} − r̄_b) / sqrt( Σ_i (r_{ai} − r̄_a)^2 Σ_i (r_{bi} − r̄_b)^2 )
Cosine: the two customers a and b are thought of as two vectors in the m-dimensional product space (or the k-dimensional space in the case of a reduced representation). The proximity between them is measured by computing the cosine of the angle between the two vectors:
cos(a, b) = a · b / (||a||_2 ||b||_2)
Using these values, the similarity matrix between users is built.
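A short NumPy sketch of these two measures on a pair of made-up rating vectors (purely illustrative; the actual rating structures in this work are the Playlist and RecommPlaylist matrices described later).

import numpy as np

r_a = np.array([5.0, 3.0, 0.0, 4.0, 1.0])   # play counts / ratings of user a
r_b = np.array([4.0, 0.0, 1.0, 5.0, 2.0])   # play counts / ratings of user b

# Pearson correlation similarity
da, db = r_a - r_a.mean(), r_b - r_b.mean()
corr = np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

# Cosine similarity
cos = np.dot(r_a, r_b) / (np.linalg.norm(r_a) * np.linalg.norm(r_b))

print(corr, cos)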
72. Recommender System for a Music App using ELM- Idea
In this section we discuss how the music app works. Suppose there are 10 songs in the system; each song has the following attributes:
Attribute 1: Genre: a music genre is a conventional category that identifies some pieces of music as belonging to a shared tradition or set of conventions. It is distinguished from musical form and musical style, although in practice these terms are sometimes used interchangeably.
Attribute 2: Number of times played: if a song is played for more than 75% of its total duration, it is counted as played; otherwise it is considered not played. This threshold forms a boundary between a song that is skipped or forwarded and a song that is actually listened to.
Attribute 3: Output: this is simply the song number assigned when the song was downloaded to the device.
Here the most important attribute is the genre.
73. Recommender System for a Music App using ELM-
Dataset Making
Three datasets are used in building this app.
Datasetsongnames: the dataset of the songs kept on the device. If a new song is added, the dataset gains one more row containing the song and its details.
Playlist: this matrix is built within the application. It is a t × m matrix, where t is the number of times the app has been used (the background process stores the song details) and m is the number of songs played from the app in one session.
RecommPlaylist: an n × n matrix, where n is the number of songs on the device. Each cell indicates whether a particular song was played in that playlist.
74. Recommender System for a Music App using ELM-
Procedure
1. Measure Similarity using
Jaccard Distance
Cosine Distance
2. Suggestions By ELM
75. Recommender System for a Music App using ELM- Results
Table: Songs in the device
Name of the Song Genre Count Song Number
Saturday Saturday 1 15 1
Main koi Aisa Geet Gaaon 2 13 2
Nothing Else Matters 3 9 3
Kashmir 4 11 4
Paradise 5 7 5
Tu kisi Rail Si 5 11 6
Super Machi 7 11 7
Pani Da 5 9 8
Slim Shady 6 6 9
Sunn Raha Hai Na Tu 1 4 10
78. Recommender System for a Music App using ELM- Results
The songs recommended by the two algorithms are:
ELM: Song Nos. 1, 2, 4, 5
When the song being played is song number 8, its genre matches that of song number 5, so the user chooses song number 5.
BP: Song Nos. 1, 5, 6, 10
Song number 10 has hardly been played.
79. Conclusions and Future Works
Conclusions:
The test results support that ELM has better generalization capability along with a faster learning rate.
TLFN-ELM takes somewhat more time than the SLFN, but its testing results are noticeably higher.
ELM-AE can serve as a counterpart to deep networks, since it takes a very small amount of time in comparison to them.
ELM predicted better for the music app than BP.
Future Works:
The TLFN can be applied to various other datasets and compared with other algorithms; the architecture can be refined to predict better on the load shedding dataset.
For the music app, facial expression detection can be applied to learn the mood of the user.
80. References
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273-97.
Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Process Lett. 1999;9(3):293-300.
Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: a new learning scheme of feedforward neural networks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN2004), vol. 2, Budapest, Hungary; 2004. p. 985-990, 25-29 July.
Huang G-B, Zhu Q-Y, Siew C-K. Extreme learning machine: theory and applications. Neurocomputing. 2006;70:489-501.
Huang G-B, Chen L, Siew C-K. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw. 2006;17(4):879-92.
81. References
Huang G-B, Chen L. Enhanced random search based incremental extreme learning machine. Neurocomputing. 2008;71:3460-8.
Bartlett PL. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans Inform Theory. 1998;44(2):525-36.
Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65(6):386-408.
Serre D. Matrices: theory and applications. New York: Springer; 2002.
Rao CR, Mitra SK. Generalized inverse of matrices and its applications. New York: Wiley; 1971.
Huang G-B, Ding X, Zhou H. Optimization method based extreme learning machine for classification. Neurocomputing. 2010;74:155-63.
Bai Z, Huang G-B, Wang D, Wang H, Westover MB. Sparse extreme learning machine for classification. IEEE Trans Cybern. 2014.
Huang G-B, Li M-B, Chen L, Siew C-K. Incremental extreme learning machine with fully complex hidden nodes. Neurocomputing. 2008;71:576-83.
82. References
Huang G-B, Chen L. Convex incremental extreme learning machine. Neurocomputing. 2007;70:3056-62.
Werbos PJ. Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University; 1974.
Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL, editors. Parallel distributed processing: explorations in the microstructures of cognition, vol. 1: foundations. Cambridge, MA: MIT Press; 1986. p. 318-62.
Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323:533-6.
Werbos PJ. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. New York: Wiley; 1994.