Neural Networks on Steroids
Adam Blevins
April 10, 2015
Preface
This report is an introduction to Neural Networks: what Neural Networks are, how they
work, simple examples, their limitations and current solutions to those limitations.
Chapter 1 considers the motivation behind research into Neural Networks. The basic
architecture of simple Neural Networks is investigated and the most common learning
algorithm in practice, the Backpropagation algorithm, is discussed.
Chapter 2 applies the theory from Chapter 1 to an example Neural Network designed to map
x → x², followed by analysis and evaluation of the results produced.
Chapter 3 discusses the Universal Approximation Theorem, a theorem stating that a certain
type of simple Neural Network can approximate any function under particular
conditions. The chapter goes on to describe how training such a network presents its
own difficulties, and we provide solutions for avoiding these problems.
Chapter 4 motivates the desire to use more complicated Neural Networks by increasing their
size. Unsurprisingly this yields two fundamental training problems with the
Backpropagation algorithm called the Exploding Gradient problem and Vanishing Gradient
problem.
Chapter 5 considers a particular training technique called greedy unsupervised layerwise
pre-training which avoids the Exploding and Vanishing Gradient problems. This gives
us a much more reliable and accurate Neural Network model.
Chapter 6 concludes the investigation of this report and suggests possible topics for
future work.
Declaration
“This piece of work is a result of my own work except where it forms an assessment based on group
project work. In the case of a group project, the work has been prepared in collaboration with other
members of the group. Material from the work of others not involved in the project has been
acknowledged and quotations and paraphrases suitably indicated.”
Contents

1 An Introduction to Neural Networks
  1.1 What is a Neural Network?
  1.2 Uses of Neural Networks
  1.3 The Perceptron and the Multi-Layer Perceptron (MLP)
    1.3.1 Perceptron
    1.3.2 Multi-Layer Perceptron (MLP)
  1.4 The Workings of an MLP
    1.4.1 The Backpropagation Training Algorithm
    1.4.2 Initial Setup of a Neural Network
2 A Single Layer MLP for Function Interpolation
  2.1 Aim of this Example
    2.1.1 Software and Programming Used
  2.2 Method
  2.3 Results and Discussion
    2.3.1 Great Result
    2.3.2 Incredulous Result
    2.3.3 Common Poor Result
    2.3.4 Other Interesting Results
    2.3.5 Run Errors
  2.4 Conclusions
3 Universal Approximation Theorem
  3.1 Theorem
  3.2 Sketched Proof
    3.2.1 Proof of Theorem 1
    3.2.2 Discussion
  3.3 Overfitting and Underfitting
    3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff
    3.3.2 Applying Overfitting and Underfitting to our Example
    3.3.3 How to Avoid Overfitting and Underfitting
  3.4 Conclusions
4 Multiple Hidden Layer MLPs
  4.1 Motivation for Multiple Hidden Layers
  4.2 Example Problems for Multiple Hidden Layer MLPs
    4.2.1 Image Recognition
    4.2.2 Facial Recognition
  4.3 Vanishing and Exploding Gradient Problem
    4.3.1 Further Results
    4.3.2 Conclusions
5 Autoencoders and Pre-training
  5.1 Motivation and Aim
  5.2 The Autoencoder
    5.2.1 What is an Autoencoder?
    5.2.2 Training an Autoencoder
    5.2.3 Dimensionality Reduction and Feature Detection
  5.3 Autoencoders vs Denoising Autoencoders
    5.3.1 Stochastic Gradient Descent (SGD)
    5.3.2 The Denoising Autoencoder
  5.4 Stacked Denoising Autoencoders for Pre-training
  5.5 Summary of the Pre-training Algorithm
  5.6 Empirical Evidence to Support Pre-training with Stacked Denoising Autoencoders
  5.7 Conclusions
6 Conclusion
  6.1 Conclusion
  6.2 Future Work
  6.3 Acknowledgements
A Single Layer MLP Python Code
B Python Output Figures from Section 2.3.4
Chapter 1
An Introduction to Neural Networks
1.1 What is a Neural Network?
Neural Networks (strictly speaking, 'artificial' Neural Networks, henceforth referred to as ANNs)
are so called because they resemble the mammalian cerebral cortex [1]. The nodes
(sometimes called neurons, units or processing elements) of an ANN represent the neurons of the
brain. The weighted interconnections between nodes symbolise the communicative electrical pulses.
The following image represents a basic, directed ANN:
[Figure 1.1: An example of a simple directed ANN — an input layer, a hidden layer and an output layer; arrows are weighted connections.]
As shown above, ANNs consist of an input layer, some hidden layers and an output layer.
Depending on the use of the network, one can have as many nodes in each layer and as many hidden
layers as desired (although we will see later the disadvantages of having too many hidden layers
or nodes).
One use of such a network is image recognition. Say you have an image and you would like to
classify it: for example, is there an orange in this picture or a banana? An image is made
up of a number of pixels, each with an associated RGB (red-green-blue) colour code. We can define our
network to have the same number of input nodes as pixels in the image, allowing one and only one
pixel to enter each input node. The hidden nodes process the information received via the weighted
connections, potentially picking out which colour is most prominent in total in the entire image, and
then hopefully the output would give the correct label: banana or orange. This idea could be
extrapolated to handwriting recognition, such as the technology available in the Samsung Galaxy
Note, which allows a user to write with a stylus and recognises these inputs as
certain letters to produce computerised documents.
This is where machine learning comes in. ANNs are built to adapt to and learn from the information
they are given. If we extend the orange-banana example, we could teach our network to tell the
difference between a banana and an orange by giving it an arbitrary number of images containing a banana
or an orange, and telling the network what the target label is. This set of images would be
called a training set, and this method of learning would be called supervised learning (i.e. we give
the network an input and a target). In this example, the network will likely home in on the colour
difference to deliver its verdict. Once trained, all future images input to the network should receive
a reliable label.
1.2 Uses of Neural Networks
Neural Networks are everywhere in technology. Some additional examples to the image recognition
from above include:
1. The autocorrect on smartphones. Neural Networks learn to adapt to the training set given to
them and therefore, if storable on a smartphone, have the ability to adapt a dictionary to a
user like autocorrect.
2. Character recognition. This is extremely popular with the idea of handwriting with a stylus
on tablets and phones these days as mentioned before.
3. Speech recognition. This has become more powerful in recent years. Bing has utilised Neural
Networks to double the speed of their voice recognition on their Windows phones [2].
4. A quirky use includes, and I quote, “a real-time system for the characterisation of sheep
feeding phases from acoustic signals of jaw sounds” [3]. This was an actual research article in
the Australian Journal of Intelligent Information Processing Systems (AJIIPS), Vol 5, No. 2 in
1998 by Anthony Zaknich and Sue K Baker. Eric Roberts’ Sophomore Class of 2000 reported
online that radio microphones attached to the head of the sheep allow for chewing sounds to be
transmitted and comparing this with the time of day, allows an ANN to predict future eating
times [4]. If anything this demonstrates the versatility of ANNs.
5. Generally, a Neural Network can take a large number of variables which appear to have no
conceivable pattern and find associations or regularities. This could be something extremely
unusual like football results coinciding with a person watching (or not watching) the game.
1.3 The Perceptron and the Multi-Layer Perceptron (MLP)
1.3.1 Perceptron
The perceptron is the simplest ANN. It only consists of an input and output layer. It was first
conceived by Rosenblatt in 1957 [5]. It was named as such because its invention was to model
perceptual activities, for example responses of the retina. This was the basis of Rosenblatt’s research
at the time. Quoted from a journal entry by Hervé Abdi from 1994 [6]:
“The main goal was to associate binary configurations (i.e. patterns of [0, 1] values) presented as
inputs on a (artificial) retina with specific binary outputs. Hence, essentially, the perceptron is
made of two layers: the input layer (i.e., the “retina”) and the output layer.”
The perceptron can successfully fulfil this need and works as follows:

[Figure 1.2: The perceptron with two input nodes and one output node. Inputs x1 and x2 feed the output node j through weights w1j and w2j; the node forms the total input $in_j = \sum_i x_i w_{ij}$ and outputs $o_j = a_j$.]
where:
• $in_j$ is the total input of the j-th node, which in this case is the sum of the inputs multiplied by their respective connection weights:

$$in_j = \sum_i x_i w_{ij}$$ (1.3.1)

• $x_i$ is the input of the i-th connection
• $w_{ij}$ is the weight of the connection from node i to node j
• $o_j$ is the output of the j-th node
• $a_j$ is some activation function to be defined by the programmer, e.g. $\tanh(in_j)$. This becomes more apparent in an MLP, where there is a greater number of layers.

Usually with one output node, as in Figure 1.2, $o_j = in_j$. Notice that there are two input nodes
because the intended inputs were binary configurations of the form [0, 1].
If the data is linearly separable, the perceptron convergence theorem proves that a solution
can always be found. This proof is omitted (but can be found via the link in the bibliography
[7]). However, the output node is a linear combination of the input nodes and hence can only
differentiate between linearly separable data. This is a severe limitation.
We will use the non-linearly separable logical XOR function as an example to show non-
linearly separable functions have no perceptron solutions. The logical XOR function is defined
as:
[0, 0] → [0]
[0, 1] → [1]
[1, 0] → [1]
[1, 1] → [0] (1.3.2)
We can see it is not linearly separable by the following diagram:
[Figure 1.3: The XOR values plotted on the unit square — (0,0) and (1,1) take value [0] (grey dots), (0,1) and (1,0) take value [1] (black dots). There is no way to separate the grey dots and the black dots with a straight line.]
The original proof showing perceptrons could not learn non-linearly separable data was by Marvin
Minsky and Seymour Papert, published in a book called “Perceptrons” in 1969 [8]. However, the
following proof showing the perceptron’s inability to differentiate between non-linearly separable
data is quoted from Hervé Abdi's journal entry in the Journal of Biological Systems in 1994, page
256 [6]:
Take a perceptron of the form Figure 1.2 and define our weights as w1 and w2. The inputs
are of the form [x1, x2]. The association of the input [1, 0] → [1] implies that:
w1 > 0 (1.3.3)
The association of the input [0, 1] → [1] implies that:
w2 > 0 (1.3.4)
Adding together Equations 1.3.3 and 1.3.4 gives:
w1 + w2 > 0 (1.3.5)
Now if the perceptron gives the response 0 to the input pattern [1, 1], this implies that:
w1 + w2 ≤ 0 (1.3.6)
Clearly, the last two equations contradict each other, hence no set of weights can solve the XOR
problem.
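This impossibility is easy to watch happen in code. The following is a minimal sketch (our own illustration, not from the report's appendix) of the classic perceptron learning rule with a zero threshold and no bias, exactly the setting of the proof above: the linearly separable OR function converges in a few epochs, while XOR never does.

```python
# A sketch of the perceptron learning rule on two inputs with a zero
# threshold, as in the proof above. OR is linearly separable and converges;
# XOR never does, matching the contradiction in Equations 1.3.3-1.3.6.

def train_perceptron(data, epochs=100, eta=0.1):
    """data: list of ((x1, x2), target) pairs. Returns (weights, converged)."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        errors = 0
        for (x1, x2), t in data:
            o = 1 if x1 * w[0] + x2 * w[1] > 0 else 0  # threshold at 0
            if o != t:                                  # perceptron update rule
                w[0] += eta * (t - o) * x1
                w[1] += eta * (t - o) * x2
                errors += 1
        if errors == 0:
            return w, True                              # a full error-free epoch
    return w, False

OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(OR))   # converges, e.g. ([0.1, 0.1], True)
print(train_perceptron(XOR))  # (..., False): no weight pair can work
```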
Due to the severe limitations of the perceptron, the multi-layer perceptron (MLP) was intro-
duced.
1.3.2 Multi-Layer Perceptron (MLP)
The MLP is an extension of the perceptron. Hidden layers are now added between the input
and output layers to give the ANN greater flexibility in calculations. Each additional weighted
connection is another parameter, allowing for more complex problems and calculations where
linear solutions just cannot help. Every other aspect of an MLP is the same as the perceptron,
including the underlying node functions, but most importantly an MLP allows non-linearly
separable data to be processed.
An MLP features a feedforward mechanism. A feedforward ANN is one which is directed
such that each layer may only send data to following layers but not within layers. Additionally, the
MLP is fully connected from one layer to the next. The following diagram is one of the simplest
MLPs possible for the logical XOR function, taken from a Rumelhart et al. report from
1985 [9]:

[Figure 1.4: An example of an MLP solution to the logical XOR function. The inputs x1 and x2 each connect to the hidden unit and to the output unit with weight +1; the hidden unit connects to the output unit with weight −2. The hidden unit computes f(in_j) with threshold 1.5 and the output unit computes g(in_k) with threshold 0.5. Explanation in the following text.]
All of these nodes work exactly like Figure 1.2: each node takes the sum of its inputs multiplied by
the weights on the connections. However, you will notice that two of the nodes contain numbers.
These indicate the output functions on these specific nodes, $o_j = f(in_j)$ and $o_k = g(in_k)$.
Sometimes called the threshold function, the hidden unit's output is the following:

$$f(in_j) = \begin{cases} 1 & \text{if } in_j > 1.5 \\ 0 & \text{if } in_j < 1.5 \end{cases}$$ (1.3.7)

Similarly, the threshold function for the output unit is:

$$g(in_k) = \begin{cases} 1 & \text{if } in_k > 0.5 \\ 0 & \text{if } in_k < 0.5 \end{cases}$$ (1.3.8)
The threshold function is effectively deciding whether the node should be activated and thus the
node’s output function is referred to as an activation function. The superiority of the MLP over the
simple perceptron is immediately clear. Having added just one extra unit to the architecture, we
obtain a successful ANN which solves the logical XOR function, something the perceptron cannot do.
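We can verify the network of Figure 1.4 directly. The following sketch (our own, under the weights and thresholds read off the figure) evaluates it on all four binary inputs:

```python
# Verifying the Figure 1.4 network: both inputs feed the hidden unit
# (threshold 1.5) and the output unit (threshold 0.5) with weight +1;
# the hidden unit feeds the output unit with weight -2.

def step(x, threshold):
    # Threshold activation as in Equations 1.3.7 and 1.3.8
    return 1 if x > threshold else 0

def xor_mlp(x1, x2):
    hidden = step(x1 + x2, 1.5)             # fires only for input [1, 1]
    return step(x1 + x2 - 2 * hidden, 0.5)  # the -2 suppresses the [1, 1] case

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), '->', xor_mlp(x1, x2))  # prints 0, 1, 1, 0
```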
Due to the greater flexibility of the MLP over the perceptron, it is difficult to see immediately
the best values for the weights, and sometimes even how many hidden nodes are ideal.
Typically, an ANN learns by iterative trial and error, and there are a number of ways to train MLPs to find the
best values (such as dropout, unsupervised learning, autoencoders etc.). Before considering training
methods, let's first understand the inner calculations of an MLP.
1.4 The Workings of an MLP
I have described how the nodes calculate the total input $in_j$ and explained, with notation, the idea
of a threshold function as an activation function. Although a threshold function can be extremely
useful in certain situations, the activation functions commonly used are one of the two following
sigmoids:

$$a_j(in_j) = \frac{1}{1 + e^{-in_j}} \quad \text{(logistic function)}$$ (1.4.1)

and

$$a_j(in_j) = \tanh(in_j) \quad \text{(hyperbolic tangent)}$$ (1.4.2)
These functions are usually applied to the hidden layer nodes but can be applied to the output
nodes as desired. It should be noted that the logistic function is bounded between 0 and 1, and the
hyperbolic tangent is bounded between -1 and 1. This is especially important when initialising the
weights of a network because if you expect to get a large output, you may need large weights to
compensate due to the activation function. The idea of using such functions allows for non-linear
solutions to be produced which in turn allows a more competent and functional ANN.
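The two sigmoids and their bounds are quick to see numerically. A small sketch (numpy is assumed here purely for convenience; any math library would do):

```python
# The two sigmoids from Equations 1.4.1 and 1.4.2 and their bounds.
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))   # bounded in (0, 1)

def hyperbolic_tangent(x):
    return np.tanh(x)                 # bounded in (-1, 1)

x = np.linspace(-6, 6, 5)
print(logistic(x))            # approaches 0 and 1 at the extremes
print(hyperbolic_tangent(x))  # approaches -1 and 1 at the extremes
```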
Now that our networks can use these activation functions that allow non-linear solutions, it
is helpful to add nodes which allow for a linear change in the solution, i.e. bias nodes. The action
of this node is to add a constant input from one layer to the following layer. It is connected like
all nodes, via a weighted connection, to allow the network to correct its influence as required. The
bias can take any value, but a value of 1 is common. The notation we shall use to represent a bias
node will be b used much like the threshold function notation, i.e. the b shall be within the node on
a diagram.
There are a number of ways to train an ANN and we will now investigate the most com-
monly used training algorithm, Backpropagation which is short for backpropagation of errors.
ANNs which employ this algorithm are usually referred to as Backpropagation Neural Networks
(BPNNs).
1.4.1 The Backpropagation Training Algorithm
The Backpropagation algorithm was first applied to Neural Networks in a 1974 PhD Thesis by Paul
J. Werbos [10], but its importance was not fully understood until David Rumelhart, Geoffrey Hinton
and Ronald Williams published a 1986 paper called “Learning representations by back-propagating
errors” [11]. It was this paper that detailed the usefulness of the algorithm and its functionality,
and it was these men that sparked a comeback within the neural network community for ANNs,
inspiring successful deep learning (i.e. successful learning of ANNs with hidden layers). They were
able to show that the backpropagation algorithm could train a network so much faster than earlier
developments of learning algorithms that previously unsolvable problems were now solvable. The
large increase in efficiency meant a massive set of training data was not essential, allowing for a
more attainable training set for such problems.
The remainder of this subsection is a detailed description of the Backpropagation training
algorithm. It is heavily based upon the 1986 Rumelhart, Hinton and Williams article mentioned
above [11], however notation has been altered for consistency and explanations have been expanded.
It should also be noted that one is not expected to understand the following equations immediately.
Naturally it is expected to take significant experience and examples to fully understand their
meaning.
Our aim is to find appropriate weights such that the input vector delivered to the ANN results
in an output vector sufficiently close to our target output vector, for the entirety of our
training data. The ANN will then have the ability to “fill in the gaps” between our training set
to provide a smooth, accurate function fitted to purpose. Defining the input layer as layer 0, let's
define each individual weight with the notation $w^{(l)}_{ij}$, where l represents the layer the weighted
connection is entering (i.e. a weight with index l = 1 means the weighted connection enters a
node in layer 1), i represents a node in layer l − 1 and j represents a node in layer l. Thus the
total input for each node, defined as the function in Equation 1.3.1, can be rewritten as:

$$in^{(l)}_j = \sum_i a^{(l-1)}_i w^{(l)}_{ij}$$ (1.4.3)

where $a^{(l-1)}_i$ represents the activation function output of node i in layer l − 1. This algorithm
eventually differentiates the activation function to instigate the fundamental concept of the
Backpropagation algorithm, gradient descent. Any activation function can be used, as long as it has a
bounded derivative. Also note that the resulting value of the activation function for a node is the
respective output of said node.
Finally, taking our final layer to be l = L, we shall define the output vector of the network
as $a^{(L)}(x_i)$, where $x_i$ is the corresponding input vector and this a still refers to the
activation function. This yields our error function of the network for a single training sample:

$$E_i = \frac{1}{2} \left\| t_i - a^{(L)}(x_i) \right\|_2^2$$ (1.4.4)

where i represents an input-target case from our training data and $t_i$ is our respective target vector
for $x_i$.
The idea of gradient descent is to take advantage of the chain rule by finding the partial
derivative of the error, $E_i$, with respect to each weight, $w^{(l)}_{ij}$. This will then help minimise E. We
will then update the weights as follows:

$$\Delta w^{(l)}_{ij} = -\eta \frac{\partial E_i}{\partial w^{(l)}_{ij}}$$ (1.4.5)
for some suitable value of η (referred to as the learning rate). This learning rate is there to control
the magnitude of the weight change. The negative sign is by convention to indicate the direction of
change should be towards a minimum and not a maximum. Ideally, we want to minimise the error
via the weight changes to get an ideal solution. To see this equation visually, let’s imagine we have
a special network in which the Error function only depends on one weight. Then the Error function
could look something like this:
Figure 1.5: An example Error function which depends on one weight
To get the smallest possible E we need to find the global minimum. It is possible for there to be
many local minima that you want to avoid. The idea behind the learning rate is to find a balance
between converging on the global minimum and “jumping” out of or over the local minima. Too
small a learning rate and you might remain stuck as illustrated by the orange dot, too big and you
may overshoot the global minimum entirely as shown by the blue dot.
Unfortunately it is difficult to see the ideal learning rate from the outset and it is common-
place to use trial and error to find the optimal η. A regular starting point would be a value
between 0.25 and 0.75, but it is not unusual to be as small as 0.001 if you have a simple function to
approximate.
Now to find the actual values of Equation 1.4.5. First of all we need to calculate $\partial E / \partial w^{(l)}_{ij}$
for our output nodes. For each output node we have the error:

$$\frac{\partial E}{\partial a^{(L)}_j} = -\left( t_j - a^{(L)}_j \right)$$ (1.4.6)

where j corresponds to the j-th output node, recalling that $a^{(L)}_j$ is the output of the j-th output
node. This is simple for the output layer of the ANN: it is just the output of the node minus the
target of the node. Calculating for previous layers becomes more difficult.
The next layer to consider is the penultimate layer of the ANN. Recall the total input for
a node:

$$in^{(l)}_j = \sum_i a^{(l-1)}_i w^{(l)}_{ij}$$ (1.4.7)

where $a^{(l-1)}_i$ is the output of the i-th node in layer l − 1. Using this equation we calculate for our
penultimate layer the following:

$$\frac{\partial E_i}{\partial w^{(L-1)}_{ij}} = \frac{\partial E}{\partial in^{(L-1)}_j} \frac{\partial in^{(L-1)}_j}{\partial w^{(L-1)}_{ij}} = \frac{\partial E}{\partial in^{(L-1)}_j} \, a^{(L-2)}_i$$ (1.4.8)
We must now figure out the value of $\partial E / \partial in^{(L-1)}_j$, which is as follows:

$$\frac{\partial E}{\partial in^{(L-1)}_j} = \frac{\partial E}{\partial a^{(L-1)}_j} \frac{\partial a^{(L-1)}_j}{\partial in^{(L-1)}_j} = \frac{\partial E}{\partial a^{(L-1)}_j} \, a_j'(in^{(L-1)}_j)$$ (1.4.9)

where $a_j'(in^{(L-1)}_j)$ is the derivative of the chosen activation function. Recalling that the activation
function depends only on the respective node's input (for example the hyperbolic tangent activation
function from Equation 1.4.2 was just $a_j = \tanh(in_j)$), this is easily calculated. The layer of $a_j$ is
the same as the layer associated with $in_j$ (i.e. the layer the input is entering), but we remove this
index from $a_j$ to avoid a bigger mess of indices.
From here until the end of the subsection, credit must also be given to R. Rojas [12] and A.
Venkataraman [13], in addition to Rumelhart et al. [11]. Now we just need to calculate $\partial E / \partial a^{(L-1)}_j$
in the above equation. Taking E as a function of the inputs of all nodes $k \in K = \{1, 2, ..., n\}$ receiving
input from node j:

$$\frac{\partial E}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial in^{(L)}_k} \frac{\partial in^{(L)}_k}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial in^{(L)}_k} \frac{\partial in^{(L)}_k}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial a^{(L)}_k} \, a_k'(in^{(L)}_k) \, w^{(L)}_{jk}$$ (1.4.10)
This same formula can be used for weights connecting to layers before the penultimate layer. Thus
we can now find how to change any weight in the whole network. We can therefore conclude the
following:

$$\frac{\partial E}{\partial w^{(l)}_{ij}} = \frac{\partial E}{\partial a^{(l)}_j} \frac{\partial a^{(l)}_j}{\partial in^{(l)}_j} \, a^{(l-1)}_i = \frac{\partial E}{\partial a^{(l)}_j} \, a_j'(in^{(l)}_j) \, a^{(l-1)}_i$$

with:

$$\frac{\partial E}{\partial a^{(l)}_j} = \begin{cases} a^{(L)}_j - t_j & \text{if } j \text{ is a node in the output layer} \\ \displaystyle\sum_{k \in K} \frac{\partial E}{\partial a^{(l+1)}_k} \, a_k'(in^{(l+1)}_k) \, w^{(l+1)}_{jk} & \text{if } j \text{ is a node in any other layer} \end{cases}$$ (1.4.11)
The appropriate terms are substituted into Equation 1.4.5, allowing full training of the system.
The time it takes the network to run through all input-target cases once is defined as an epoch. The
most common way to update the weights is after each epoch. After a pre-defined number of epochs
the network will stop training. This could be any number, but 1000 is a good stopping point: it
prevents the system from taking too long to train while allowing plenty of time for convergence to the
ideal solution.
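To make Equations 1.4.5-1.4.11 concrete, here is a compact sketch of one such training loop. It assumes one hidden layer of logistic units, a linear output unit and a batch update once per epoch; it is our own illustration, not the Appendix A program, and like any such network it may need several runs to converge well.

```python
# A sketch of batch Backpropagation (Eqs. 1.4.5-1.4.11) for a network with
# one hidden layer of logistic units and a linear output unit.
import numpy as np

rng = np.random.default_rng()

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(xs, ts, n_hidden=4, eta=0.5, epochs=1000):
    # Appended columns of ones play the role of the bias nodes.
    W1 = rng.uniform(-0.5, 0.5, size=(2, n_hidden))      # input + bias -> hidden
    W2 = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, 1))  # hidden + bias -> output
    X = np.column_stack([xs, np.ones_like(xs)])
    T = ts.reshape(-1, 1)
    for _ in range(epochs):
        A1 = logistic(X @ W1)                    # hidden activations, Eq. 1.4.3
        A1b = np.column_stack([A1, np.ones(len(xs))])
        Y = A1b @ W2                             # linear output a^(L)
        delta_out = Y - T                        # Eq. 1.4.6 at the output layer
        # Eq. 1.4.10 folded with logistic'(x) = a(1 - a):
        delta_hidden = (delta_out @ W2[:-1].T) * A1 * (1 - A1)
        W2 -= eta * A1b.T @ delta_out / len(xs)  # weight update, Eq. 1.4.5
        W1 -= eta * X.T @ delta_hidden / len(xs)
    return W1, W2

xs = np.linspace(0, 1, 11)
W1, W2 = train(xs, xs ** 2)   # learn x -> x^2, as in the next chapter
```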
1.4.2 Initial Setup of a Neural Network
An ANN will be programmed into a computer, in any language. This is fortunate
because we don't have to worry about calculating the gruelling equations of the previous subsection
ourselves. In the following chapter we will see an example of an MLP, its solutions and the
problems it faces. There are a few things we should consider first. From the outset it is not always
clear how exactly to initialise the network: you will find it difficult to guess the ideal weights, and it may
be difficult to estimate the number of epochs to run through before ceasing training,
but here are a few ideas to help get your head around it.
• It is ideal to set up your network with weights randomised between certain values. The
random values should not be generated from a fixed seed, so that each run through gives a
different set of results; this is ideal for finding the best possible setup for your ANN. You can only
estimate your weights based upon previous examples you may have seen and the expected
outputs of your network.
• With regards to the epochs, it will depend entirely upon the size of your training set and the
power of your computer. The more epochs the better for converging networks, but not so
many that you have to wait too long for a solution: time constraints, as well as data
constraints, are what caused ANNs to fall out of popularity in the '70s.
• The learning rate is a difficult one to guess, but it is best to start small, because you are then
almost guaranteed to converge to some minimum given enough epochs. A learning rate that
is too large may find the global minimum but can jump back out of it too.
• Bias nodes are recommended as one per layer (except the output layer). You won't need more
than one per layer, since the network can adjust the influence of the bias through its weight. This
allows a lot more freedom for error correction.
• With regards to the training data, you will want to set aside maybe 10-20% of it to test the
network with once trained. This helps to establish how accurate the network is. The training
data should also be normalised to lie within the interval [-1, 1] (a sketch of both steps follows
this list). This helps stabilise the ANN with regards to the activation functions used and allows for
smaller weights, and smaller weights mean greater accuracy. Thinking back to how the weights are
adjusted, a bigger weight is more likely to receive a proportionally significant adjustment, and with
a lot of weights to adjust this can impact convergence and training time.
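The following is a minimal sketch of the two data-handling tips above — holding back roughly 20% of the input-target pairs for testing and rescaling everything into [-1, 1]. The names and the example function are illustrative only, not taken from the report's code.

```python
# Hold out ~20% of the data for testing and normalise into [-1, 1].
import numpy as np

rng = np.random.default_rng()

def normalise(v):
    # Linearly map the values of v into the interval [-1, 1]
    lo, hi = v.min(), v.max()
    return 2 * (v - lo) / (hi - lo) - 1

xs = np.linspace(0, 10, 50)
ts = xs ** 2                              # example targets

idx = rng.permutation(len(xs))            # shuffle before splitting
n_test = len(xs) // 5                     # roughly 20% held back
test, train = idx[:n_test], idx[n_test:]

x_train, t_train = normalise(xs)[train], normalise(ts)[train]
x_test, t_test = normalise(xs)[test], normalise(ts)[test]
```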
The next chapter focusses on an example network written in Python and the problems faced in
converging on the global minimum needed for an accurate network using the MLP
architecture.
Chapter 2
A Single Layer MLP for Function
Interpolation
2.1 Aim of this Example
The aim is to teach a single layer feedforward MLP to accurately map x → x² =: f(x), where
x ∈ [0, 1] ⊂ ℝ, using the Backpropagation algorithm for training.
2.1.1 Software and Programming Used
The code has been adapted in Enthought Canopy [14] and written in the language Python using
the version distributed by Enthought Python Distribution [15]. The code, heavily based on a Back-
propagation network originally written by Neil Schemenauer [16], is attached in Appendix A, fully
annotated.
2.2 Method
1. The ANN is to learn via the Backpropagation algorithm, as discussed in Section 1.4.1, and
therefore needs training data. The training data used is the following:

x ∈ {0, 0.1, 0.2, ..., 1} with respective targets f(x) = x² (2.2.1)

For simplicity, the entire training set is used to train the network and then used once again
to test the network. Python outputs these test results once the network completes its training
in the format “([0], → , [0.02046748392039354])”, for each input datum. A graph is then
generated comparing the network's output at 100 equally spaced points in x ∈ (0, 1)
with the function f(x) = x².
2. The sigmoid function for this network is the logistic function from Equation 1.4.1.
3. The learning rate is set to η = 0.5.
4. The number of epochs is set to 10000. The code tells Python to print the current error of the
network at every 1000th epoch, to 9 decimal places.
5. The weights are initialised using the random module [17], seeded via its seed() function; with
no argument, the pseudo-random number generator is seeded from the system time when the
network is run, so each run differs. The weights from the input nodes to the hidden nodes are
specified to be randomly distributed in the interval (−0.5, 0.5), and similarly the weights
connecting hidden and output nodes are randomly distributed in the interval (−5, 5). The latter
weights have a greater randomisation range because the logistic function has an upper bound of 1:
for larger outputs we need larger weights, and from experimentation these values can provide
very accurate results. (A sketch of this initialisation appears after this list.)
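A minimal sketch of the initialisation described in step 5, using only the standard random module; the layer sizes here are those of the architecture shown below, but the variable names are our own:

```python
# Seed from the system clock (the default), then draw input->hidden weights
# uniformly from (-0.5, 0.5) and hidden->output weights from (-5, 5).
import random

random.seed()  # no fixed seed: each run gives different starting weights

n_in, n_hidden, n_out = 1, 2, 1
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        for _ in range(n_in + 1)]           # +1 row for the bias node
w_ho = [[random.uniform(-5.0, 5.0) for _ in range(n_out)]
        for _ in range(n_hidden + 1)]       # +1 row for the hidden bias
```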
The network structure of this ANN is based upon an example from Kasper Peeters’ unpublished
book, Machine Learning and Computer Vision [18] and takes the following architecture:
[Figure: the network architecture — layer l = 0 has one input node x and a bias node b = 1; layer l = 1 has two hidden units with logistic activation σ and a bias node b = 1; layer l = L = 2 has a single output unit producing f(x).]
where b = 1 indicates the node outputs a constant bias of value 1, σ indicates the node's activation
function is the logistic function, and l references the layer of the network, with l = L corresponding
to the final layer. It should be noted that many different structures could be used, for example with
a greater number of hidden nodes, but this simple structure succeeds.
2.3 Results and Discussion
The following are examples of results obtained from running the Neural Network program.
2.3.1 Great Result
Figure 2.1: Example of a great learning result

Figure 2.1 shows a graph that plots x against f(x) = x². The magenta line represents x² and
the black dotted line represents the network's prediction for the 100 equidistant points between 0
and 1 after training. Generally, a result like this is generated from a final error less than 2 × 10⁻⁴.
This is an especially successful case, as can be seen in Figure 2.2: the network results for the test
data each, delightfully, give the correct result when rounded to two decimal places.
Figure 2.2: The “Great Result's” respective Python output

Turning to the respective Python output in Figure 2.2, it is interesting to note the size of the
weights. They are all relatively small (less than 5). However, the input to hidden weights have
significantly increased (recalling that they were initialised randomly in the interval (−0.5, 0.5)) and
the hidden to output weights have relatively decreased (initialised between (−5, 5)). Fascinatingly,
if the initialisation of the weights were switched so that the input to hidden weights were also drawn
randomly from the interval (−5, 5), the network becomes significantly more unreliable, to the extent
that in 100 attempts no run had error less than 2 × 10⁻⁴. But why? Presumably this is because the
increased size of the weights significantly increases the inputs to the hidden nodes. This means the
logistic function's output is generally larger, which leads to a larger network output, further away
from our target data. This would cause a larger error and therefore a greater magnitude of change
in the weights of the network, which can affect the network's ability to converge on the global
minimum. This indicates that initialising the weights generally smaller allows for a more stable
training algorithm. Another notable detail is the first error output: it is relatively small at just over
3, and this serves as a point of comparison for the other results.
2.3.2 Incredulous Result
Figure 2.3: Example of an incredulous result

This solution is clearly not ideal but it does provide some very interesting insight into the problems
faced when teaching an ANN. Figure 2.3 only appears to be accurate for 2 of the 100 points tested on
the trained network. However, positively, this type of result occurred only once in 100 runs. A result
like this is only generated for a large final error, with the training beginning with initial convergence
but ultimately diverging. In this case, it causes the network to train to the shape of the logistic
function, the sigmoid function chosen for this network. The figure also shows that the bulk of the
100 points tested give an output between 0 and 0.5. Referring to Figure 2.4, the results for the test
data show that for 0.7 and above the network overestimates the intended output significantly, while
the test data for 0.6 and under are generally underestimated. This produces the significant bias
towards a smaller output from the network. In comparison to the aforementioned “Great Result”,
the initial output error for this run through was significantly higher, an increase of over 900%!

Figure 2.4: The “Incredulous Result's” respective Python output
But why such a different result? Both networks are initialised with random weights, which means
both begin with a different Error function. This means each network is attempting to find a different
global minimum using the same learning rate. Although the learning rate η was ideal for the “Great
Result”, it was clearly not ideal for this initial set of weights, for which the learning rate was unable
to escape a local minimum in the Error function. Alternatively, the weight adjustment calculation
in Equation 1.4.5 gives a large partial derivative term. This causes a huge change in weights, which
in turn causes the error function to jump away from the global minimum we're aiming for. It is
difficult to tell whether this meant the learning rate was too large or too small. However, if we
compare these theories to the error output in Figure 2.4, we see the error converges initially but
then starts to diverge. The initial weight change allowed us to reach close to a minimum, but the
ultimate divergence suggests this was a local minimum and hence the learning rate was in fact too
small to overcome the entrapment of this trough.
As discussed before, increasing the learning rate increases the risk of jumping away from
the global minimum altogether. If the number of epochs were increased for this run,
the local minimum could eventually be overcome, but convergence is too slow for this to be
worthwhile. It is for this reason that, in practice, multiple runs of the same network are
undertaken.
One should also notice the difference in weights between the two results. The “Great Result”
had smaller weights, suggesting a stable learning curve. This “Incredulous Result” did not
have a stable learning curve, due to the divergence, and this is clear in the significant weight increases.
All the input to hidden weights in Figure 2.4 are greater than the respective weights in Figure
2.2. This demonstrates that larger initialised weights do not lead to more accurate results
and actually lead to a more unstable network. Furthermore, the hidden to output weights for the
“Incredulous Result” are very small, which leads the bulk of the network outputs to lie between
0 and 0.5.
2.3.3 Common Poor Result
Figure 2.5: Example of a poor learning result

This next result is almost as common as the “Great Result”. It seems that, despite the implications
of the “Incredulous Result”, this Neural Network setup naturally predicts outputs with greater
accuracy and consistency nearer x = 1 than x = 0, and thus we regularly get a tail dipping below
f(x) = 0 when approaching x = 0. Generally for function interpolation, the network will attempt
to find a linear result to correlate the training data to its target data, and therefore, depending on
the initialised weights, the network can regularly attempt to draw a best-fit line as opposed to the
quadratic curve we require. The logistic function gives us the ability to find non-linear solutions,
but if the total inputs to the logistic nodes become too large then the outputs sit on the tails of the
logistic function, which are effectively flat, as shown:

Figure 2.6: The logistic function

This once more demonstrates the desirability of initialising smaller weights for a network, as well as
normalising training data.
The test data results in Figure 2.7 show that for inputs greater than 0.2 the network output has
consistently overestimated, which corresponds to Figure 2.5. This can again be associated with
the starting error. The initialised weights were accurate enough to allow the network to almost
instantly fall into a trough near a minimum. Unfortunately, the lack of any real convergence to a
smaller error suggests that this was a local minimum. The initial error was so small that $\Delta w^{(l)}_{ij}$ could
be negligible for some of the weights. This can cause the network to get stuck in a local minimum,
unable to jump out, leaving no chance of finding the global minimum.

Figure 2.7: The poor result's respective Python output

As was the case with the “Incredulous Result”, some weights are very large relative to our
“Great Result's” respective weights. Once more this can cause instability in the learning algorithm,
which stagnates convergence. If one weight is much larger than the others (which is the case here)
it has the controlling proportion of the input to the following node, almost making all other inputs
obsolete. This gives a lot of bias to one input and therefore makes accurate training a lot more
difficult for the smaller weights around it. Fortunately this is likely down to the initialised weights:
one weight could have been randomised much higher relative to the others, causing this downfall.
This once more highlights the importance of the range in which the weights are randomised.
2.3.4 Other Interesting Results
(a) A network whose error converged for the first step, but diverged slowly from then on.
(b) A network whose error started very small and converged slowly.

Figure 2.8: Two more examples of an inaccurate interpolation by the network. Their respective
Python outputs are placed in Appendix B.
Figure 2.8a shows accuracy for half the data, but after the error started increasing halfway
through training, the network tries to correct itself almost through a jump. This occurs because
the learning rate is now too small to escape the local minimum in the restricted number of epochs,
and thus the error continues to increase. The algorithm terminates before escaping the local
minimum and yields a generally large error for the system of 3.1 × 10⁻³. We can rectify this by
increasing the learning rate slightly, but overall it comes down to the initialised weights.

Figure 2.8b describes a network in which the initialised weights gave a relatively small first
error of roughly 0.14. The learning rate η is small and the error changes with respect to the weight
changes are small. This means $\Delta w^{(l)}_{ij}$ from Equation 1.4.5 will be very small, so convergence will
be slow. This can be overcome by an increased learning rate or more epochs but, once again, the
most important factor is the weight initialisation.
2.3.5 Run Errors
Occasionally the network fails to train entirely and the algorithm terminates prematurely. These are
called run errors and ours occur because the math range available to Python to compute equations is
bounded. Specifically, the power of our exponential in the logistic function, e, may only have powers
between −308 and 308 (This can be found by typing “import sys” in to the Python command
line, followed by a second command “sys.float info”). The following Python output shows such an
error occurrence and the final line states math range error. The remaining jargon describes the
error route through the Python code, starting with the initialisation command and stemming at the
logistic function definition:
Figure 2.9: An example run error Python output from the network
Mathematically, we can see this happening. In the lines numbered 85-87, which are boxed in white,
the code defines what we called $in^{(l)}_j$. For this network, we only have one layer using the sigmoid
function, i.e. l = 1. Recall the following equations:

$$in_j = \sum_i x_i w_{ij}$$ (2.3.1)

and also

$$\mathrm{sig}(in_j) = a_j(in_j) = \frac{1}{1 + e^{-in_j}}$$ (2.3.2)

If $in_j$ is sufficiently negative (roughly $in_j < -709$) then $e^{-in_j}$ is out of computing range and causes
the premature termination of the network's learning algorithm. If we consider this in terms of limits:

$$\lim_{in_j \to -\infty} \mathrm{sig}(in_j) = 0$$ (2.3.3)

Therefore as $in_j \to -\infty$ the output layer would only receive an input from the bias node, causing
giant error. As the bias node is only a constant, the network cannot converge on the targets for our
input data because our target is quadratic. Similarly for the upper bound:

$$\lim_{in_j \to +\infty} \mathrm{sig}(in_j) = 1$$ (2.3.4)
If $in_j$ is sufficiently large and positive then the logistic nodes saturate at 1, essentially becoming their
own bias nodes, and the same problem occurs (though without a run error, since $e^{-in_j}$ merely
underflows to 0). It should be noted that the size of the input to the logistic nodes is based upon the
initialised weights, which are randomly distributed between −0.5 and 0.5, while the training data
only has inputs between 0 and 1. Therefore, for $in_j$ to leave the computable range, the adjustment
of the weights in training must cause significant change; and as this adjustment, $\Delta w_{ij}$, depends on
the change in error with respect to the weights, the network must either be diverging, and hence
changing the weights more and more each epoch, or be stuck in a local minimum, effectively adding
a constant weight change epoch after epoch. The former can be seen in Figure 2.10a and the latter
in Figure 2.10b.
(a) A run error caused by divergence from the global minimum of ∂E/∂w_ij.
(b) A run error caused by a network stuck in a relatively large-valued local minimum.

Figure 2.10: Two more examples of run errors
In summary, run errors can occur, but this just indicates a network that would have been extremely
inaccurate. This comes down to the weight initialisation causing divergence from the global minimum
which further indicates the need for multiple run throughs to find the ideal solution.
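The failure mode, and one common guard against it, can be reproduced in a few lines. This sketch is our own (the clamping trick is not part of the report's code): math.exp raises the "math range error" once its result would exceed the largest representable double, so a very negative total input crashes the plain logistic, while a clamped version survives.

```python
# Reproducing the run error and guarding against it.
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def safe_logistic(s):
    s = max(-700.0, min(700.0, s))   # clamp the input before exponentiating
    return 1.0 / (1.0 + math.exp(-s))

print(safe_logistic(-1000))   # ~0.0, no error
try:
    logistic(-1000)           # e^{1000} overflows a double
except OverflowError as e:
    print('OverflowError:', e)  # prints "math range error"
```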
2.4 Conclusions
In general, a great number of variables and factors influence how efficient and
accurate an ANN will be. The learning rate is vital, the number of epochs plays a part in accuracy,
the initialisation of the weights is paramount and the size of the training data set impacts the
success of the Backpropagation algorithm.
The most influential factor is certainly the range in which weights are initialised. Depending on
the starting weights, the system can either converge very quickly or diverge significantly, to the
extent of a run error. This was a clear problem even in a small network. Now if we imagine
an even bigger network with more weights to randomise and train, what will happen? Will an
increase in hidden nodes allow for a more accurate network, or will training become substantially
more difficult with the increased number of weights to be altered? We will investigate these
questions in the following chapter.
It can be argued that the second most important factor for successful training is the
learning rate. The “Incredulous Result” was stuck in a minimum of high error, but the learning rate
was too small, with regards to Equation 1.4.5, to escape. Each network has different requirements
and predicting an ideal learning rate is extremely difficult. The learning rate chosen for this example
was based on experimentation, to find an η that results in a trained network similar to the
“Great Result” for as high a proportion of runs as possible. One cannot consider the learning rate the
most important factor because it is much harder to predict than the weight initialisation range:
one can simply guess, based on the bounds of the sigmoid function, the weights necessary
to output the magnitude of the target data, whereas the learning rate only has an impact after
such an initialisation, on completion of the first epoch, and therefore depends on this randomisation.
Therefore the weights are the most important factor. This raises the question: are there more
appropriate ways to initialise the weights than a random distribution within a bounded interval?
There certainly are, and this concept will be revisited later.
One should note that the training data is of significant importance too. The reason for rating
its importance below the learning rate and weight initialisation is that one is unlikely to want to
build a network with a very limited training set in the first place, on the basis that the result would
be highly unreliable. If the training set is too small then naturally the Backpropagation algorithm
will struggle to get a good picture of the function we are trying to teach it. This would indicate
the need for a large number of epochs to ensure accuracy. Unfortunately this leads to another
problem, called Overfitting: the network is taught to accurately predict the training data, but can
become inaccurate on all other data points. For example, take the network from above. Instead of
outputting a curve close to x², it could output a curve similar to a sine curve with period 0.2,
weaving through all the training points while being a completely inaccurate estimate at all other
points between 0 and 1. Extensive details of this phenomenon shall be discussed in Chapter 3.
Due to such a number of impacting factors, research is very active. To conquer the prob-
lems faced, we must first discover the limitations of a single layer feedforward MLP and then
consider methods of countering them. We begin with the Universal Approximation Theorem.
Chapter 3
Universal Approximation Theorem
3.1 Theorem
The Universal Approximation Theorem (UAT) formally states:

Let σ(·) be a non-constant, bounded and monotonically increasing continuous function.
Let $I_n$ denote the n-dimensional unit hypercube $[0, 1]^n$, and define the space of
continuous functions on the unit hypercube as $C(I_n)$. Then for any $f(x) \in C(I_n)$ with
$x \in I_n$ and any $\varepsilon > 0$, there exist $N \in \mathbb{Z}$, constants $c_i, b_i \in \mathbb{R}$ and vectors $w_i \in \mathbb{R}^n$ such that:

$$F(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left(\sum_{j=1}^{n} w_{ij} x_j + b_i\right)$$ (3.1.1)

is an approximate realisation of the function f(x); that is:

$$|F(x) - f(x)| < \varepsilon \quad \forall x \in I_n$$ (3.1.2)
Given that our logistic function is a non-constant, bounded and monotonically increasing continuous
function we can directly apply this to a single hidden layer MLP. If we now claim that we have an
MLP with n input nodes and N hidden nodes, then F(x) represents the output of such a network
with f(x) our respective target vector given an input vector x. We can appropriately choose our
hidden to output connection weights such that they equal ci and let bi represent our bias node in
the hidden layer. Finally normalising our training data to be in the interval [0, 1] we have a fully
defined single hidden layer MLP with regards to this theorem.
Therefore we can directly apply the UAT to any single hidden layer MLP that uses a sig-
moid function in its hidden layer and can conclude that any function f(x) ∈ C(In) can be
approximated by such a network. This is extremely powerful. In Chapter 1 we began with a
perceptron which had just an input and output layer. This was unable to distinguish non-linearly
separable data. By adding in this one hidden layer in a feedforward network we can now not only
distinguish between non-linearly separable data, but under certain assumptions on our activation
function, can now approximate any continuous function with a finite number of hidden nodes.
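The theorem is easy to witness numerically. In the sketch below (entirely our own illustration) we fix random inner weights $w_i$ and biases $b_i$ for N logistic hidden units and choose the output coefficients $c_i$ by least squares; the worst-case error on a fine grid typically shrinks as N grows, exactly as the UAT promises for some sufficiently large N.

```python
# Approximating a continuous function on [0, 1] by F(x) from Eq. 3.1.1:
# random w_i, b_i, with the c_i fitted by least squares.
import numpy as np

rng = np.random.default_rng(0)

def sup_error(f, n_hidden, n_grid=200):
    x = np.linspace(0, 1, n_grid)
    w = rng.uniform(-20, 20, n_hidden)                 # random inner weights
    b = rng.uniform(-20, 20, n_hidden)                 # random biases
    Phi = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))  # sigma(w_i x + b_i)
    c, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)     # choose the c_i
    return np.max(np.abs(Phi @ c - f(x)))              # sup-norm error on grid

f = lambda x: np.sin(2 * np.pi * x) + x ** 2
for N in (5, 20, 80):
    print(N, sup_error(f, N))   # the error typically falls as N increases
```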
3.2 Sketched Proof
Cybenko in 1989 detailed a proof of the UAT in a paper named “Approximation by Superpositions of a
Sigmoidal Function” [19]. The aim of his paper was to find the assumptions necessary
for equations of the form of Equation 3.1.1 to be dense in $C(I_n)$. In our theorem we explain that f(x) can
be approximated ε-close in $C(I_n)$, and hence F(x) gives the density property. We will now investigate
Cybenko's 1989 paper to prove the conditions for the theorem to hold.
Definition: First of all, let's define what it means for σ to be sigmoidal, as in Cybenko's
paper [19]. σ is sigmoidal if:

$$\sigma(x) \to \begin{cases} 1 & \text{as } x \to +\infty \\ 0 & \text{as } x \to -\infty \end{cases}$$ (3.2.1)
Notice this is exactly the case for the logistic function as shown in the previous chapter.
We can describe why Cybenko's result should hold via a logical argument, with reference to
a post made by Matus Telgarsky [20]. Firstly we recall that a continuous function on a compact
set is uniformly continuous. $I_n$ is clearly compact, so any continuous function we wish to approximate
on it is uniformly continuous and can thus be approximated by a piecewise constant function. In
his post, Telgarsky describes how a piecewise constant function can then be represented by a Neural
Network as follows [20]:
• An indicator function is defined as follows: given a set X and a subset Y ⊆ X, then for any x ∈ X

$$I_Y(x) = \begin{cases} 1 & \text{if } x \in Y \\ 0 & \text{otherwise} \end{cases}$$ (3.2.2)
• For each constant region of the piecewise constant function we can form a node within a Neural
Network that effectively acts as an indicator function, and multiply the node’s output by a
weighted connection equal to the constant required.
• We want to form this Neural Network such that it uses a sigmoidal function, in the form of F(x)
in Equation 3.1.1. To form such an indicator function using sigmoidal nodes we can take
advantage of the limits defined above: the weighted connections acting as input to a node can
be large positively or negatively, making its output arbitrarily close to 1 or 0 respectively.
• The final layer of the Neural Network needs just a single node whose output is the sum of
these “indicators” multiplied by appropriately chosen weights to approximate the piecewise
constant function (a numerical sketch of this construction follows below).
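A minimal sketch of that construction (our own, with an illustrative steepness parameter k): the difference of two steep logistics approximates the indicator of an interval, and a weighted sum of such "indicators" approximates a piecewise constant function.

```python
# Building approximate interval indicators from steep sigmoids.
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def indicator(x, p, q, k=200.0):
    # sigma(k(x - p)) rises near p; subtracting sigma(k(x - q)) drops near q,
    # giving roughly 1 on (p, q) and roughly 0 elsewhere for large k.
    return sigma(k * (x - p)) - sigma(k * (x - q))

x = np.linspace(0, 1, 11)
# Approximate a piecewise constant function: 0.3 on [0, 0.5), 0.8 on [0.5, 1].
F = 0.3 * indicator(x, -0.1, 0.5) + 0.8 * indicator(x, 0.5, 1.1)
print(np.round(F, 2))   # close to the target away from the breakpoint at 0.5
```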
This is what we shall now attempt to show mathematically.
Defining $M(I_n)$ as the space of finite, signed regular Borel measures on $I_n$, we are in a position to
explain what it means for σ to be discriminatory.

Definition: σ is discriminatory if, for a measure $\mu \in M(I_n)$,

$$\int_{I_n} \sigma\!\left(\sum_{j=1}^{n} w_j x_j + b\right) d\mu(x) = 0 \quad \forall w \in \mathbb{R}^n, \; b \in \mathbb{R}$$ (3.2.3)

implies that µ = 0. Notice this integrand takes the same form as in Equation 3.1.1. With this
definition we are now in a position to consider the first theorem of Cybenko's paper [19]:
Theorem 1: Let σ be a continuous discriminatory function. Then given any $f \in C(I_n)$
and any $\varepsilon > 0$, there exists F(x) of the following form:

$$F(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left(\sum_{j=1}^{n} w_{ij} x_j + b_i\right)$$ (3.2.4)

such that:

$$|F(x) - f(x)| < \varepsilon \quad \forall x \in I_n$$ (3.2.5)
This theorem is extremely close to the UAT but it does not impose the conditions on σ required for
the application to Neural Networks yet. We will now prove this theorem.
3.2.1 Proof of Theorem 1
To fully understand and investigate the proof in Cybenko's 1989 paper, we must begin by considering
two theorems: the Hahn-Banach theorem and the Riesz-Markov-Kakutani theorem, whose proofs
will be omitted.
The Hahn-Banach Theorem [21, 22]: Let V be a real vector space, $p : V \to \mathbb{R}$ a sublinear
function (i.e. $p(\lambda x) = \lambda p(x) \; \forall \lambda \in \mathbb{R}^+, x \in V$ and $p(x + y) \leq p(x) + p(y) \; \forall x, y \in V$) and
$\varphi : U \to \mathbb{R}$ a linear function on a linear subspace $U \subseteq V$ which is dominated by p on U (i.e.
$\varphi(x) \leq p(x) \; \forall x \in U$). Then there exists a linear extension $\psi : V \to \mathbb{R}$ of φ to the whole space V such
that:

$$\psi(x) = \varphi(x) \quad \forall x \in U$$ (3.2.6)
$$\psi(x) \leq p(x) \quad \forall x \in V$$ (3.2.7)
Now defining $C_c(X)$ as the space of continuous, compactly supported, complex-valued functions on a
locally compact Hausdorff space X (a Hausdorff space is one in which any two distinct points of X can
be separated by neighbourhoods [23]), we can state the Representation Theorem.

Riesz-Markov-Kakutani Representation Theorem [24, 25, 26]: Let X be a locally
compact Hausdorff space. Then for any positive linear functional ψ on $C_c(X)$ there exists a unique
Borel measure µ on X such that:

$$\psi(f) = \int_X f(x) \, d\mu(x) \quad \forall f \in C_c(X)$$ (3.2.8)
With these two theorems we are now in a position to understand Cybenko’s proof of Theorem 1
(written in italics), adapted to our notation from his paper [19]:
Let $S \subset C(I_n)$ be the set of functions of the form F(x) as in Equation 3.1.1. Clearly S
is a linear subspace of $C(I_n)$. We claim that the closure of S is all of $C(I_n)$.

Here, Telgarsky helps find our route of argument [20]. The single node in the
output layer, as defined in our logical argument earlier, is a linear combination of the
elements in the previous layer. The nodes in the hidden layer are functions, and thus this
linear combination from the output node is also a function contained in the subspace
of functions spanned by the hidden layer's outputs. This subspace has the same
properties as the space spanned by the hidden node functions, but we need to show it is
closed. Thus Cybenko argues, by means of contradiction, that this subspace is not only closed
but contains all continuous functions.
Assume that the closure of S is not all of $C(I_n)$. Then the closure of S, say R,
is a closed proper subspace of $C(I_n)$. By the Hahn-Banach theorem, there is a bounded
linear functional on $C(I_n)$, call it L, with the property that $L \neq 0$ but $L(R) = L(S) = 0$.
By the Riesz(-Markov-Kakutani) Representation Theorem, this bounded linear functional
L is of the form:

$$L(h) = \int_{I_n} h(x) \, d\mu(x)$$ (3.2.9)

for some $\mu \in M(I_n)$, for all $h \in C(I_n)$. In particular, since $\sigma\left(\sum_{j=1}^{n} w_j x_j + b\right)$ is in R for
all w and b, we must have that:

$$\int_{I_n} \sigma\!\left(\sum_{j=1}^{n} w_j x_j + b\right) d\mu(x) = 0$$ (3.2.10)

for all w and b.

However, we assumed that σ was discriminatory, so this condition implies
that µ = 0, contradicting our assumption. Hence, the subspace S must be dense in $C(I_n)$.
This proof shows that if σ is continuous and discriminatory then Theorem 1 holds. All we need
to do now is show that for any continuous sigmoidal function σ, σ is discriminatory. This will then
give us all the ingredients to prove the Universal Approximation Theorem.
Cybenko gives us the following Lemma to Theorem 1:
Lemma 1: Any bounded, measurable sigmoidal function, σ, is discriminatory. In particular,
any continuous sigmoidal function is discriminatory.
The proof of this Lemma relies heavily on measure theory and is omitted from this report;
it can be found via the reference for Cybenko's 1989 paper [19].
Finally, we can state Cybenko’s second theorem which gives us the Universal Approximation
Theorem and shows that an MLP with only one hidden layer and an arbitrary continuous sigmoidal
function allows for approximation of any function f ∈ C(In) to arbitrary precision:
Theorem 2: Let σ be any continuous sigmoidal function. Then given any f ∈ C(In) and
any ε > 0, there exists an F(x) of the following form:

F(x) = Σ_{i=1}^{N} c_i σ( Σ_{j=1}^{n} w_{ij} x_j + b_i )   (3.2.11)

such that

|F(x) − f(x)| < ε   ∀x ∈ In   (3.2.12)
The proof for this theorem is a combination of Theorem 1 and Lemma 1.
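To make the form of Equation 3.2.11 concrete, here is a minimal sketch (in Python with NumPy; the function names and parameter values are our own illustrative assumptions, not taken from the report's Example code) of evaluating F(x) for given parameters c, w and b, with the logistic function standing in for the sigmoidal σ:

```python
import numpy as np

def logistic(z):
    """A continuous sigmoidal function: tends to 0 as z -> -inf and 1 as z -> +inf."""
    return 1.0 / (1.0 + np.exp(-z))

def F(x, c, w, b):
    """Evaluate the single hidden layer sum of Equation 3.2.11.

    x : input vector of length n
    c : output weights, length N
    w : hidden weights, shape (N, n)
    b : hidden biases, length N
    """
    return np.dot(c, logistic(np.dot(w, x) + b))

# Example: N = 3 hidden nodes on I_2 (n = 2), with arbitrary parameters
rng = np.random.default_rng(0)
x = np.array([0.4, 0.7])
c, w, b = rng.normal(size=3), rng.normal(size=(3, 2)), rng.normal(size=3)
print(F(x, c, w, b))
```

The theorem says that, for a suitable (finite but unspecified) N and suitable parameters, a sum of exactly this shape can be brought ε-close to any continuous f on In.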
3.2.2 Discussion
Interestingly, Cybenko mentions in his 1989 paper that in Neural Network applications,
sigmoidal activation functions are typically taken to be monotonically increasing. We assume
this for the UAT, but for the results in Cybenko's paper and the two theorems we investigated,
monotonicity is not needed. Although this appears an unnecessary condition on the UAT, a
monotonically increasing activation function allows for simpler approximating and is therefore a
sensible condition to include. If the activation function were not monotonically increasing, we can
assume that training a network would generally either take longer or struggle to converge on the
global minimum of the Error function. This is because a non-monotonic activation function has
interior minima and maxima where its derivative is zero, which causes problems in the
Backpropagation algorithm when updating weights: some weight changes can become 0 at these
stationary points of the activation function.
In addition to this proof from Cybenko, it is worth noting that 2 years later in 1991, Kurt
Hornik published a paper called “Approximation Capabilities of Multilayer Feedforward Networks”
in which he proved that the ability for a single hidden layer MLP to approximate all continuous
functions was down to its architecture rather than the choice of the activation function [27].
With the Universal Approximation Theorem in mind we can conclude that if a single hidden
layer MLP fails to learn a mapping under the defined constraints, it is not down to the architecture
of the network but the parameters that define it. For example, this could be poorly initialised
weights, it could be the learning rate or it could even be down to an insufficient number of hidden
nodes to suitably approximate the function with too few degrees of freedom to produce a complex
enough approximation.
Another thing to note is this theorem just tells us we have the ability to approximate any
continuous function using such an MLP given a finite number of hidden nodes. It does not tell us
what this finite number actually is or even give us a bound. However, when we want to compute
extremely complex problems we are going to require a large number of hidden nodes to cope with
the number of mappings represented. The number of hidden nodes is important with regards to an
accurate network and we will now consider the consequences of a poorly chosen hidden layer size.
3.3 Overfitting and Underfitting
Once we require a huge number of nodes to solve complex approximations, the number of calcu-
lations the network has to do increases significantly in an MLP. Considering i input nodes and o
output nodes, adding one more hidden node increases the number of weighted connections by i + o.
Presumably if a large number of hidden nodes are required, a large number of input and output
nodes are also present. Ideally we want to minimise this number of increased calculations because
a Neural Network that takes days, weeks or even months to train is completely inefficient and one
of the reasons Neural Networks fell out of popularity in the ’70s. However, balancing training time
and efficiency with a network that can train accurately is surprisingly difficult.
Given a set of data points, the idea of a Neural Network is to teach itself an appropriate
approximation to data whilst generalising to unseen data accurately. Unlike in Chapter 2 in which
the network was trained using perfect targets for the training data, in reality the training data is
likely to have noise. Noise can be defined as the error from the ideal solution. For example, using
the mapping x → x², our training data could in fact have the mappings 1 → 1.04 and 0.5 → 0.24. These
are not precise but in practice not every data set can be completely accurate. The noise is what
causes this inaccuracy. We want our Neural Network to find the underlying function of the training
data despite the noise as such:
Figure 3.1: An example of a curve fitting to noisy data. Image taken from a lecture by Bullinaria
in 2004 [28]
The blue curve represents the underlying function, similar to how our Example had the underlying
function x², and the circles represent the noisy data points forming the training set. We want a
Neural Network to approximate a function, using this noisy data set, as close to the blue curve as possible.
However, two things may occur in the process:
1. Underfitting is a concept in which a Neural Network has been “lazy” and has not learned
how to fit the training data at all, let alone generalise to unseen data. This yields an all-round
poor approximator.
2. Overfitting is the opposite of Underfitting. In this concept a Neural Network has worked
extremely hard to learn the training data. Unfortunately, although the training data may have
been approximated perfectly, the network has poorly “filled in the gaps” and thus generalised
incompetently. This yields a poor approximator to unseen data.
To understand these concepts, consider the following figure that illustrates these cases to their
extremes:
Figure 3.2: An illustration of Underfitting (left) and Overfitting (right) of an ANN. Image taken
from a lecture by Bullinaria in 2004 [28]
The graph on the left shows extreme Underfitting of the training data and clearly the red
best fit line would be a poor approximation for almost all points on the blue curve. The
graph on the right shows extreme Overfitting in which the network has effectively generalised using
a “dot-to-dot” method and again provides a poor approximation to all points outside the training set.
Why might this occur? First let’s consider the concepts behind the error of a Neural Net-
work.
3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff
Our aim here is to identify the expected prediction error of a trained Neural Network when presented
with a previously unseen data point. Ideally we want to minimise the error between the network’s
approximated function and the underlying function of the data. Due to the addition of noise to our
training data, we may not want to truly minimise the error of our output compared to our target
which was:
E_i = (1/2) · ( t_i − a^{(L)}(x_i) )²   (3.3.1)
If we have noisy data then the minimum of this error could cause Overfitting. We want to ensure
the ANN is able to generalise beyond the noise to the underlying function and this generalisation
will not give the minimal error on a data point by data point basis.
In 2013, Dustin Stansbury wrote an article called “Model Selection: Underfitting, Overfitting,
and the Bias-Variance Tradeoff” [29] and we shall follow his argument investigating Underfitting,
Overfitting and the Bias-Variance Tradeoff with adapted notation with close attention to the
section named “Expected Prediction Error and the Bias-variance Tradeoff”. Firstly let f(x) be
our underlying function we wish to accurately approximate and let F(x) be the approximating
function generated by the Neural Network. Recalling x are the training data inputs and t their
respective targets, this F(x) has been fit using our x − t pairs. Therefore we can define an expected
approximation over all data points we could present the network with from F(x) as such:
Expected approximation over all data points = E[F(x)]   (3.3.2)
Similar to Stansbury, our overall goal is to minimise the error between previously unseen data points
and the underlying function we’ve approximated, f(x). Therefore we want to find the expected
prediction error of a new data point (x∗, t∗ = f(x∗) + ε), where ε is the constant that accounts for
noise in the new data point. Thus we can naturally define our expected prediction error as:

Expected prediction error = E[ (F(x∗) − t∗)² ]   (3.3.3)
To achieve our overall goal, we therefore intend to minimise Equation 3.3.3 instead of minimising
Equation 3.3.1. This does not affect the Backpropagation learning algorithm imposed in Chapter
1 and theoretically these errors will be roughly the same given a successfully trained Neural Network.
Let’s investigate our expected prediction error further. First we will take the following sta-
tistical equations for granted which can also be found on Stansbury’s article [29]:
Bias of the approximation F(x) = E[F(x)] − f(x)   (3.3.4)

Variance of the approximation F(x) = E[ (F(x) − E[F(x)])² ]   (3.3.5)

E[X²] − E[X]² = E[ (X − E[X])² ]   (3.3.6)
The bias of the approximation function represents the deviation between the expected approximation
over all data points, E[F(x)], and the underlying function f(x). If we have a
large bias, then we can conclude that our approximation function is generally a long way from the
underlying function we are aiming for. If the bias is small then our approximation function is an
accurate representation of the underlying function.

The variance of the approximation function is the average squared difference between an
approximation function based on a single data set (i.e. F(x)) and the expected approximation over
all data sets (i.e. E[F(x)]). A large variance indicates that the approximation built from our single
data set generalises poorly across data sets; a small variance indicates that it generalises well.

Preferably we would like as small a bias and variance as possible, to give us the best approximation
to the underlying function. Equation 3.3.6 is a well-known identity relating these quantities. The
statement and proof can be found posted online by Dustin Stansbury [30] as an extension of his
argument towards the Bias-Variance Tradeoff [29].
Now to investigate our expected prediction error further. This argument follows Stansbury’s
argument in his section named ”Expected Prediction Error and the Bias-variance Tradeoff” [29]
with adapted notation:
E[ (F(x∗) − t∗)² ] = E[ F(x∗)² − 2F(x∗)t∗ + t∗² ]
                   = E[F(x∗)²] − 2E[F(x∗)]f(x∗) + E[t∗²]
                   = E[ (F(x∗) − E[F(x∗)])² ] + E[F(x∗)]² − 2E[F(x∗)]f(x∗) + f(x∗)² + E[ (t∗ − f(x∗))² ]
                   = E[ (F(x∗) − E[F(x∗)])² ] + ( E[F(x∗)] − f(x∗) )² + E[ (t∗ − f(x∗))² ]
                   = variance of F(x∗) + (bias of F(x∗))² + variance of the target noise   (3.3.7)
As Stansbury notes, the variance of the target data noise provides us a lower bound on the expected
prediction error. Logically this makes sense. It indicates that if our data set has noise and hence
isn’t accurate, we will have some error in the prediction. Now we can see the effects of bias and
variance on our expected prediction error, which is what we want to minimise.
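To see the decomposition of Equation 3.3.7 numerically, the following is a minimal simulation sketch (in Python with NumPy; the polynomial fit standing in for F(x), and all names and values, are our own illustrative assumptions rather than the report's Example network). Many noisy training sets are drawn from the underlying function f(x) = x², a model is fit to each, and the three terms are estimated at a new point x∗:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x**2          # underlying function
noise_sd = 0.05             # standard deviation of the target noise
x_train = np.linspace(0, 1, 21)
x_star, degree = 0.3, 1     # new data point; try degree 1 (high bias) vs 9 (high variance)

# Fit F(x) on many independently drawn noisy training sets
predictions = []
for _ in range(2000):
    t = f(x_train) + rng.normal(0, noise_sd, x_train.size)
    coeffs = np.polyfit(x_train, t, degree)
    predictions.append(np.polyval(coeffs, x_star))
predictions = np.array(predictions)

bias_sq = (predictions.mean() - f(x_star))**2       # (E[F(x*)] - f(x*))^2
variance = predictions.var()                        # E[(F(x*) - E[F(x*)])^2]
noise_var = noise_sd**2                             # variance of the target noise
t_star = f(x_star) + rng.normal(0, noise_sd, predictions.size)
expected_error = ((predictions - t_star)**2).mean() # E[(F(x*) - t*)^2]

print(bias_sq + variance + noise_var, expected_error)  # the two should roughly agree
```

Raising the degree shrinks the bias term but inflates the variance term, which is exactly the tradeoff described next.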
Similarly to Bullinaria in his 2004 lecture [28], we can now investigate our extreme examples
with regards to this Equation 3.3.7. If we pretend our network has Underfitted extremely and take
F(x) = c where c is some constant then we are going to have a huge bias. However, our variance
will be zero which overall will give a large expected prediction error. Alternatively assume our
network has Overfitted extremely and F(x) is a very complicated function of large order such that
it fits our training data perfectly. Then our bias is zero but our variance on the data is equal to the
variance on the target noise. This variance could be huge in practice depending on the data set you
are presenting your trained Neural Network with. This defines the Bias-Variance Tradeoff.
Preferentially we wanted to minimise both the bias and the variance. However, this explanation
shows that as one increases, the other decreases and vice versa. This means there is a point at
which the bias and variance together provide the smallest expected prediction error, which completes
our aim. If we favour bias or variance too much then we risk running into the problems of
Underfitting and Overfitting as described above.
3.3.2 Applying Overfitting and Underfitting to our Example
Having now researched the concepts of Overfitting and Underfitting and the reasons for them occur-
ring we can illustrate them using our Example in Chapter 2. To put this into practice a couple of
alterations had to be made to the code used for Chapter 2 in Appendix A:
1. Firstly to ensure clarity in the Overfitting and Underfitting, noise was added to our data points
in our training set. I also added in more data points to our training set to finally give us the
following set of training data:
(0, -0.05), (0.05, -0.0026), (0.1, 0.011), (0.15, 0.022), (0.2, 0.041), (0.25, 0.0613), (0.3, 0.093),
(0.35, 0.12), (0.4, 0.165), (0.45, 0.2125), (0.5, 0.26), (0.55, 0.295), (0.6, 0.364), (0.65, 0.4225),
(0.7, 0.49), (0.75, 0.553), (0.8, 0.651), (0.85, 0.72), (0.9, 0.805), (0.95, 0.9125), (1, 1.04)
Recalling our original data set consisted of x ∈ {0, 0.1, 0.2, ..., 1}.
2. The number of epochs was increased from 10000 to 40000.
3. The Neural Network was originally defined to produce a network with 2 input nodes, 3 hidden
nodes and 1 output node which includes the bias nodes on the input and hidden layer. For
the following examples the number of hidden nodes will be changed appropriately to yield the
desired results. This number will be specified for each example and will include the bias node.
Everything else remains the same, including the learning rate and the number of test data points
to be plotted on the graphs.
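As a rough illustration of how pairs like those listed above can be produced, here is a short sketch (Python with NumPy; the seed and noise level are our own assumptions, and the listed training set was not necessarily generated this way):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(0, 1.0001, 0.05)                 # 21 inputs in [0, 1] at steps of 0.05
targets = x**2 + rng.normal(0, 0.02, x.size)   # x -> x^2 plus small Gaussian noise
training_set = list(zip(np.round(x, 2), np.round(targets, 4)))
print(training_set[:3])                        # pairs of the same shape as the listed set
```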
Let’s begin by considering a control graph:
Figure 3.3: A graph to show convergence can occur with the new code setup. This network used
4 hidden nodes.
Figure 3.3 shows a graph generated by an MLP which used 4 hidden nodes and proves that the new
setup for our Example Network can produce results extremely similar to the “Great Result” from
Chapter 2.
Now to demonstrate Overfitting and Underfitting of the network. Overfitting can occur if
we allow the bias to be low and we can generate this problem by building the network to be
too complex. Therefore the network was built with an excessive number of nodes, 11 in total.
Similarly, we can illustrate Underfitting by allowing the network to be too simple and thus unable
to appropriately map each training data point to its target:
(a) A network whose training has caused Overfitting.
A total of 11 hidden nodes were used.
(b) A network whose training has caused Underfitting. A total of 2 hidden nodes were used.
Figure 3.4: Two graphs representing our Example Network’s ability to Overfit (left) or Underfit
(right) under a poorly chosen size of hidden layer.
Figure 3.4a under inspection with the training set demonstrates Overfitting, although not to the
extreme as was discussed earlier. The excessive number of hidden nodes allows the network to form a
much more complex approximation function than a quadratic and therefore it attempts to hit all the
noisy training data points as best as possible. This could be exaggerated further by increasing the
number of epochs, allowing the network to train longer and become even more suited to the train-
ing data. However, for fairness and continuity’s sake I did not change this for the Overfitting example.
Figure 3.4b shows clear Underfitting. With just 2 hidden nodes in total, of which only
the sigmoid node can produce a function that isn't linear, the network has nowhere near enough
flexibility to map our 21 data points to any sort of accuracy. It therefore compromises by effectively
approximating a step function.
This clearly indicates the necessity for an appropriately sized hidden layer in a single hidden
layer MLP or in fact any Neural Network which can exhibit Underfitting and Overfitting. Too
many hidden nodes allows the network to Overfit and too few gives the network no ability to map
all data points accurately and hence Underfits. So in general, how can we avoid these problems?
3.3.3 How to Avoid Overfitting and Underfitting
Bullinaria highlights some important analysis in methods of preventing Overfitting and Underfitting
[28]. To help prevent the possibilities of Overfitting we can consider the following when building our
network:
• Do not build the network with too many hidden nodes. If Overfitting is clearly occurring
during run throughs, reducing the number of hidden nodes should help.
• The network can be instructed to cease training when there is evidence of Overfitting beginning
to occur. If the error on a test set of data, measured after each epoch of training, increases for a
pre-defined threshold of consecutive measurements, say 10 epochs, then the network can be commanded
to stop training and use the solution prior to the increase of the errors (see the sketch after this
list). It is important not to use the training data for this because Overfitting could then occur
without being noticed.
• Adding noise to the training data is in fact recommended by Bullinaria [28] because it gives
the network the chance to find a smoothed-out approximation, whereas if one data point were
an anomaly in the data set (and could be described as the only point with noise) then this
could dramatically affect the network's ability to predict unseen data in a neighbourhood of
this anomaly.
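The early-stopping idea from the second bullet can be sketched as follows (a minimal, hypothetical Python sketch: `train_one_epoch` and `test_error` stand in for whatever training loop and error measure the network actually uses, and are not defined in this report):

```python
import copy

def train_with_early_stopping(network, train_data, test_data,
                              max_epochs=40000, patience=10):
    """Stop training once the test-set error has risen for `patience`
    consecutive epochs, and return the weights from before the rise."""
    best_error = float("inf")
    best_network = copy.deepcopy(network)
    bad_epochs = 0
    for epoch in range(max_epochs):
        train_one_epoch(network, train_data)    # hypothetical: one Backpropagation pass
        error = test_error(network, test_data)  # hypothetical: error on held-out data
        if error < best_error:
            best_error, bad_epochs = error, 0
            best_network = copy.deepcopy(network)
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                           # evidence of Overfitting: stop
    return best_network
```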
To assist in Underfitting avoidance, being mindful of the following helps:
• If Underfitting is occurring, increasing the number of parameters in the system is necessary.
This can be done by increasing the size of the hidden layer or even adding in other layers. If the
system has too few parameters to represent all the mappings, it will be unable to accurately
approximate each mapping.
• The length of training must be long enough to allow for suitable convergence to the global
minimum of the Error function. If you only train your network for 1 epoch, chances are your
approximation function will be incredibly inaccurate. On the other hand, you don’t want to
train too long and risk Overfitting.
3.4 Conclusions
In this Chapter we learned about the Universal Approximation Theorem which informally states
that we can approximate any continuous function to arbitrary accuracy using a single hidden
layer MLP with an arbitrary sigmoidal function. We then investigated Cybenko’s 1989 paper [19]
detailing his proof on the matter and became aware of the fact it was the architecture of the MLP
that allowed this rather than the choice of activation function as proved by Kurt Hornik in 1991
[27].
Furthermore, the Universal Approximation Theorem does not provide us with any sort of
bound or indication on the number of hidden nodes necessary to do this, it just provides the
knowledge that we can approximate any continuous function. This brought into question the
considerations necessary for appropriately choosing the size of our hidden layer.
Moreover, we investigated the problems associated with this choice, most notably Overfitting
and Underfitting. We discovered that the Bias and Variance of our data set and ultimate approx-
imation function were closely linked and we had to find a favourable Tradeoff in which the sum
of the Bias squared with the Variance was minimised, in order to minimise the expected prediction
error of a new data point when presented to the trained network. Tending towards too low a Bias
causes Overfitting, while tending towards too low a Variance causes Underfitting.
Finally we illustrated, using an adapted version of our Example Network from Chapter 2,
that under these conditions we can show evidence of Underfitting and Overfitting occurring. We
were able to show Underfitting by minimising the number of nodes in the hidden layer, preventing
the network from having the required parameters to map all of our data accurately. Similarly,
excessively increasing the size of the hidden layer allowed too great a number of mappings which
resulted in Overfitting.
It is fair to say that the choice of the number of hidden nodes is quite delicate. There are
no concrete theorems to suggest how many hidden nodes to choose for an MLP with regards to
input and output layer sizes and training data set sizes. Therefore we can conclude that slowly
guessing and increasing the number of nodes in a single hidden layer is practically useless when faced
with a complex problem, for example image recognition. Given a 28x28 pixel picture, we already
need 784 input nodes to begin with. Combining this with the length of time it can take to train a
network of such huge size calls into question the efficiency of a single hidden layer MLP. Additionally,
considering preventing Underfitting can essentially come down to a sensible training time and a
suitable number of parameters to encompass all the mappings necessary, perhaps it is useful to
contemplate a greater number of hidden layers, especially after appreciating the improvement the
single hidden layer MLP had on the perceptron. This shall be our next destination.
Chapter 4
Multiple Hidden Layer MLPs
4.1 Motivation for Multiple Hidden Layers
In Chapter 3 we ascertained that to prevent Underfitting we essentially just need to ensure there
are enough parameters in the MLP to allow for all mappings of our data (and train the network
for suitably long). We also theorised that one layer may cause a problem in approximation, due
to the unknown number of hidden nodes required for an approximation to be accurate to our
underlying continuous function, as in the Universal Approximation Theorem. As well as the problem
of approximation, a large number of hidden nodes would require a large training time as each
connection requires a calculation by the Backpropagation algorithm to update the associated weight.
We shall now consider the impact of adding one extra hidden node to an MLP and then
comparing this with other MLPs with the same total number of hidden nodes which have a slightly
different architecture (i.e. more hidden layers). We will then compare the number of calculations
each MLP would have to undergo in training via the number of weighted connections in each
MLP and simultaneously compare the complexity to which the MLPs are able to model. Overall,
we aim to minimise the number of calculations undertaken by the Neural Network and maximise
its flexibility in the number of mappings it can consider. In theory, the smaller the number
of calculations the Neural Network has to make, the shorter training time will be. Similarly,
maximising the flexibility of the network by finding the simplest architecture for a certain flexibility
requirement should also decrease training time. This is because the simplest architecture really
means the fewest total nodes in the network and therefore fewer calculations between inputting x
and receiving an output F(x).
First let’s define our base MLP, the one which we will be adding a hidden node to before
restructuring the hidden nodes for comparison. Let the number of connections := #connections.
Similarly let the number of routes through the network (i.e. any path from any input node to any
output node) := #routes. Then our base MLP will have the structure of 2 input nodes, 4 hidden
nodes and 2 output nodes (i.e. structure = 2-4-2):
Structure: 2-4-2; #connections = 16; #routes = 16
Figure 4.1: Our base MLP, with input layer l = 0, hidden layer l = 1 and output layer l = L = 2
We can simply check the number of connections and routes by hand to find the numbers are correct.
Recalling our input layer is l = 0 and our output layer is l = L allows us to define the following
equations for calculating #connections and #routes:
|l| := the number of nodes in layer l

#connections := Σ_{l=0}^{L−1} |l| · |l+1|   (4.1.1)

#routes := Π_{l=0}^{L} |l|   (4.1.2)
If we check these equations with Figure 4.1 we find the numbers match up. We can understand
where these equations come from by a simple logical argument. For the number of connections, we
know an MLP must be fully connected as this is one of its defining features. Therefore every node
from one layer will connect to every node on the following layer and this provides Equation 4.1.1.
For the number of routes, we can say we have 2 choices for the first node on our path in Figure
4.1. We then have 4 choices for the second node on our path and 2 choices for our output node to
complete our path. As an MLP is fully connected, this generates Equation 4.1.2.
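These two formulas are easy to check mechanically. Here is a small sketch (Python; the helper names are our own) that computes both quantities for any layer-size list and reproduces the counts quoted for the figures in this section:

```python
import math

def n_connections(layers):
    """Equation 4.1.1: sum of |l| * |l+1| over consecutive fully connected layers."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

def n_routes(layers):
    """Equation 4.1.2: product of the layer sizes."""
    return math.prod(layers)

# The structures of Figures 4.1-4.7:
for structure in ([2, 4, 2], [2, 5, 2], [2, 4, 1, 2], [2, 3, 2, 2],
                  [2, 2, 2, 1, 2], [2, 2, 1, 1, 1, 2], [2, 2, 2, 2]):
    print(structure, n_connections(structure), n_routes(structure))
```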
Next we intend to add one more node to our hidden layer of our base MLP and then go
about restructuring our hidden nodes. This will give us a total of 5 hidden nodes to work with and
we can see the increase in #connections and #routes:
Structure: 2-5-2
#connections = 20
#routes = 20
Figure 4.2: An MLP which has one additional hidden node compared to our base MLP
#connections has increased by 4 and similarly #routes has increased by 4. The addition of this
node increases both the number of mappings available and the number of calculations for training
as expected. Of course this is not what we are aiming for and it’s helpful to notice that all we’ve
done is increase our risk of Overfitting.
Now let’s investigate what happens if we change the structure of our hidden nodes by putting our
additional hidden node into its own second hidden layer as such:
Structure: 2-4-1-2
#connections = 14
#routes = 16
Figure 4.3: A restructured MLP of Figure 4.2
We find the number of routes this network provides is the same number of routes our base MLP
provides. However, in our new MLP with 2 hidden layers here, we only have 14 connections in com-
parison to the original 16. This is good news because we achieve the same flexibility for mappings
but there are fewer calculations required in the Backpropagation algorithm and hence faster training.
Further, let’s consider a few more different architectures of the MLP with a total of 5 hid-
den nodes:
Structure: 2-3-2-2
#connections = 16
#routes = 24
Figure 4.4: A second restructure of the MLP in Figure 4.2
Structure: 2-2-2-1-2
#connections = 12
#routes = 16
Figure 4.5: A third restructure of the MLP in Figure 4.2
Structure: 2-2-1-1-1-2
#connections = 10
#routes = 8
Figure 4.6: A fourth restructure of the MLP in Figure 4.2
Figure 4.4 is arguably the most favourable choice of structure for 5 hidden nodes given the size
of the input and output layers we have. This is because it gives us the greatest ratio of routes to
connections at 3 : 2, maximising the complexity of the network for a minimal number of calculations,
and hence training time, while keeping the network simple.
Figure 4.5 improves on Figure 4.3 in that it gives the same number of routes but even
fewer connections. This suggests it could also be an optimal multiple hidden layer choice for our
base MLP.
As one would expect, stretching the number of hidden nodes to more and more layers re-
sults in poorer networks for our aims. Naturally when building a network, one would not believe a
series of one node layers would have any benefit to us. This would just cause serious Underfitting
and hence Figure 4.6 is here to demonstrate how increasing a network by more and more layers
isn’t beneficial.
We can conclude that restructuring the architecture of the network has interesting benefits when
the intention is to minimise training time by minimising connections, and hence calculations, without
losing the approximation capabilities our Neural Network is able to produce. One may think
applying our logic to our base MLP itself could show similar benefits, so let's check:
Structure: 2-2-2-2
#connections = 12
#routes = 16
Figure 4.7: An advantageous restructure of our base MLP in Figure 4.1
As one may have surmised, we can theoretically decrease training time without loss of generality
to the number of mappings available. This form of our base MLP in Figure 4.1 gives us the same
number of routes but with 4 fewer connections. Figure 4.7 actually yields identical results to Figure
4.5 except Figure 4.7 uses fewer nodes and hence should be even faster in training because there is
one fewer node to undertake calculations.
In conclusion, the advantages to adding more hidden layers include:
• A simplified architecture without loss of mapping ability
• Shorter training times due to a decrease in calculations within the learning algorithm
These advantages are significantly amplified if we need millions of hidden nodes to satisfy an ap-
proximation function ε-close to an underlying function as in the Universal Approximation Theorem.
Breaking a large single hidden layer into multiple layers yields benefits. For an example, assume
we have n input nodes, n output nodes and 10⁶ hidden nodes. Then for a single layer MLP we would
have:

#connections = 2n · 10⁶,   #routes = n² · 10⁶

If we wanted to find an MLP with 2 hidden layers with the same mapping ability, then we would
need the sizes of our two hidden layers, l = 1 and l = 2, to satisfy |l = 1| · |l = 2| = 10⁶. Assuming
they are the same size for simplicity, we can let each hidden layer contain just 1000 nodes (because
#routes = n · 10³ · 10³ · n = n² · 10⁶ as before). We have already decreased the number of nodes in
the network by 10⁶ − 2000 = 998,000. This is a significant decrease in the number of calculations
required before an output of the network is given.

For the connections, #connections = n · 10³ + 10³ · 10³ + 10³ · n = 2000n + 10⁶, and
2000n + 10⁶ < 2n · 10⁶ ⇔ n > 10⁶ / (2 · 10⁶ − 2000) ≈ 0.5005, which holds for every n ≥ 1.
Hence there are fewer connections, as well as fewer nodes in the network, for any chosen n. The
greater n gets, the greater the decrease in connections. This describes a network whose training
time would significantly decrease if a second hidden layer were to be implemented instead of using
just a single layer MLP.
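Using the hypothetical helpers from the sketch in Section 4.1, this comparison is quick to verify for a concrete n (the value 784 here is just the 28x28 pixel example from Chapter 3):

```python
n = 784  # e.g. a 28x28 pixel input
print(n_connections([n, 10**6, n]))        # 2n * 10^6 = 1,568,000,000
print(n_connections([n, 1000, 1000, n]))   # 2000n + 10^6 = 2,568,000
print(n_routes([n, 10**6, n]) == n_routes([n, 1000, 1000, n]))  # True: same mapping ability
```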
One may ask, “Why doesn’t everyone just employ large multiple hidden layer MLPs?”. As
one may have also suspected, training an MLP with a large number of hidden layers also has its
problems. We will investigate these problems after first considering an example problem in which
2+ hidden layers truly are useful.
4.2 Example Problems for Multiple Hidden Layer MLPs
Generally we can rely on 2 hidden layer MLPs to solve most problems we can expect to come across
in which Neural Networks would be an ideal instrument for a solution. For example, finding obscure
patterns in huge data sets such as a pattern between how likely someone is to get a job compared
to the number of cups of coffee they have a week. However, it sometimes pays off to include a larger
number of layers.
4.2.1 Image Recognition
Let’s consider simple image recognition. If we consider a U.K. passport photo, its size is 45mm by
35mm [31] which equals 1575mm2
. The U.K. demands professional printing which could be variable
42
in pixels per millimetre ratios so basing this on the U.S.A requirements of picture quality, they
demand a minimum of 12 pixels per millimetre [32]. This gives 144 pixels per mm2
and therefore
a standard UK passport photo contains 226, 800 pixels at least. If the image recognition system
was using the passport photo database to recognise a headshot of a criminal they’re searching for
then the neural network would need 226, 800 input nodes, one for each pixel, to teach this system
how to do so. It would then require enough hidden nodes and layers to recognise the colours and
facial features etc. to accurately identify the person. With this construct the advantage to applying
multiple layers to allow this feature detection is clear to see. Reducing the number of connections
would be extremely useful without losing mapping complexity to allow quicker training and running
times.
4.2.2 Facial Recognition
Figure 4.8: A picture demonstrating the use of multiple hidden layers in a feedforward Neural Network

Facial recognition is a further step up from image recognition because it jumps from 2 dimensions
to 3 dimensions. Facial recognition requires a Neural Network to learn extremely complex shapes
such as the contours and structure of your face as well as eye, skin and lip colour. To teach a
Neural Network to recognise such complex shapes it must be taught to recognise simple features
like colour and lines before moving onto shapes and contours, which can be done layer by layer.
Figure 4.8 is an image constructed by Nicola Jones in an article called "Computer science: The
learning machines" from 2014 [33]. The images within are courtesy of an article by Honglak Lee
et al. [34]. The image from Jones briefly details how each layer of this Neural Network could be
seen as a feature detector, and the features to be detected get more complex the further into the
network we travel until eventually the Neural Network is able to construct faces. As can be seen
from the final image, the constructed faces are still lacking a significant amount of detail. This is
mainly due to the lack of understanding towards teaching a network such a task, but more layers
could again be included to allow even greater refinement.
4.3 Vanishing and Exploding Gradient Problem
In 1963, Arthur Earl Bryson et al. published an article regarding “Optimal Programming Problems
with Inequality Constraints” in which the first arguable claim to the invention of the Backpropaga-
tion algorithm can be assigned [35]. The theories made in this paper were being solely applied to
programming and not until 1974 did Paul J. Werbos apply it to Neural Networks in his PhD Thesis
named “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences” [10].
Unfortunately, the lack of computing power to make the application efficient enough for practical
use meant the Backpropagation algorithm went quiet until 1986, the year of the publication by
Rumelhart, Hinton and Williams, which brought a breakthrough in the efficiency of the algorithm
alongside an upturn in the computing power available to run it [11]. Then, an incredible 28
years on from the Bryson article, Sepp Hochreiter finished his Diploma thesis on the "Fundamental
Problem in Deep Learning" [36]. This thesis described the underlying issues with training large,
deep Neural Networks. These underlying issues defined significant research routes through the '90s
and '00s and, although usually applied to what are called recurrent Neural Networks (which can
be described as feedforward Neural Networks that also include the concept of time and therefore
allow loops within layers), we can apply this to our MLPs as they are a form of deep Neural Network.
The underlying issues are now famously known as the Vanishing and Exploding Gradient
Problems. The Backpropagation algorithm utilises gradient descent as the crux of its operation. As
the number of hidden layers increases, the weight updates early in the network involve products of
more and more gradient terms, so their size can shrink or grow exponentially with depth. This either
leads to a vanishing gradient, in which the weight updates become negligible towards the start of
the network, or an exploding gradient, in which the weight updates become exponentially bigger as
we reach the start of the network. The former causes issues with extracting critical features from
the input data, which leads to Underfitting of the training data, and the latter leaves the learning
algorithm entirely unstable and reduces the chances of finding a suitable set of weights.
Let’s investigate why these two problems occur within the Backpropagation algorithm and
understand that it is a consequence of using this learning algorithm. The backbone of this algorithm
boils down to iteratively making small adjustments to our weights in an attempt to find a minimum.
Ideally we would like to converge on the global minimum, but there is always the chance we will
converge on a local minimum. Considering the complexity of multiple hidden layer MLPs with
hundreds of nodes, chances are a local minimum will be found. First of all, let’s remind ourselves of
the weight update equations from Chapter 1:
∆w_{ij}^{(l)} = −η · ∂E_i / ∂w_{ij}^{(l)}   (4.3.1)

represents our actual weight update equation. This was then determined by its Error function
derivative term, which is:

∂E/∂w_{ij}^{(l)} = (∂E/∂a_j^{(l)}) · (∂a_j^{(l)}/∂in_j^{(l)}) · a_i^{(l−1)} = (∂E/∂a_j^{(l)}) · a_j′(in_j^{(l)}) · a_i^{(l−1)}

with:

∂E/∂a_j^{(l)} = a_j^{(L)} − t_j   if j is a node in the output layer
∂E/∂a_j^{(l)} = Σ_{k∈K} (∂E/∂a_k^{(l+1)}) · a_k′(in_k^{(l+1)}) · w_{jk}^{(l+1)}   if j is a node in any other layer   (4.3.2)

recalling that a_j^{(l)} is the activation of the j-th node in layer l, and a_j′(in_j^{(l)}) is the derivative
of the activation function of the j-th node in layer l with respect to the input to that node. For a
node not in the output layer, we can simplify this equation by defining the error of the j-th node in
layer l to be δ_j^{(l)}:

δ_j^{(l)} = (∂E/∂a_j^{(l)}) · a_j′(in_j^{(l)})
         = a_j′(in_j^{(l)}) · Σ_{k∈K} (∂E/∂a_k^{(l+1)}) · a_k′(in_k^{(l+1)}) · w_{jk}^{(l+1)}
         = a_j′(in_j^{(l)}) · Σ_k δ_k^{(l+1)} · w_{jk}^{(l+1)}   (4.3.3)

This is an especially important way of phrasing the weight adjustments in terms of these
gradients δ_j^{(l)}. We notice that the gradients of layer l depend on all gradients of the layers ahead of
it (i.e. layers l+1, l+2, ..., L). This is the cause of the Vanishing/Exploding Gradient Problem.
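To see Equation 4.3.3 drive the problem numerically, here is a rough sketch (Python with NumPy; the layer sizes, weight scale and stand-in node inputs are our own illustrative assumptions) that propagates a single output error back through many layers of logistic nodes and prints the typical size of δ at each layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 20, 10                          # nodes per layer, number of hidden layers

def logistic_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)                     # bounded above by 0.25

delta = np.array([1.0])                    # error at a single output node
W = [rng.normal(0, 0.5, (n, 1))]           # weights into the output layer
for _ in range(depth - 1):
    W.insert(0, rng.normal(0, 0.5, (n, n)))

for l, w in enumerate(reversed(W)):        # walk backwards from the output
    z = rng.normal(0, 1.0, w.shape[0])     # stand-in node inputs in_j
    delta = logistic_deriv(z) * (w @ delta)    # Equation 4.3.3
    print(f"layer L-{l+1}: mean |delta| = {np.abs(delta).mean():.2e}")
```

Each backward step multiplies in a derivative factor of at most 0.25 and a small weight, so |δ| typically shrinks geometrically; rerunning with large weights makes the same loop explode instead.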
In his soon-to-be-published book called "Neural Networks and Deep Learning" [37], Michael
Nielsen offers a simple example to demonstrate these problems. We will now adapt his example
here by considering a 3 hidden layer MLP:
x --w1--> σ --w2--> σ --w3--> σ --w4--> f(x)
Let’s calculate the change in error with respect to the first weight using our equations from above:
∂E
∂w1
= a (in
(1)
1 ) · w1 · a (in
(2)
2 ) · w2 · a (in
(3)
3 ) · w3 · a (in
(4)
4 ) · w4 · E · x (4.3.4)
where E is the error from the output node and for this case we have a(0)
= x representing the
output from the input layer. From this we can see that there are two very distinctive dependencies
here on the changes to w1; the weights which will have been randomly initialised and the choice
of our activation function. The weights can ultimately decide which problem we have; Vanishing
or Exploding. If our weights are small then the Vanishing Gradient Problem may occur, but if
we choose large weights then we can result in an Exploding Gradient Problem. Generally, the
Vanishing Gradient is more common due to our desire to minimise our parameters such as smaller
weights, normalised input data and as small an output error as possible.
However, let’s also consider the impact of the choice of activation function. In Chapter 1 we
looked at two in particular, the logistic function as used in Chapter 2 for our Example, and the
hyperbolic tangent which we have pretty much ignored until now. The impact of the activation
function’s derivative to the change of w1 could be significant so let’s consider the following figure:
(a) A graph representing the derivative of the logistic function. (b) A graph representing the derivative of tanh(x).
Figure 4.9: Graphs representing the derivatives of our two previously seen activation functions
Figure 4.9 shows the form of the derivatives of the logistic function and tanh(x). The logistic
function's derivative satisfies σ′(x) ≤ 0.25, with its maximum at x = 0. In comparison, the
hyperbolic tangent's derivative has a maximum value of 1, also at x = 0. If we consider weights that
are all defined such that w_i ≤ 1 for i ∈ {1, 2, 3, 4}, then with the logistic function as the
activation function:

|∂E/∂w_1| ≤ (1/4⁴) · |∂E/∂a^{(4)}| · |x|   (4.3.5)

which, considering we normalise our input data x ∈ [0, 1], is going to be a negligible change to the
weight unless our learning rate is huge. Recalling that we don't want to choose a high learning rate
because this causes serious problems with convergence on a minimum, it is simple to realise that
this problem is very much unavoidable. The larger maximum value of its derivative is one of the
reasons supporting the use of the hyperbolic tangent function, as well as it having the
qualifying conditions for the Universal Approximation Theorem.
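A quick numerical check of this bound (a hypothetical sketch with our own random weights and stand-in node inputs) multiplies out the factors of Equation 4.3.4 once with logistic derivatives and once with tanh derivatives:

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)           # at most 0.25

def tanh_deriv(z):
    return 1.0 - np.tanh(z)**2   # at most 1

w = rng.uniform(-1, 1, 4)        # |w_i| <= 1
z = rng.normal(0, 1, 4)          # stand-in node inputs in^(1)..in^(4)
x, out_err = 0.8, 0.1            # normalised input and output error term

for name, d in [("logistic", logistic_deriv), ("tanh", tanh_deriv)]:
    grad = x * np.prod(d(z)) * np.prod(w[1:]) * out_err  # Equation 4.3.4
    print(name, grad)
```

The four logistic derivative factors alone contribute at most (1/4)⁴ ≈ 0.0039, whereas the tanh factors can each be as large as 1; chaining more layers multiplies in more such factors.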
Although this was a very simple example to demonstrate the exponential problem to the
Backpropagation algorithm, it holds for much more complex networks in which the number of
hidden nodes and layers are increased. The greater the number of layers, the greater the impact of
the exponential weight change from the repeating pattern of the gradients.
Adding in more nodes and layers increases the complexity, and this can actually exacerbate
the problem. With increased complexity there is a chance of increasing the frequency and
severity of local minima, as noted by James Martens in a presentation from 2010 [38]. If
we have a greater number of local minima, then we have a greater chance of slipping into one. The
error may be large at such a minimum, but the gradient there will cause only a small change to
weights near the end of the network, which means even less of a change to weights in the early
layers of the network.
Therefore, as stipulated by Hochreiter originally in his thesis [36], the fundamental problem
with gradient descent based learning algorithms like the Backpropagation algorithm is its charac-
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 

2.3.5 Run Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Universal Approximation Theorem 25
3.1 Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Sketched Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Overfitting and Underfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff . . . . . . . 31
3.3.2 Applying Overfitting and Underfitting to our Example . . . . . . . . . . . . . 33
3.3.3 How to Avoid Overfitting and Underfitting . . . . . . . . . . . . . . . . . . . 35
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 Multiple Hidden Layer MLPs 38
4.1 Motivation for Multiple Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Example Problems for Multiple Hidden Layer MLPs . . . . . . . . . . . . . . . . . . 42
4.2.1 Image Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Facial Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Vanishing and Exploding Gradient Problem . . . . . . . . . . . . . . . . . . . . . . . 44
4.3.1 Further Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5 Autoencoders and Pre-training 50
5.1 Motivation and Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 The Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 What is an Autoencoder? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.2 Training an Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.3 Dimensionality Reduction and Feature Detection . . . . . . . . . . . . . . . . 51
5.3 Autoencoders vs Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.1 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 The Denoising Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Stacked Denoising Autoencoders for Pre-training . . . . . . . . . . . . . . . . . . . . 54
5.5 Summary of the Pre-training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6 Empirical Evidence to Support Pre-training with Stacked Denoising Autoencoders . 59
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6 Conclusion 61
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

A Single Layer MLP Python Code 68

B Python Output Figures from Section 2.3.4 73
Chapter 1

An Introduction to Neural Networks

1.1 What is a Neural Network?

Neural Networks (strictly speaking, 'artificial' Neural Networks, henceforth referred to as ANNs) are so called because they resemble the mammalian cerebral cortex [1]. The nodes (sometimes called neurons, units or processing elements) of an ANN represent the neurons of the brain, and the weighted interconnections between nodes symbolise the communicative electrical pulses. The following image represents a basic, directed ANN:

[Figure 1.1: An example of a simple directed ANN — input, hidden and output layers, with arrows denoting weighted connections]

As shown above, ANNs consist of an input layer, some hidden layers and an output layer. Depending on the use of the network, one can have as many nodes in each layer, and as many hidden layers, as desired (although, as we shall see later, too many hidden layers or nodes bring their own disadvantages).
One use of such a network is image recognition. Say you have an image and you would like to classify it: for example, is there an orange in this picture or a banana? An image is made up of a number of pixels, each with an associated RGB (red-green-blue) colour code. We can define our network to have the same number of input nodes as there are pixels in the image, allowing one and only one pixel to enter each input node. The hidden nodes process the information received via the weighted connections, potentially picking out which colour is most prominent across the entire image, and then hopefully the output gives you the correct label: banana or orange. This idea can be extrapolated to handwriting recognition, such as the technology available in the Samsung Galaxy Note, which allows a user to write with a stylus and recognises these inputs as letters to produce computerised documents.

This is where machine learning comes in. ANNs are built to adapt and learn from the information they are given. If we extend the orange-banana example, we could teach our network to tell the difference between a banana and an orange by giving it an arbitrary number of images containing a banana or an orange, and telling the network what the target label is. This set of images would be called a training set, and this method of learning would be called supervised learning (i.e. we give the network an input and a target). In this example, the network will likely home in on the colour difference to deliver its verdict. Once trained, any future image given to the network should receive a reliable label.

1.2 Uses of Neural Networks

Neural Networks are everywhere in technology. Some examples in addition to the image recognition above include:

1. The autocorrect on smartphones. Neural Networks learn to adapt to the training set given to them and therefore, if storable on a smartphone, have the ability to adapt a dictionary to a user, like autocorrect.

2. Character recognition. This is extremely popular with the idea of handwriting with a stylus on tablets and phones these days, as mentioned before.

3. Speech recognition. This has become more powerful in recent years; Bing has utilised Neural Networks to double the speed of its voice recognition on Windows phones [2].

4. A quirky use includes, and I quote, "a real-time system for the characterisation of sheep feeding phases from acoustic signals of jaw sounds" [3]. This was an actual research article in the Australian Journal of Intelligent Information Processing Systems (AJIIPS), Vol 5, No. 2, in 1998, by Anthony Zaknich and Sue K. Baker. Eric Roberts' Sophomore Class of 2000 reported online that radio microphones attached to the sheep's head allow chewing sounds to be transmitted and, by comparing these with the time of day, allow an ANN to predict future eating times [4]. If anything, this demonstrates the versatility of ANNs.

5. Generally, a Neural Network can take a large number of variables which appear to have no conceivable pattern and find associations or regularities. This could be something extremely unusual, like football results coinciding with a person watching (or not watching) the game.
1.3 The Perceptron and the Multi-Layer Perceptron (MLP)

1.3.1 Perceptron

The perceptron is the simplest ANN, consisting of only an input and an output layer. It was first conceived by Rosenblatt in 1957 [5], and was named as such because its purpose was to model perceptual activities, for example the responses of the retina, which was the basis of Rosenblatt's research at the time. Quoted from a journal entry by Hervé Abdi from 1994 [6]:

"The main goal was to associate binary configurations (i.e. patterns of [0, 1] values) presented as inputs on a (artificial) retina with specific binary outputs. Hence, essentially, the perceptron is made of two layers: the input layer (i.e., the "retina") and the output layer."

The perceptron can successfully fulfil this need and works as follows:

[Figure 1.2: The perceptron with two input nodes x_1, x_2, weighted connections w_{1j}, w_{2j}, and one output node computing in_j = \sum_i x_i w_{ij} with output o_j = a_j]

where:

• in_j is the total input of the j-th node, which in this case is the sum of the inputs multiplied by their respective connection weights:

in_j = \sum_i x_i w_{ij}   (1.3.1)

• x_i is the input of the i-th connection

• w_{ij} is the weight of the connection from node i to node j

• o_j is the output of the j-th node

• a_j is some activation function to be defined by the programmer, e.g. \tanh(in_j). This becomes more apparent in an MLP, where there is a greater number of layers. Usually, with one output node as in Figure 1.2, o_j = in_j.

Notice that there are two input nodes because the intended inputs were binary configurations of the form [0, 1]. If the data is linearly separable, the perceptron convergence theorem proves that a solution can always be found. This proof is omitted (but can be found via the link in the bibliography [7]). However, the output node is a linear combination of the input nodes, and hence the perceptron can only differentiate between linearly separable data. This is a severe limitation.
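As a concrete illustration of Equation 1.3.1, the following minimal Python sketch (illustrative only, not code from this report) shows a perceptron with a simple step activation computing logical OR, a linearly separable function for which suitable weights do exist:

import operator

def perceptron(x, w, activation):
    # Equation (1.3.1): total input is the weighted sum of the inputs.
    in_j = sum(map(operator.mul, x, w))
    return activation(in_j)

step = lambda z: 1 if z > 0 else 0  # a simple threshold activation

w = [1.0, 1.0]  # illustrative weights solving logical OR
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(x, w, step))  # fires unless both inputs are 0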
We will use the non-linearly separable logical XOR function as an example to show that non-linearly separable functions have no perceptron solutions. The logical XOR function is defined as:

[0, 0] → [0]
[0, 1] → [1]
[1, 0] → [1]
[1, 1] → [0]   (1.3.2)

We can see it is not linearly separable from the following diagram:

[Figure 1.3: There is no way to separate the grey dots (value [0], at (0,0) and (1,1)) from the black dots (value [1])]

The original proof showing that perceptrons could not learn non-linearly separable data was by Marvin Minsky and Seymour Papert, published in a book called "Perceptrons" in 1969 [8]. However, the following proof of the perceptron's inability to differentiate between non-linearly separable data is quoted from Hervé Abdi's journal entry in the Journal of Biological Systems in 1994, page 256 [6]:

Take a perceptron of the form of Figure 1.2 and define our weights as w_1 and w_2. The inputs are of the form [x_1, x_2]. The association of the input [1, 0] → [1] implies that:

w_1 > 0   (1.3.3)

The association of the input [0, 1] → [1] implies that:

w_2 > 0   (1.3.4)

Adding together Equations 1.3.3 and 1.3.4 gives:

w_1 + w_2 > 0   (1.3.5)

Now if the perceptron gives the response 0 to the input pattern [1, 1], this implies that:

w_1 + w_2 ≤ 0   (1.3.6)

Clearly, the last two equations contradict each other, hence no set of weights can solve the XOR problem.

Due to the severe limitations of the perceptron, the multi-layer perceptron (MLP) was introduced.
1.3.2 Multi-Layer Perceptron (MLP)

The MLP is an extension of the perceptron. Hidden layers are now added between the input and output layers to gift the ANN greater flexibility in its calculations. Each additional weighted connection is another parameter, which allows for more complex problems and calculations where linear solutions simply cannot help. Every aspect of an MLP is the same as a perceptron, including the underlying node functions, but most importantly an MLP allows non-linearly separable data to be processed.

An MLP features a feedforward mechanism. A feedforward ANN is one which is directed such that each layer may only send data to following layers, never within its own layer. Additionally, the MLP is fully connected from one layer to the next. The following diagram is one of the simplest possible MLPs for the logical XOR function, taken from a Rumelhart et al. report from 1985 [9]:

[Figure 1.4: An example of an MLP solution to the logical XOR function. Inputs x_1 and x_2 each feed the hidden unit and the output unit with weight +1; the hidden unit feeds the output unit with weight -2; the hidden and output units carry the thresholds 1.5 and 0.5 respectively. Explanation in the following text.]

All of these nodes work exactly like Figure 1.2: each node takes the sum of its inputs multiplied by the weights on the connections. However, you will notice that two of the nodes contain numbers. This indicates the output function on these specific nodes, o_j = f(in_j) and o_k = g(in_k). Sometimes called the threshold function, the hidden unit's output is essentially the following:

f(in_j) = \begin{cases} 1 & \text{if } in_j > 1.5 \\ 0 & \text{if } in_j < 1.5 \end{cases}   (1.3.7)

Similarly, the threshold function for the output unit is:

g(in_k) = \begin{cases} 1 & \text{if } in_k > 0.5 \\ 0 & \text{if } in_k < 0.5 \end{cases}   (1.3.8)

The threshold function effectively decides whether the node should be activated, and thus the node's output function is referred to as an activation function. The superiority of the MLP over the simple perceptron is immediately clear: having added just one extra unit to the architecture, we obtain a successful ANN which solves the logical XOR function, something the perceptron cannot do.
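This network is small enough to verify by hand, or in a few lines of Python. The following minimal sketch encodes the weights and thresholds read off Figure 1.4 (it is an illustration, not code from this report):

def threshold(total, theta):
    """Threshold activation: fire (1) only if the total input exceeds theta."""
    return 1 if total > theta else 0

def xor_mlp(x1, x2):
    h = threshold(x1 + x2, 1.5)             # hidden unit, threshold 1.5
    return threshold(x1 + x2 - 2 * h, 0.5)  # output unit, threshold 0.5

for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, "->", xor_mlp(*pattern))
# (0, 0) -> 0, (0, 1) -> 1, (1, 0) -> 1, (1, 1) -> 0

The hidden unit fires only on the input [1, 1], and its -2 connection then cancels the two +1 contributions at the output, which is exactly what distinguishes XOR from OR.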
Due to the greater flexibility of the MLP over the perceptron, it is difficult to see immediately the best values for the weights, and sometimes even how many hidden nodes are ideal. Typically, an ANN learns by example, and there are a number of ways to train MLPs to find the best values (such as dropout, unsupervised learning, autoencoders etc.). Before considering training methods, let's first understand the inner calculations of an MLP.

1.4 The Workings of an MLP

I have described how the nodes calculate the total input in_j and explained, with notation, the idea of a threshold function as an activation function. Although a threshold function can be extremely useful in certain situations, the activation functions commonly used are one of the two following sigmoids:

a_j(in_j) = \frac{1}{1 + e^{-in_j}}   (logistic function)   (1.4.1)

and

a_j(in_j) = \tanh(in_j)   (hyperbolic tangent)   (1.4.2)

These functions are usually applied to the hidden layer nodes, but can be applied to the output nodes as desired. It should be noted that the logistic function is bounded between 0 and 1, and the hyperbolic tangent between -1 and 1. This is especially important when initialising the weights of a network: if you expect a large output, you may need large weights to compensate for the bounded activation function. Using such functions allows non-linear solutions to be produced, which in turn allows a more competent and functional ANN.

Now that our networks can use these activation functions that allow non-linear solutions, it is helpful to add nodes which allow for a linear change in the solution, i.e. bias nodes. The action of such a node is to add a constant input from one layer to the following layer. It is connected like all nodes, via a weighted connection, to allow the network to correct its influence as required. The bias can take any value, but a value of 1 is common. The notation we shall use to represent a bias node is b, used much like the threshold function notation, i.e. the b shall be written within the node on a diagram.
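Putting these ingredients together — weighted sums, a logistic hidden layer and bias nodes of value 1 — a forward pass can be sketched in a few lines of Python. The architecture and every weight value here are illustrative assumptions, not taken from this report's code:

import math

def logistic(z):
    # Equation (1.4.1): bounded between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One logistic hidden layer plus bias nodes (bias value 1), linear output."""
    hidden = [logistic(sum(w_i * x_i for w_i, x_i in zip(w_node, x)) + b)
              for w_node, b in zip(w_hidden, b_hidden)]
    # Output node: weighted sum of the hidden outputs plus its own bias.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# One input, two hidden units; all weights chosen arbitrarily for illustration.
print(forward([0.3], [[1.2], [-0.7]], [0.1, 0.1], [0.5, 0.8], 0.05))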
There are a number of ways to train an ANN, and we will now investigate the most commonly used training algorithm, Backpropagation, which is short for "backpropagation of errors". ANNs which employ this algorithm are usually referred to as Backpropagation Neural Networks (BPNNs).

1.4.1 The Backpropagation Training Algorithm

The Backpropagation algorithm was first applied to Neural Networks in a 1974 PhD thesis by Paul J. Werbos [10], but its importance was not fully understood until David Rumelhart, Geoffrey Hinton and Ronald Williams published a 1986 paper called "Learning representations by back-propagating errors" [11]. It was this paper that detailed the usefulness of the algorithm and its functionality, and it was these men who sparked a comeback for ANNs within the neural network community, inspiring successful deep learning (i.e. successful learning of ANNs with hidden layers). They were able to show that the Backpropagation algorithm could train a network so much faster than earlier learning algorithms that previously unsolvable problems became solvable. The large increase in efficiency meant a massive set of training data was no longer essential, allowing for a more attainable training set for such problems.

The remainder of this subsection is a detailed description of the Backpropagation training algorithm. It is heavily based upon the 1986 Rumelhart, Hinton and Williams article mentioned above [11]; however, notation has been altered for consistency and explanations have been expanded. It should also be noted that one is not expected to understand the following equations immediately: naturally, it takes significant experience and examples to fully understand their meaning.

Our aim is to find appropriate weights such that each input vector delivered to the ANN results in an output vector sufficiently close to the corresponding target output vector, for the entirety of our training data. The ANN will then have the ability to "fill in the gaps" between our training samples, providing a smooth, accurate function fitted to purpose.

Defining the input layer as layer 0, let's define each individual weight with the notation w^{(l)}_{ij}, where l represents the layer the weighted connection is entering (i.e. a weight with index l = 1 enters a node in layer 1), i represents a node in layer l - 1 and j represents a node in layer l. Thus the total input for each node, defined as the function in Equation 1.3.1, can be rewritten as:

in^{(l)}_j = \sum_i a^{(l-1)}_i w^{(l)}_{ij}   (1.4.3)

where a^{(l-1)}_i represents the activation function output from node i in layer l - 1. This algorithm eventually differentiates the activation function to instigate the fundamental concept of the Backpropagation algorithm, gradient descent; any activation function can be used, as long as it has a bounded derivative. Also note that the resulting value of the activation function for a node is the respective output of that node.

Finally, calling our ultimate layer l = L, we shall define the output vector of the network as a^{(L)}(x_i), where x_i is the corresponding input vector and this a still refers to the activation function. This yields our error function of the network for a single training sample:

E_i = \frac{1}{2} \left\| t_i - a^{(L)}(x_i) \right\|_2^2   (1.4.4)

where i represents an input-target case from our training data and t_i is the respective target vector for x_i.

The idea of gradient descent is to take advantage of the chain rule by finding the partial derivative of the error, E_i, with respect to each weight w^{(l)}_{ij}; this will then help minimise E. We then update the weights as:

\Delta w^{(l)}_{ij} = -\eta \frac{\partial E_i}{\partial w^{(l)}_{ij}}   (1.4.5)

for some suitable value of η (referred to as the learning rate). The learning rate is there to control the magnitude of the weight change, and the negative sign is by convention, indicating that the direction of change should be towards a minimum and not a maximum.
Ideally, we want to minimise the error via the weight changes to get an ideal solution. To see this equation visually, let's imagine we have a special network in which the Error function depends on only one weight. Then the Error function could look something like this:

[Figure 1.5: An example Error function which depends on one weight]

To get the smallest possible E we need to find the global minimum, and it is possible for there to be many local minima that you want to avoid. The idea behind the learning rate is to find a balance between converging on the global minimum and "jumping" out of or over the local minima. Too small a learning rate and you might remain stuck, as illustrated by the orange dot; too big and you may overshoot the global minimum entirely, as shown by the blue dot.

Unfortunately, it is difficult to see the ideal learning rate from the outset, and it is commonplace to use trial and error to find the optimal η. A regular starting point would be a value between 0.25 and 0.75, but it is not unusual for it to be as small as 0.001 if you have a simple function to approximate.
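The learning-rate trade-off can be seen numerically with a tiny sketch of gradient descent on a one-weight Error function; the quadratic E(w) = (w - 2)^2 and the two learning rates below are illustrative assumptions, not figures from this report:

# Gradient descent on E(w) = (w - 2)**2, whose global minimum is at w = 2;
# the gradient is dE/dw = 2 * (w - 2).

def descend(eta, w=0.0, epochs=20):
    for _ in range(epochs):
        w += -eta * 2 * (w - 2)  # Equation (1.4.5) in one dimension
    return w

print(descend(eta=0.1))   # approaches the minimum at w = 2
print(descend(eta=1.05))  # too large: every step overshoots and E grows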
Now to find the actual values in Equation 1.4.5. First of all we need to calculate \partial E / \partial w^{(l)}_{ij} for our output nodes. For each output node we have the error:

\frac{\partial E}{\partial a^{(L)}_j} = -\left( t_j - a^{(L)}_j \right)   (1.4.6)

where j corresponds to the j-th output node, recalling that a^{(L)}_j is the output of the j-th output node. This is simple for the output layer of the ANN: it is just the output of the node minus the target of the node. Calculating for previous layers becomes more difficult.

The next layer to consider is the penultimate layer of the ANN. Recalling the total input for a node:

in^{(l)}_j = \sum_i a^{(l-1)}_i w^{(l)}_{ij}   (1.4.7)

where a^{(l-1)}_i is the output of the i-th node in layer l - 1, we calculate for our penultimate layer the following:

\frac{\partial E_i}{\partial w^{(L-1)}_{ij}} = \frac{\partial E}{\partial in^{(L-1)}_j} \frac{\partial in^{(L-1)}_j}{\partial w^{(L-1)}_{ij}} = \frac{\partial E}{\partial in^{(L-1)}_j} a^{(L-2)}_i   (1.4.8)

We must now find the value of \partial E / \partial in^{(L-1)}_j, which is as follows:

\frac{\partial E}{\partial in^{(L-1)}_j} = \frac{\partial E}{\partial a^{(L-1)}_j} \frac{\partial a^{(L-1)}_j}{\partial in^{(L-1)}_j} = \frac{\partial E}{\partial a^{(L-1)}_j} a'_j(in^{(L-1)}_j)   (1.4.9)

where a'_j(in^{(L-1)}_j) is the derivative of the chosen activation function. Recalling that the activation function depends only on the respective node's input (for example, the hyperbolic tangent activation function from Equation 1.4.2 was just a_j = \tanh(in_j)), this is easily calculated. The layer of a_j is the same as the layer associated with in_j (i.e. the layer the input is entering), but we remove this index from a_j to avoid a bigger mess of indices.

From here until the end of the subsection, credit must also be given to R. Rojas [12] and A. Venkataraman [13] in addition to Rumelhart et al. [11]. Now we just need to calculate \partial E / \partial a^{(L-1)}_j in the above equation. Taking E as a function of the inputs to all nodes k in the set K = {1, 2, ..., n} receiving input from node j:

\frac{\partial E}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial in^{(L)}_k} \frac{\partial in^{(L)}_k}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial a^{(L)}_k} \frac{\partial a^{(L)}_k}{\partial in^{(L)}_k} \frac{\partial in^{(L)}_k}{\partial a^{(L-1)}_j} = \sum_{k \in K} \frac{\partial E}{\partial a^{(L)}_k} a'_k(in^{(L)}_k) w^{(L)}_{jk}   (1.4.10)

This same formula can be used for weights connecting to layers before the penultimate layer, and thus we can now find how to change any weight in the whole network. We can therefore conclude the following:

\frac{\partial E}{\partial w^{(l)}_{ij}} = \frac{\partial E}{\partial a^{(l)}_j} \frac{\partial a^{(l)}_j}{\partial in^{(l)}_j} a^{(l-1)}_i = \frac{\partial E}{\partial a^{(l)}_j} a'_j(in^{(l)}_j) a^{(l-1)}_i

with:

\frac{\partial E}{\partial a^{(l)}_j} = \begin{cases} \left( a^{(L)}_j - t_j \right) & \text{if } j \text{ is a node in the output layer} \\ \sum_{k \in K} \frac{\partial E}{\partial a^{(l+1)}_k} a'_k(in^{(l+1)}_k) w^{(l+1)}_{jk} & \text{if } j \text{ is a node in any other layer} \end{cases}   (1.4.11)

The appropriate terms are then substituted into Equation 1.4.5, allowing for full training of the system.
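Collected into code, one Backpropagation update might look like the following minimal sketch. The architecture (one input node, three logistic hidden nodes, a linear output node), the repeated single training pair, and the omission of bias nodes are all illustrative simplifications; this is not the Chapter 2 program:

import math
import random

def logistic(z):
    # Equation (1.4.1).
    return 1.0 / (1.0 + math.exp(-z))

H, eta = 3, 0.5
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]  # input -> hidden
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]  # hidden -> output

def train_step(x, t):
    """One Backpropagation update for a single (input, target) case."""
    # Forward pass: Equation (1.4.3), then the linear output a^(L).
    h = [logistic(w * x) for w in w1]
    y = sum(w_k * h_k for w_k, h_k in zip(w2, h))
    # Output layer, Equation (1.4.11) top case: dE/da^(L) = y - t.
    d_out = y - t
    # Hidden layer, Equation (1.4.11) bottom case, with the logistic
    # derivative a'(z) = a(z) * (1 - a(z)).
    d_hid = [d_out * w_k * h_k * (1.0 - h_k) for w_k, h_k in zip(w2, h)]
    # Weight updates, Equation (1.4.5).
    w2[:] = [w_k - eta * d_out * h_k for w_k, h_k in zip(w2, h)]
    w1[:] = [w_k - eta * d_k * x for w_k, d_k in zip(w1, d_hid)]
    return 0.5 * (t - y) ** 2  # Equation (1.4.4) for this single case

for _ in range(1000):
    err = train_step(0.5, 0.25)
print(err)  # the error shrinks as training proceeds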
The time it takes the network to run through all input-target cases once is defined as an epoch. The most common way to update the weights is after each epoch. After a pre-defined number of epochs the network will stop training. This could be any number, but 1000 is a good stopping point: it prevents the system taking too long to train while still allowing plenty of time for convergence to the ideal solution.

1.4.2 Initial Setup of a Neural Network

An ANN will be programmed into a computer, and could be written in any language. This is fortunate, because we don't have to calculate the gruelling equations of the previous subsection ourselves. In the following chapter we will see an example of an MLP with its own solutions and the problems it faces, but there are a few things we should consider first. From the outset it is not always clear how exactly to initialise the network: you will find it difficult to guess the ideal weights, and it may be difficult to estimate the number of epochs to run through before ceasing training. Here are a few ideas to help get your head around it.

• It is ideal to set up your network with weights randomised between certain values. The random number generator should not be given a fixed origin (i.e. should not be seeded with a constant in the programming language), because this allows each run-through to give a different set of results, which is ideal for finding the best possible setup for your ANN. You can only estimate your weights based upon previous examples you may have seen and the expected outputs from your network.

• With regards to the epochs, the right number will depend entirely upon the size of your training set and the power of your computer. The more epochs the better for converging networks, but not so many that you have to wait too long for a solution: time constraints, as well as data constraints, are what caused ANNs to fall out of popularity in the '70s.

• The learning rate is a difficult one to guess, but it is best to start small, because you are then almost guaranteed to converge to some minimum given enough epochs. Learning rates that are too large have the opportunity to find the global minimum, but can jump back out of it too.

• Bias nodes are recommended as one per layer (except the output layer). You won't need more than one per layer, since the network can adjust the influence of the bias through its weight. This allows a lot more freedom for error correction.

• With regards to the training data, you will want to set aside perhaps 10-20% of it to test the network with once it is trained. This helps to establish how accurate the network is. The training data should also be normalised to lie within the interval [-1, 1]. This helps stabilise the ANN with regards to the activation functions used and allows for smaller weights. Smaller weights mean greater accuracy: thinking back to how the weights are adjusted, a bigger weight is more likely to be adjusted by a significant percentage, and with a lot of weights to adjust this can impact convergence and training time.

The next chapter focusses on an example network written in Python and the problems faced in finding the convergence on the global minimum we need for an accurate network using the MLP architecture.
Chapter 2

A Single Layer MLP for Function Interpolation

2.1 Aim of this Example

The aim is to teach a single layer feedforward MLP to accurately map x → x² ≡ f(x), where x ∈ [0, 1] ⊂ R, using the Backpropagation algorithm for training.

2.1.1 Software and Programming Used

The code has been developed in Enthought Canopy [14] and written in Python, using the version distributed by the Enthought Python Distribution [15]. The code, heavily based on a Backpropagation network originally written by Neil Schemenauer [16], is attached in Appendix A, fully annotated.

2.2 Method

1. The ANN is to learn via the Backpropagation algorithm as discussed in Section 1.4.1 and therefore needs training data. The training data used is the following:

x ∈ {0, 0.1, 0.2, ..., 1} and their respective targets f(x) = x²   (2.2.1)

For simplicity, the entire training set is used to train the network and then used once again to test the network. Python outputs these test results once the network completes its training, in the format "([0], → , [0.02046748392039354])" for each input datum. A graph is then generated comparing the network's output at 100 equally spaced points for x ∈ (0, 1) with the function f(x) = x².

2. The sigmoid function for this network is the logistic function from Equation 1.4.1.

3. The learning rate is set to η = 0.5.

4. The number of epochs is set to 10000. The code tells Python to print the current error of the network, to 9 decimal places, at every 1000th epoch.

5. The weights are initialised using the seed() function from the random module [17]. This function is a pseudo-random number generator using a Gaussian distribution with standard deviation 1 and a mean based upon the system time when the network is run. The weights from the input nodes to the hidden nodes are specified to be randomly distributed in the interval (-0.5, 0.5), and similarly the weights connecting hidden and output nodes are randomly distributed in the interval (-5, 5). The latter weights have a greater randomisation range because the logistic function has an upper bound of 1; thus, for larger outputs we need larger weights, and from experimentation these values can provide very accurate results.
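As a rough illustration of this setup (not the Appendix A program itself), the initialisation might look like the sketch below. Uniform draws on the stated intervals are used here for simplicity, and the choice of two hidden nodes anticipates the architecture described next:

import random

random.seed()  # seeded from system state: each run-through differs

n_hidden = 2
# Input -> hidden weights drawn from (-0.5, 0.5); hidden -> output weights
# from the wider interval (-5, 5), compensating for the logistic
# function's upper bound of 1.
w_in_hidden = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_hidden_out = [random.uniform(-5.0, 5.0) for _ in range(n_hidden)]

# Training pairs for f(x) = x**2 on {0, 0.1, ..., 1} (Equation 2.2.1).
training = [(k / 10, (k / 10) ** 2) for k in range(11)]
print(w_in_hidden, w_hidden_out, training[:3])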
The network structure of this ANN is based upon an example from Kasper Peeters' unpublished book, Machine Learning and Computer Vision [18], and takes the following architecture:

[Architecture diagram: an input node x and a bias node b = 1 in the input layer (l = 0); two hidden nodes with logistic activation σ and a bias node b = 1 in the hidden layer (l = 1); a single output node f(x) in the output layer (l = L = 2)]

where b = 1 indicates that the node outputs a bias of value 1, σ indicates that the node's activation function is the logistic function, and recalling that l references the layer of the network, with l = L corresponding to the final layer. It should be noted that many different structures could be used, for example the inclusion of a greater number of hidden nodes, but this simple structure succeeds.

2.3 Results and Discussion

The following are examples of results obtained from running the Neural Network program.

2.3.1 Great Result

[Figure 2.1: Example of a great learning result]

Figure 2.1 shows a graph that plots x against f(x) = x². The magenta line represents x² and the black dotted line represents the network's prediction for the 100 equidistant points between 0 and 1 after training. Generally, a result like this is generated from a final error of less than 2 × 10^-4. This is an especially successful case, as can be seen in Figure 2.2: the figure shows the network's results for the test data, and delightfully each gives the correct result when rounded to two decimal places.
[Figure 2.2: The "Great Result's" respective Python output]

Turning to the respective Python output in Figure 2.2, it is interesting to note the size of the weights. They are all relatively small (less than 5). However, the input to hidden weights have significantly increased (recalling that they were initialised randomly in the interval (-0.5, 0.5)), while the hidden to output weights have relatively decreased (initialised in (-5, 5)). Fascinatingly, if the initialisation of the weights were switched so that the input to hidden weights were also randomised in the interval (-5, 5), the network becomes significantly more unreliable, to the extent that in 100 attempts no run had error less than 2 × 10^-4. But why? Presumably this is because the increased weight size significantly increases the inputs to the hidden nodes. This means the logistic function's output is generally larger, which leads to a larger network output, further away from our target data. This causes larger error, and therefore a greater magnitude of change in the weights of the network, which can affect the network's ability to converge on the global minimum. This indicates that initialising the weights to be generally smaller allows for a more stable training algorithm.

Another notable detail is the first error output. It is relatively small, at just over 3, and this plays a part in comparisons with the other results.

2.3.2 Incredulous Result

[Figure 2.3: Example of an incredulous result]

This solution is clearly not ideal, but it does provide some very interesting insight into the problems faced when teaching an ANN. Figure 2.3 appears to be accurate for only 2 of the 100 points tested on the trained network. However, positively, this type of result occurred only once in 100 runs. A result like this can only be generated from a large final error, with the training beginning with initial convergence but ultimately diverging. In this case, it causes the network to train to the shape of the logistic function, the sigmoid function chosen for this network. This figure also shows that the bulk of the 100 points tested give an output between 0 and 0.5.
[Figure 2.4: The "Incredulous Result's" respective Python output]

Referring to Figure 2.4, the results for the test data show that for inputs of 0.7 and above the network significantly overestimates the intended output, while the test data for 0.6 and under are generally underestimated. This produces the significant bias towards a smaller output from the network. In comparison to the aforementioned "Great Result", the initial output error for this run-through was significantly higher: an increase of over 900%! But why such a different result?

Both networks are initialised with random weights, which means both begin with a different Error function. This means each network is attempting to find a different global minimum using the same learning rate. Although the learning rate η was ideal for the "Great Result", it was clearly not ideal for this initial set of weights, for which the learning rate was unable to escape a local minimum in the Error function. Alternatively, the weight adjustment calculation of Equation 1.4.5 gives a large partial derivative term. This causes a huge change in weights, which in turn causes the error to jump away from the global minimum we're aiming for. It is difficult to tell whether this meant the learning rate was too large or too small. However, if we compare these theories to the error output in Figure 2.4, we see the error converges initially but then starts to diverge. The initial weight change allowed us to get close to a minimum, but the ultimate divergence suggests this was a local minimum, and hence the learning rate was in fact too small to overcome the entrapment of this trough. As discussed before, increasing the learning rate risks jumping away from the global minimum altogether. If the number of epochs were increased for this run-through, the local minimum could eventually be overcome, but the convergence is too slow for this to be worthwhile. It is for this reason that, in practice, multiple run-throughs of the same network are undertaken.

One should also notice the difference in weights between the two results. The "Great Result" had smaller weights, suggesting a stable learning curve. This "Incredulous Result" did not have a stable learning curve, due to the divergence, and this is clear in the significant weight increases. All the input to hidden weights in Figure 2.4 are greater than the respective weights in Figure 2.2.
This demonstrates the idea that larger initialised weights do not lead to more accurate results, and actually lead to a more unstable network. Furthermore, the hidden to output weights for the "Incredulous Result" are very small, which leads to the bulk of the network outputs lying between 0 and 0.5.

2.3.3 Common Poor Result

[Figure 2.5: Example of a poor learning result]

This next result is almost as common as the "Great Result". It seems that, despite the implications of the "Incredulous Result", this Neural Network setup is naturally able to predict outputs with greater accuracy and consistency nearer x = 1 than x = 0, and thus we regularly get a tail below f(x) = 0 when approaching x = 0. Generally for function interpolation, the network will attempt to find a linear result to correlate the training data to its target data, and therefore the network can regularly attempt to draw a line of best fit, due to the initialised weights, as opposed to the quadratic curve we require. The logistic function gives us the ability to find non-linear solutions, but if the total inputs to the logistic nodes become too large then the output can lie on the tails of the logistic function, which are effectively linear, as shown:

[Figure 2.6: The logistic function]

This once more demonstrates the desirability of initialising smaller weights for a network, as well as normalising training data.
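A quick numerical check makes the saturation visible (the sample points below are illustrative):

import math

logistic = lambda z: 1 / (1 + math.exp(-z))

for z in (0.0, 1.0, 10.0, 11.0):
    print(z, round(logistic(z), 6))
# 0.0 -> 0.5 and 1.0 -> 0.731059: near the origin the output responds to z.
# 10.0 -> 0.999955 and 11.0 -> 0.999983: out on the tail the output is
# essentially constant, so a node with a large total input behaves like a
# fixed bias rather than a useful non-linearity.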
[Figure 2.7: The poor result's respective Python output]

The test data results in Figure 2.7 show that for inputs greater than 0.2 the network output has consistently overestimated, which corresponds to Figure 2.5. This can again be associated with the starting error. The initialised weights were accurate enough to allow the network to fall almost instantly into a trough near a minimum. Unfortunately, the lack of any real convergence to a smaller error suggests that this was a local minimum. The initial error was so small that Δw^{(l)}_{ij} could be negligible for some of the weights. This can cause the network to get stuck in a local minimum, unable to jump out, leaving no chance of finding the global minimum.

As was the case with the "Incredulous Result", there are some weights which are very large relative to the "Great Result's" respective weights. Once more, this can cause instability in the learning algorithm, which stagnates convergence. If one weight is much larger than the others (which is the case here), it controls the dominant proportion of the input to the following node, almost making all other inputs obsolete. This gives a lot of bias to one input and therefore makes accurate training a lot more difficult for the smaller weights around it. Fortunately, this is likely down to the initialised weights: one weight could have been randomised much higher relative to the others, which causes this downfall. This highlights, once more, the importance of the range in which the weights are randomised.
2.3.4 Other Interesting Results

[Figure 2.8: Two more examples of an inaccurate interpolation by the network. (a) A network whose error converged for the first step, but diverged slowly from then on. (b) A network whose error started very small and converged slowly. Their respective Python outputs are placed in Appendix B.]

Figure 2.8a shows accuracy for half the data, but after the error started increasing halfway through training, the network tries to correct itself almost through a jump. This occurs because the learning rate is now too small to escape the local minimum within the restricted number of epochs, and thus the error continues to increase. The algorithm terminates before escaping the local minimum and yields a generally large error for the system of 3.1 × 10^-3. We can rectify this by increasing the learning rate slightly, but overall it comes down to the initialised weights.

Figure 2.8b describes a network in which the initialised weights gave a relatively small first error of roughly 0.14. The learning rate η is small, and the changes in error with respect to the weight changes are small. This means Δw^{(l)}_{ij} from Equation 1.4.5 will be very small, and thus convergence will be slow. This can be overcome by an increased learning rate or an increased number of epochs but, once again, the most important factor is the weight initialisation.

2.3.5 Run Errors

Occasionally the network fails to train entirely and the algorithm terminates prematurely. These are called run errors, and ours occur because the math range available to Python for computing equations is bounded: floats extend to roughly 1.8 × 10^308 (this can be found by typing "import sys" into the Python command line, followed by the second command "sys.float_info"), so the exponential in the logistic function can overflow this range. The following Python output shows such an error occurring; the final line states "math range error". The remaining jargon describes the error's route through the Python code, starting with the initialisation command and ending at the logistic function definition:
[Figure 2.9: An example run error Python output from the network]

Mathematically, we can see this happening. In the lines numbered 85-87, which are boxed in white, the code has defined what we called in^{(l)}_j. For this network we only have one layer using the sigmoid function, i.e. l = 1. Recall the following equations:

in_j = \sum_i x_i w_{ij}   (2.3.1)

and also

sig(sum = in_j) = a_j(in_j) = \frac{1}{1 + e^{-in_j}}   (2.3.2)

If in_j becomes a sufficiently large negative number, the term e^{-in_j} exceeds the computing range, causing the premature termination of the network's learning algorithm. If we consider this in terms of limits:

\lim_{in_j \to -\infty} sig(sum) = 0   (2.3.3)

Therefore, as in_j → -∞ the output layer would only receive an input from the bias node, causing giant error. As the bias node is only a constant, the network cannot converge on the targets for our input data, because our target is quadratic. Similarly, this is the case for the upper bound:

\lim_{in_j \to +\infty} sig(sum) = 1   (2.3.4)
If in_j is sufficiently large then the logistic nodes essentially become their own bias nodes, and the same problem occurs. It should be noted that the size of the input to the logistic nodes is based upon the initialised weights, which are randomly distributed between -0.5 and 0.5, and the training data only has inputs between 0 and 1. Therefore, to push |in_j| out of computing range, the adjustment of the weights in training must cause significant change; and as this adjustment, Δw_{ij}, depends on the change in error with respect to the weights, the network must either be diverging, and hence changing the weights more and more each epoch, or be stuck in a local minimum, effectively adding a constant weight change epoch after epoch. The former can be seen in Figure 2.10a and the latter in Figure 2.10b.

[Figure 2.10: Two more examples of run errors. (a) A run error caused by divergence of ∂E/∂w_{ij} from the global minimum. (b) A run error caused by a network stuck in a local minimum of relatively large value.]
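The mechanics can be reproduced directly in the standard Python math module. The clamp at the end is a generic defensive pattern, shown here as an assumption about how one might guard against it; it is not part of this report's program:

import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

print(logistic(745.0))   # exp(-745) underflows harmlessly: output saturates at 1.0
try:
    logistic(-745.0)     # exp(745) overflows the float range
except OverflowError as err:
    print(err)           # "math range error", as in Figure 2.9

# A common defensive fix is to clamp the total input before exponentiating:
def safe_logistic(z, limit=700.0):
    return 1 / (1 + math.exp(-max(-limit, min(limit, z))))

print(safe_logistic(-745.0))  # a tiny positive number instead of a run error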
In summary, run errors can occur, but this just indicates a network that would have been extremely inaccurate anyway. It comes down to the weight initialisation causing divergence from the global minimum, which further indicates the need for multiple run-throughs to find the ideal solution.

2.4 Conclusions

In general, there are a great number of variables and factors that influence how efficient and accurate an ANN will be. The learning rate is vital, the number of epochs plays a part in accuracy, the initialisation of the weights is paramount, and the size of the training data set impacts the success of the Backpropagation algorithm.

The most influential factor is certainly the range in which the weights are initialised. Given an appropriate set of starting weights, the system can either converge very quickly or diverge significantly, to the extent of a run error. This was a clear problem even in a small network. Now, if we imagine an even bigger network with more weights to randomise and train, what will happen? Will an increase in hidden nodes allow for a more accurate network, or will the training become substantially more difficult with the increased number of weights to be altered? We will investigate an answer to these questions in the following chapter.

It can be argued that the second most important factor regarding successful training is the learning rate. The "Incredulous Result" was stuck in a minimum of high error, but the learning rate was too small, with regards to Equation 1.4.5, to escape. Each network has different requirements, and predicting an ideal learning rate is extremely difficult. The learning rate chosen for this example was based on experimentation to find an η that results in a trained network similar to the "Great Result" for as high a proportion of run-throughs as possible. One cannot consider the learning rate the most important factor because it is much harder to predict than the weight initialisation range: one can guess, from the bounds of the sigmoid function, the weights necessary to output the magnitude of the target data, whereas the learning rate only has its impact after such an initialisation, from the completion of the first epoch, and therefore depends on that randomisation. Therefore the weights are the most important factor. This begs the question: are there more appropriate ways to initialise the weights than a random distribution within a bounded interval? There certainly are, and this concept will be revisited later.

One should note that the training data is of significant importance too. The reason for rating its importance lower than the learning rate and weight initialisation is that it is unlikely one would want to build a network with a very limited training set to begin with, on the basis that the result would be highly unreliable. If the training set is too small, then naturally the Backpropagation algorithm will struggle to get a good picture of the function we are trying to teach it. This would indicate the need for a large number of epochs to ensure accuracy. Unfortunately, this leads to another problem, called Overfitting: the network is taught to predict the training data accurately, but can become inaccurate on all other data points. For example, take the network from above. Instead of outputting a curve close to x², it could output a curve similar to a sine curve with period 0.2, waving through all the training points but being a completely inaccurate estimation for all other points between 0 and 1. Extensive details of this phenomenon shall be discussed in Chapter 3.

Due to such a number of impacting factors, research is very active. To conquer the problems faced, we must first discover the limitations of a single layer feedforward MLP and then consider methods of countering them. We begin with the Universal Approximation Theorem.
Chapter 3

Universal Approximation Theorem

3.1 Theorem

The Universal Approximation Theorem (UAT) formally states:

Let σ(·) be a non-constant, bounded and monotonically increasing continuous function. Let I_n denote the n-dimensional unit hypercube [0, 1]^n, and define the space of continuous functions on the unit hypercube as C(I_n). Then for any f(x) ∈ C(I_n), with x ∈ I_n, and any ε > 0, there exists an integer N such that:

F(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right)   (3.1.1)

is an approximate realisation of the function f(x), where c_i, b_i ∈ R and w_i = (w_{i1}, ..., w_{in}) ∈ R^n. Therefore:

|F(x) - f(x)| < ε   ∀x ∈ I_n   (3.1.2)

Given that our logistic function is a non-constant, bounded and monotonically increasing continuous function, we can directly apply this to a single hidden layer MLP. If we now claim that we have an MLP with n input nodes and N hidden nodes, then F(x) represents the output of such a network, with f(x) our respective target given an input vector x. We can appropriately choose our hidden to output connection weights such that they equal the c_i, and let the b_i represent our bias node in the hidden layer. Finally, normalising our training data to lie in the interval [0, 1], we have a fully defined single hidden layer MLP with regards to this theorem.

Therefore we can directly apply the UAT to any single hidden layer MLP that uses a sigmoid function in its hidden layer, and conclude that any function f(x) ∈ C(I_n) can be approximated by such a network. This is extremely powerful. In Chapter 1 we began with a perceptron, which had just an input and an output layer and was unable to distinguish non-linearly separable data. By adding this one hidden layer in a feedforward network we can now not only distinguish non-linearly separable data but, under certain assumptions on our activation function, approximate any continuous function with a finite number of hidden nodes.

3.2 Sketched Proof

Cybenko in 1989 was able to detail a proof of the UAT in a paper named "Approximation by Superpositions of a Sigmoidal Function" [19]. The aim of his paper was to find the assumptions necessary for equations of the form of Equation 3.1.1 to be dense in C(I_n). In our theorem we explain that f(x) can be approximated ε-close in C(I_n), and hence F(x) gives the dense property.
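Before the formal argument, a toy numerical illustration of Equation 3.1.1 may help; it is not part of Cybenko's proof, and the parameters below are hand-picked assumptions. Steep sigmoids act as near-step functions that switch on at grid points, so their weighted sum forms a staircase approximating the target function:

import math

def sigma(z):
    return 1 / (1 + math.exp(-z))

def F(x, N=50, steep=200.0):
    """Equation (3.1.1) with hand-picked c_i, w_i, b_i: the i-th steep
    sigmoid switches on near x = i/N and contributes the increment of f
    between neighbouring grid points, so F is a staircase approximating f."""
    f = lambda u: u * u  # the target function f(x) = x^2 from Chapter 2
    return sum((f((i + 1) / N) - f(i / N)) * sigma(steep * (x - i / N))
               for i in range(N))

for x in (0.25, 0.5, 0.9):
    print(x, x * x, F(x))  # the gap shrinks as N grows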
3.2 Sketched Proof

In 1989, Cybenko detailed a proof of this result in a paper named "Approximation by Superpositions of a Sigmoidal Function" [19]. The aim of his paper was to find the assumptions necessary for equations of the form of Equation 3.1.1 to be dense in C(I_n). In our theorem we state that f(x) can be approximated ε-close in C(I_n), and this density of the functions F(x) is exactly that property. We will now investigate Cybenko's 1989 paper to prove the conditions for the theorem to hold.

Definition: First of all, let us define what it means for σ to be sigmoidal, as in Cybenko's paper [19]. σ is sigmoidal if:

    \sigma(x) \longrightarrow \begin{cases} 1 & x \to +\infty \\ 0 & x \to -\infty \end{cases}    (3.2.1)

Notice this is exactly the case for the logistic function, as shown in the previous chapter.

We can describe why Cybenko's result should hold via a logical argument, with reference to a post made by Matus Telgarsky [20]. First we recall that a continuous function on a compact set is uniformly continuous. I_n is clearly compact, so any continuous function on it is uniformly continuous and can therefore be approximated by a piecewise constant function. In his post, Telgarsky describes how a piecewise constant function can then be represented by a Neural Network as follows [20]:

• An indicator function is defined as: given a set X and a subset Y ⊆ X, then for any x ∈ X,

    I_Y(x) = \begin{cases} 1 & x \in Y \\ 0 & \text{otherwise} \end{cases}    (3.2.2)

• For each constant region of the piecewise constant function we can form a node within a Neural Network that effectively acts as an indicator function, and multiply the node's output by a weighted connection equal to the constant required.

• We want this Neural Network to use a sigmoidal function, as in the form of F(x) in Equation 3.1.1, and to form such an indicator function from sigmoidal nodes we can take advantage of the limits defined above: the weighted connections feeding such a node can be made large and positive or large and negative, so that its output is arbitrarily close to 1 or 0 respectively.

• The final layer of the Neural Network needs just a single node whose output is the sum of these "indicators" multiplied by the appropriately chosen weights, approximating the piecewise constant function. (A numerical sketch of this construction follows this list.)
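The sketch below illustrates the construction numerically: the difference of two steeply scaled sigmoids approximates the indicator of an interval, and a weighted sum of such "bumps" approximates a piecewise constant function. The steepness k and the chosen regions are illustrative assumptions.

    import math

    def sigma(z):
        return 1.0 / (1.0 + math.exp(-z))

    def indicator(x, lo, hi, k=200.0):
        """Approximate I_[lo, hi](x) by the difference of two steep sigmoids.

        Large positive/negative inputs push sigma arbitrarily close to 1/0,
        exactly the limiting behaviour used in the argument above.
        """
        return sigma(k * (x - lo)) - sigma(k * (x - hi))

    def piecewise_constant(x):
        # One weighted "indicator node" per constant region of the target.
        pieces = [(0.0, 0.25, 0.1), (0.25, 0.5, 0.6), (0.5, 1.0, 0.3)]
        return sum(c * indicator(x, lo, hi) for lo, hi, c in pieces)

    for x in (0.1, 0.3, 0.7):
        print(x, round(piecewise_constant(x), 3))  # approx. 0.1, 0.6, 0.3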
This is what we shall now attempt to show mathematically. Defining M(I_n) as the space of finite signed Borel measures on I_n, we are in a position to explain what it means for σ to be discriminatory.

Definition: σ is discriminatory if, for a measure µ ∈ M(I_n),

    \int_{I_n} \sigma\!\left( \sum_{j=1}^{n} w_j x_j + b \right) d\mu(x) = 0    ∀w ∈ R^n, b ∈ R    (3.2.3)

implies that µ = 0. Notice this integrand takes the same form as the terms in Equation 3.1.1. With this definition we are now in a position to consider the first Theorem of Cybenko's paper [19]:

Theorem 1: Let σ be a continuous discriminatory function. Then given any f ∈ C(I_n) and any ε > 0, ∃F(x) of the following form:

    F(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right)    (3.2.4)

such that

    |F(x) - f(x)| < ε    ∀x ∈ I_n.    (3.2.5)

This theorem is extremely close to the UAT, but it does not yet impose the conditions on σ required for the application to Neural Networks. We will now prove this theorem.

3.2.1 Proof of Theorem 1

To fully understand and investigate the proof in Cybenko's 1989 paper we must begin by considering two theorems, the Hahn-Banach theorem and the Riesz-Markov-Kakutani theorem, whose proofs will be omitted.

The Hahn-Banach Theorem [21, 22]: Let V be a real vector space, p : V → R a sublinear function (i.e. p(λx) = λp(x) ∀λ ∈ R^+, x ∈ V, and p(x + y) ≤ p(x) + p(y) ∀x, y ∈ V) and φ : U → R a linear function on a linear subspace U ⊆ V which is dominated by p on U (i.e. φ(x) ≤ p(x) ∀x ∈ U). Then there exists a linear extension ψ : V → R of φ to the whole space V such that:

    ψ(x) = φ(x)    ∀x ∈ U    (3.2.6)
    ψ(x) ≤ p(x)    ∀x ∈ V    (3.2.7)

Now defining C_c(X) as the space of continuous, compactly supported, complex-valued functions on a locally compact Hausdorff space X (a Hausdorff space is one in which any two distinct points of X can be separated by neighbourhoods [23]), we can state the Representation Theorem.

Riesz-Markov-Kakutani Representation Theorem [24, 25, 26]: Let X be a locally compact Hausdorff space. Then for any positive linear functional ψ on C_c(X) there exists a unique Borel measure µ on X such that:

    ψ(f) = \int_X f(x) \, d\mu(x)    ∀f ∈ C_c(X)    (3.2.8)

With these two theorems we are now in a position to understand Cybenko's proof of Theorem 1 (written in italics), adapted to our notation from his paper [19]:
Let S ⊂ C(I_n) be the set of functions of the form F(x) as in Equation 3.1.1. Clearly S is a linear subspace of C(I_n). We claim that the closure of S is all of C(I_n).

Here, Telgarsky helps find our route of argument [20]. The single node in the output layer, as defined in our logical argument earlier, is a linear combination of the elements in the previous layer. The nodes in the hidden layer are functions, and thus the linear combination formed at the output node is also a function, contained in the subspace of functions spanned by the hidden layer's outputs. This subspace has the same properties as the space spanned by the hidden node functions, but we need to show that it is closed. Thus Cybenko argues, by means of contradiction, that the closure of this subspace is in fact all of C(I_n).

Assume that the closure of S is not all of C(I_n). Then the closure of S, say R, is a closed proper subspace of C(I_n). By the Hahn-Banach theorem, there is a bounded linear functional on C(I_n), call it L, with the property that L ≠ 0 but L(R) = L(S) = 0. By the Riesz(-Markov-Kakutani) Representation Theorem, this bounded linear functional L is of the form:

    L(h) = \int_{I_n} h(x) \, d\mu(x)    (3.2.9)

for some µ ∈ M(I_n), for all h ∈ C(I_n). In particular, since σ(\sum_{j=1}^{n} w_j x_j + b) is in R for all w and b, we must have that:

    \int_{I_n} \sigma\!\left( \sum_{j=1}^{n} w_j x_j + b \right) d\mu(x) = 0    (3.2.10)

for all w and b. However, we assumed that σ was discriminatory, so this condition implies that µ = 0, contradicting our assumption. Hence, the subspace S must be dense in C(I_n).

This proof shows that if σ is continuous and discriminatory then Theorem 1 holds. All we need to do now is show that any continuous sigmoidal function σ is discriminatory. This will then give us all the ingredients to prove the Universal Approximation Theorem. Cybenko gives us the following Lemma to Theorem 1:

Lemma 1: Any bounded, measurable sigmoidal function σ is discriminatory. In particular, any continuous sigmoidal function is discriminatory.

The proof of this Lemma is heavily measure-theoretic and is omitted from this report; it can be found via the reference for Cybenko's 1989 paper for perusal [19].

Finally, we can state Cybenko's second theorem, which gives us the Universal Approximation Theorem and shows that an MLP with only one hidden layer and an arbitrary continuous sigmoidal
function allows for approximation of any function f ∈ C(I_n) to arbitrary precision:

Theorem 2: Let σ be any continuous sigmoidal function. Then given any f ∈ C(I_n) and any ε > 0, ∃F(x) of the following form:

    F(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right)    (3.2.11)

such that

    |F(x) - f(x)| < ε    ∀x ∈ I_n.    (3.2.12)

The proof of this theorem is a combination of Theorem 1 and Lemma 1.

3.2.2 Discussion

Interestingly, Cybenko mentions in his 1989 paper that in Neural Network applications, sigmoidal activation functions are typically taken to be monotonically increasing. We assume this for the UAT, but for the results in Cybenko's paper and the two theorems we investigated, monotonicity is not needed. Although this appears an unnecessary condition on the UAT, a monotonically increasing activation function allows for simpler approximating and is therefore a sensible condition to include. If the activation function were not monotonically increasing, one would generally expect training a network either to take longer or to struggle to converge on the global minimum of the Error function. This is because the activation function would have local minima, at which its derivative is zero, and hence some weight changes in the Backpropagation algorithm could vanish there.

In addition to this proof from Cybenko, it is worth noting that two years later, in 1991, Kurt Hornik published a paper called "Approximation Capabilities of Multilayer Feedforward Networks", in which he proved that the ability of a single hidden layer MLP to approximate all continuous functions is down to its architecture rather than the choice of activation function [27].

With the Universal Approximation Theorem in mind, we can conclude that if a single hidden layer MLP fails to learn a mapping under the defined constraints, it is not down to the architecture of the network but to the parameters that define it. For example, this could be poorly initialised weights, it could be the learning rate, or it could even be an insufficient number of hidden nodes, leaving too few degrees of freedom to produce a complex enough approximation.

Another thing to note is that this theorem only tells us we have the ability to approximate any continuous function using such an MLP with a finite number of hidden nodes. It does not tell us what this finite number actually is, or even give us a bound. When we want to compute extremely complex problems, we are going to require a large number of hidden nodes to cope with the number of mappings represented. The number of hidden nodes is important with regard to an accurate network, and we will now consider the consequences of a poorly chosen hidden layer size.

3.3 Overfitting and Underfitting

Once we require a huge number of nodes to solve complex approximations, the number of calculations the network has to do increases significantly in an MLP. Considering i input nodes and o
output nodes, adding one more hidden node increases the number of weighted connections by i + o. Presumably, if a large number of hidden nodes are required, a large number of input and output nodes are also present. Ideally we want to minimise this number of additional calculations, because a Neural Network that takes days, weeks or even months to train is completely inefficient, and this was one of the reasons Neural Networks fell out of popularity in the '70s. However, balancing training time and efficiency against a network that can train accurately is surprisingly difficult.

Given a set of data points, the idea of a Neural Network is to teach itself an appropriate approximation to the data whilst generalising accurately to unseen data. Unlike in Chapter 2, in which the network was trained using perfect targets for the training data, in reality the training data is likely to have noise. Noise can be defined as the deviation from the ideal solution. For example, using the mapping x², our training data could in fact contain the mappings 1 → 1.04 and 0.5 → 0.24. These are not precise, but in practice not every data set can be completely accurate; the noise is what causes this inaccuracy. We want our Neural Network to find the underlying function of the training data despite the noise, as such:

Figure 3.1: An example of a curve fitting to noisy data. Image taken from a lecture by Bullinaria in 2004 [28]

The blue curve represents the underlying function, similar to how our Example had the underlying function x², and the circles represent the noisy data points forming the training set. We want a Neural Network to approximate a function, using this noisy data set, as close to the blue curve as possible.

However, two things may occur in the process:

1. Underfitting is a concept in which a Neural Network has been "lazy" and has not learned how to fit the training data at all, let alone generalise to unseen data. This yields an all-round poor approximator.
2. Overfitting is the opposite of Underfitting. In this concept a Neural Network has worked extremely hard to learn the training data. Unfortunately, although the training data may have been approximated perfectly, the network has poorly "filled in the gaps" and thus generalised incompetently. This yields a poor approximator to unseen data.

To understand these concepts, consider the following figure, which illustrates these cases at their extremes:

Figure 3.2: An illustration of Underfitting (left) and Overfitting (right) of an ANN. Image taken from a lecture by Bullinaria in 2004 [28]

The graph on the left shows extreme Underfitting of the training data; clearly the red best-fit line would be a poor approximation for almost all points on the blue curve. The graph on the right shows extreme Overfitting, in which the network has effectively generalised using a "dot-to-dot" method and again provides a poor approximation to all points outside the training set. Why might this occur? First let's consider the concepts behind the error of a Neural Network.

3.3.1 A Statistical View on Network Training: Bias-Variance Tradeoff

Our aim here is to identify the expected prediction error of a trained Neural Network when presented with a previously unseen data point. Ideally we want to minimise the error between the network's approximated function and the underlying function of the data. Due to the noise in our training data, we may not want to truly minimise the error of our output compared to our target, which was:

    E_i = \frac{1}{2} \left( t_i - a^{(L)}(x_i) \right)^2    (3.3.1)

If we have noisy data then the minimum of this error could cause Overfitting. We want to ensure the ANN is able to generalise beyond the noise to the underlying function, and this generalisation will not give the minimal error on a data-point-by-data-point basis.

In 2013, Dustin Stansbury wrote an article called "Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff" [29], and we shall follow his argument investigating Underfitting,
Overfitting and the Bias-Variance Tradeoff with adapted notation, with close attention to the section named "Expected Prediction Error and the Bias-variance Tradeoff".

Firstly, let f(x) be the underlying function we wish to accurately approximate and let F(x) be the approximating function generated by the Neural Network. Recalling that x are the training data inputs and t their respective targets, this F(x) has been fit using our x-t pairs. Therefore we can define an expected approximation over all data points we could present the network with as:

    Expected approximation over all data points = E[F(x)]    (3.3.2)

Similar to Stansbury, our overall goal is to minimise the error between previously unseen data points and the underlying function we've approximated, f(x). Therefore we want to find the expected prediction error of a new data point (x*, t* = f(x*) + ϵ), where ϵ is the term that accounts for noise in the new data point. Thus we can naturally define our expected prediction error as:

    Expected prediction error = E[(F(x*) - t*)^2]    (3.3.3)

To achieve our overall goal, we therefore intend to minimise Equation 3.3.3 instead of minimising Equation 3.3.1. This does not affect the Backpropagation learning algorithm imposed in Chapter 1, and theoretically these errors will be roughly the same given a successfully trained Neural Network.

Let's investigate our expected prediction error further. First we will take the following statistical equations for granted, which can also be found in Stansbury's article [29]:

    Bias of the approximation F(x) = E[F(x)] - f(x)    (3.3.4)
    Variance of the approximation F(x) = E[(F(x) - E[F(x)])^2]    (3.3.5)
    E[X^2] - E[X]^2 = E[(X - E[X])^2]    (3.3.6)

The bias of the approximation function represents the deviation between the expected approximation over all data points, E[F(x)], and the underlying function f(x). If we have a large bias, then we can conclude that our approximation function is generally a long way from the underlying function we are aiming for. If the bias is small, then our approximation function is an accurate representation of the underlying function.

The variance of the approximation function is the average squared difference between an approximation function based on a single data set (i.e. F(x)) and the expected approximation over all data sets (i.e. E[F(x)]). A large variance indicates that our single data set yields a poor approximation to all data sets; a small variance indicates a good one. Preferably, we would like as small a bias and variance as possible, to give us the best approximation to the underlying function. Equation 3.3.6 is a well-known identity relating these quantities; its statement and proof were posted online by Dustin Stansbury [30] as an extension of his argument towards the Bias-Variance Tradeoff [29].

Now to investigate our expected prediction error further. This follows Stansbury's argument in his section named "Expected Prediction Error and the Bias-variance Tradeoff" [29]
with adapted notation:

    E[(F(x*) - t*)^2]
      = E[F(x*)^2 - 2F(x*)t* + t*^2]
      = E[F(x*)^2] - 2E[F(x*)]E[t*] + E[t*^2]
      = E[(F(x*) - E[F(x*)])^2] + E[F(x*)]^2 - 2E[F(x*)]f(x*) + f(x*)^2 + E[(t* - f(x*))^2]
      = E[(F(x*) - E[F(x*)])^2] + (E[F(x*)] - f(x*))^2 + E[(t* - f(x*))^2]
      = variance of F(x*) + (bias of F(x*))^2 + variance of the target noise    (3.3.7)

(Here the second step uses the independence of F(x*) and the new noise, with E[t*] = f(x*) for zero-mean noise, and the third step applies Equation 3.3.6 to both E[F(x*)^2] and E[t*^2].)

As Stansbury notes, the variance of the target data noise provides a lower bound on the expected prediction error. Logically this makes sense: it indicates that if our data set has noise, and hence isn't accurate, we will always have some error in the prediction. Now we can see the effects of bias and variance on our expected prediction error, which is what we want to minimise.

Similarly to Bullinaria in his 2004 lecture [28], we can now investigate our extreme examples with regard to Equation 3.3.7. If we pretend our network has Underfitted extremely and take F(x) = c, where c is some constant, then we are going to have a huge bias. Our variance will be zero, but overall this gives a large expected prediction error. Alternatively, assume our network has Overfitted extremely and F(x) is a very complicated function of large order that fits our training data perfectly. Then our bias is zero, but the variance of our approximation is equal to the variance of the target noise. This variance could be huge in practice, depending on the data set you present your trained Neural Network with.

This defines the Bias-Variance Tradeoff. Ideally we wanted to minimise both the bias and the variance. However, this explanation shows that as one decreases, the other increases, and vice versa. This means there is a point at which the bias and variance together provide the smallest expected prediction error, which completes our aim. If we favour bias or variance too much, then we risk running into the problems of Underfitting and Overfitting as described above.
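The decomposition in Equation 3.3.7 can be checked empirically. The sketch below repeatedly fits polynomials of different degrees to fresh noisy samples of the underlying function f(x) = x² and estimates the bias² and variance of the fit at an unseen point. The degrees, noise level and trial count are illustrative assumptions, and the polynomial fit is a stand-in for the network's approximation function.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        return x ** 2                     # underlying function, as in the Example

    x_train = np.linspace(0.0, 1.0, 21)   # 21 training inputs on [0, 1]
    x_star = 0.65                         # a previously unseen test point
    noise_sd = 0.05                       # illustrative noise level

    def predictions(degree, trials=2000):
        """F(x*) over many fresh noisy training sets, for one model complexity."""
        preds = np.empty(trials)
        for i in range(trials):
            t = f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
            coeffs = np.polyfit(x_train, t, degree)
            preds[i] = np.polyval(coeffs, x_star)
        return preds

    for degree in (0, 2, 12):             # underfit, well matched, overfit
        p = predictions(degree)
        bias_sq = (p.mean() - f(x_star)) ** 2
        variance = p.var()
        # Equation 3.3.7: expected error = variance + bias^2 + noise variance
        print(degree, bias_sq, variance, bias_sq + variance + noise_sd ** 2)

The degree-0 fit (a constant, the Underfitting extreme) shows a large bias² with tiny variance, while the degree-12 fit (the Overfitting direction) shows near-zero bias² with inflated variance, matching the discussion above.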
3.3.2 Applying Overfitting and Underfitting to our Example

Having now researched the concepts of Overfitting and Underfitting and the reasons for them occurring, we can illustrate them using our Example in Chapter 2. To put this into practice, a couple of alterations had to be made to the code used for Chapter 2 in Appendix A:

1. Firstly, to make the Overfitting and Underfitting clearly visible, noise was added to the data points in our training set. I also added more data points to the training set, to finally give the following set of training data:

(0, -0.05), (0.05, -0.0026), (0.1, 0.011), (0.15, 0.022), (0.2, 0.041), (0.25, 0.0613), (0.3, 0.093), (0.35, 0.12), (0.4, 0.165), (0.45, 0.2125), (0.5, 0.26), (0.55, 0.295), (0.6, 0.364), (0.65, 0.4225), (0.7, 0.49), (0.75, 0.553), (0.8, 0.651), (0.85, 0.72), (0.9, 0.805), (0.95, 0.9125), (1, 1.04)

Recall that our original data set consisted of x ∈ {0, 0.1, 0.2, ..., 1}.

2. The number of epochs was increased from 10000 to 40000.

3. The Neural Network was originally defined to produce a network with 2 input nodes, 3 hidden nodes and 1 output node, which includes the bias nodes on the input and hidden layers. For the following examples the number of hidden nodes will be changed appropriately to yield the desired results. This number will be specified for each example and will include the bias node.

Everything else remains the same, including the learning rate and the number of test data points to be plotted on the graphs. Let's begin by considering a control graph:

Figure 3.3: A graph to show convergence can occur with the new code setup. This network used 4 hidden nodes.

Figure 3.3 shows a graph generated by an MLP which used 4 hidden nodes, and demonstrates that the new setup for our Example Network can produce results extremely similar to the "Great Result" from Chapter 2. Now to demonstrate Overfitting and Underfitting of the network.

Overfitting can occur if we allow the bias to be low, and we can generate this problem by building the network to be too complex. Therefore the network was built with an excessive number of hidden nodes, 11 in total. Similarly, we can illustrate Underfitting by allowing the network to be too simple and thus unable to appropriately map each training data point to its target:
(a) A network whose training has caused Overfitting. A total of 11 hidden nodes were used. (b) A network whose training has caused Underfitting. A total of 2 hidden nodes were used.

Figure 3.4: Two graphs representing our Example Network's ability to Overfit (left) or Underfit (right) under a poorly chosen size of hidden layer.

Figure 3.4a, under inspection against the training set, demonstrates Overfitting, although not to the extreme discussed earlier. The excessive number of hidden nodes allows the network to form a much more complex approximation function than a quadratic, and it therefore attempts to hit all the noisy training data points as best it can. This could be exaggerated further by increasing the number of epochs, allowing the network to train longer and become even more suited to the training data. However, for fairness and continuity's sake, I did not change this for the Overfitting example.

Figure 3.4b shows clear Underfitting. Having a total of just 2 hidden nodes, of which only the sigmoid node is able to produce a function that isn't linear, is nowhere near enough to map our 21 data points with any sort of accuracy. The network therefore compromises by effectively approximating a step function.

This clearly indicates the necessity for an appropriately sized hidden layer in a single hidden layer MLP, or indeed in any Neural Network that can exhibit Underfitting and Overfitting. Too many hidden nodes allow the network to Overfit, and too few give the network no ability to map all data points accurately, so it Underfits. So in general, how can we avoid these problems?

3.3.3 How to Avoid Overfitting and Underfitting

Bullinaria highlights some important analysis of methods for preventing Overfitting and Underfitting [28]. To help prevent Overfitting we can consider the following when building our network:

• Do not build the network with too many hidden nodes. If Overfitting is clearly occurring during run-throughs, reducing the number of hidden nodes should help.

• The network can be instructed to cease training when there is evidence of Overfitting beginning to occur. If the error on a held-out test set of data increases after each epoch of training for a pre-defined threshold of consecutive measurements, say 10 epochs, then the network can be commanded
to stop training and use the solution from before the errors began to increase. It is important not to use the training data for this, because Overfitting can then occur without being noticed. (A minimal sketch of this early-stopping rule follows these lists.)

• Adding noise to the training data is in fact recommended by Bullinaria [28], because it gives the network the chance to find a smoothed-out approximation. If instead one data point were an anomaly in the data set (and could be described as the only point with noise), this could dramatically affect the network's ability to predict unseen data in a neighbourhood of that anomaly.

To assist in avoiding Underfitting, it helps to be mindful of the following:

• If Underfitting is occurring, increasing the number of parameters in the system is necessary. This can be done by increasing the size of the hidden layer or even adding other layers. If the system has too few parameters to represent all the mappings, it will be unable to approximate each mapping accurately.

• The length of training must be long enough to allow suitable convergence towards the global minimum of the Error function. If you only train your network for 1 epoch, chances are your approximation function will be incredibly inaccurate. On the other hand, you don't want to train too long and risk Overfitting.
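A minimal sketch of the early-stopping rule from the first list above. The names train_one_epoch, validation_error and the weight-copying helpers are hypothetical stand-ins for the surrounding training code, not functions from Appendix A.

    def train_with_early_stopping(network, train_one_epoch, validation_error,
                                  max_epochs=40000, patience=10):
        """Stop once the held-out error has risen for `patience` consecutive
        epochs, then restore the best weights seen (the pre-increase solution).
        """
        best_error = float("inf")
        best_weights = network.copy_weights()   # hypothetical helper
        bad_epochs = 0
        for _ in range(max_epochs):
            train_one_epoch(network)
            err = validation_error(network)     # must NOT be the training set
            if err < best_error:
                best_error = err
                best_weights = network.copy_weights()
                bad_epochs = 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        network.set_weights(best_weights)       # hypothetical helper
        return network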
3.4 Conclusions

In this Chapter we learned about the Universal Approximation Theorem, which informally states that we can approximate any continuous function to arbitrary accuracy using a single hidden layer MLP with an arbitrary sigmoidal function. We then investigated Cybenko's 1989 paper [19] detailing his proof of the matter, and became aware that it is the architecture of the MLP that allows this, rather than the choice of activation function, as proved by Kurt Hornik in 1991 [27]. Furthermore, the Universal Approximation Theorem does not provide us with any sort of bound or indication of the number of hidden nodes necessary to do this; it just provides the knowledge that we can approximate any continuous function. This brought into question the considerations necessary for appropriately choosing the size of our hidden layer.

Moreover, we investigated the problems associated with this choice, most notably Overfitting and Underfitting. We discovered that the Bias and Variance of our data set and ultimate approximation function were closely linked, and we had to find a favourable Tradeoff in which the sum of the Bias squared and the Variance is minimised, so as to minimise the expected prediction error of a new data point presented to the trained network. Tending towards a low Bias causes Overfitting, while tending towards a low Variance causes Underfitting.

Finally, we illustrated, using an adapted version of our Example Network from Chapter 2, that under these conditions we can show evidence of Underfitting and Overfitting occurring. We were able to show Underfitting by minimising the number of nodes in the hidden layer, depriving the network of the parameters required to map all of our data accurately. Similarly, excessively increasing the size of the hidden layer allowed too great a number of mappings, which resulted in Overfitting.

It is fair to say that the choice of the number of hidden nodes is quite delicate. There are no concrete theorems to suggest how many hidden nodes to choose for an MLP with regard to input and output layer sizes and training data set sizes. Therefore we can conclude that slowly guessing and increasing the number of nodes in a single hidden layer is practically useless when faced with a complex problem, for example image recognition. Given a 28x28 pixel picture, we already need 784 input nodes to begin with. Combining this with the length of time it can take to train a network of such huge size calls into question the efficiency of a single hidden layer MLP.

Additionally, considering that preventing Underfitting essentially comes down to a sensible training time and a suitable number of parameters to encompass all the necessary mappings, perhaps it is useful to contemplate a greater number of hidden layers, especially after appreciating the improvement the single hidden layer MLP made on the perceptron. This shall be our next destination.
Chapter 4

Multiple Hidden Layer MLPs

4.1 Motivation for Multiple Hidden Layers

In Chapter 3 we ascertained that to prevent Underfitting we essentially just need to ensure there are enough parameters in the MLP to allow for all the mappings of our data (and to train the network for suitably long). We also theorised that one layer may cause a problem in approximation, due to the unknown number of hidden nodes required for an approximation to be accurate to our underlying continuous function, as in the Universal Approximation Theorem. As well as the problem of approximation, a large number of hidden nodes requires a long training time, as each connection requires a calculation by the Backpropagation algorithm to update the associated weight.

We shall now consider the impact of adding one extra hidden node to an MLP, and then compare this with other MLPs which have the same total number of hidden nodes but a slightly different architecture (i.e. more hidden layers). We will compare the number of calculations each MLP would have to undergo in training, via the number of weighted connections in each MLP, and simultaneously compare the complexity each MLP is able to model.

Overall, we aim to minimise the number of calculations undertaken by the Neural Network and maximise its flexibility in the number of mappings it can consider. In theory, the smaller the number of calculations the Neural Network has to make, the shorter training will be. Similarly, maximising the flexibility of the network by finding the simplest architecture for a given flexibility requirement should also decrease training time. This is because the simplest architecture really means the fewest total nodes in the network, and therefore fewer calculations between inputting x and receiving an output F(x).

First let's define our base MLP, the one to which we will add a hidden node before restructuring the hidden nodes for comparison. Let the number of connections := #connections. Similarly, let the number of routes through the network (i.e. any path from any input node to any output node) := #routes. Then our base MLP will have the structure of 2 input nodes, 4 hidden nodes and 2 output nodes (i.e. structure = 2-4-2):
Figure 4.1: Our base MLP (layers l = 0, l = 1, l = L = 2; structure 2-4-2; #connections = 16; #routes = 16)

We can check the number of connections and routes by hand to find that these numbers are correct. Recalling that our input layer is l = 0 and our output layer is l = L allows us to define the following equations for calculating #connections and #routes, with |l| := the number of nodes in layer l:

    #connections := \sum_{l=0}^{L-1} |l| \cdot |l+1|    (4.1.1)

    #routes := \prod_{l=0}^{L} |l|    (4.1.2)

If we check these equations against Figure 4.1, we find the numbers match up. We can understand where these equations come from by a simple logical argument. For the number of connections, we know an MLP must be fully connected, as this is one of its defining features. Therefore every node in one layer connects to every node in the following layer, and this provides Equation 4.1.1. For the number of routes, we have 2 choices for the first node on our path in Figure 4.1, then 4 choices for the second node, and 2 choices for the output node to complete the path. As an MLP is fully connected, this generates Equation 4.1.2.

Next we intend to add one more node to the hidden layer of our base MLP and then go about restructuring the hidden nodes. This will give us a total of 5 hidden nodes to work with, and we can see the increase in #connections and #routes:
Figure 4.2: An MLP which has one additional hidden node compared to our base MLP (structure 2-5-2; #connections = 20; #routes = 20)

#connections has increased by 4, and similarly #routes has increased by 4. The addition of this node increases both the number of mappings available and the number of calculations for training, as expected. Of course this is not what we are aiming for, and it's worth noticing that all we've done is increase our risk of Overfitting. Now let's investigate what happens if we change the structure of our hidden nodes by putting our additional hidden node into its own second hidden layer, as such:

Figure 4.3: A restructured MLP of Figure 4.2 (structure 2-4-1-2; #connections = 14; #routes = 16)

We find the number of routes this network provides is the same as for our base MLP. However, our new MLP with 2 hidden layers has only 14 connections, in comparison to the original 16. This is good news: we achieve the same flexibility for mappings, but fewer calculations are required in the Backpropagation algorithm and hence training is faster.

Further, let's consider a few more architectures of the MLP with a total of 5 hidden nodes:

Figure 4.4: A second restructure of the MLP in Figure 4.2 (structure 2-3-2-2; #connections = 16; #routes = 24)
Figure 4.5: A third restructure of the MLP in Figure 4.2 (structure 2-2-2-1-2; #connections = 12; #routes = 16)

Figure 4.6: A fourth restructure of the MLP in Figure 4.2 (structure 2-2-1-1-1-2; #connections = 10; #routes = 8)

Figure 4.4 is arguably the most favourable choice of structure for 5 hidden nodes, given the sizes of the input and output layers we have, because it gives us the greatest ratio of routes to connections, at 3 : 2. This maximises the complexity of the network with minimal calculations, and hence minimal training time, as well as keeping the network simple.

Figure 4.5 improves on Figure 4.3 in that it gives the same number of routes with even fewer connections. This appears as if it could also be an optimal choice of network with 2 hidden layers for our base MLP.

As one would expect, stretching the number of hidden nodes across more and more layers results in poorer networks for our aims. Naturally, when building a network, one would not believe a series of one-node layers to have any benefit; this would just cause serious Underfitting, and Figure 4.6 is here to demonstrate how extending a network by more and more layers isn't automatically beneficial.

We can conclude that the architecture of the network has interesting benefits for our intention to minimise training time, by minimising connections and hence calculations, without losing the approximation capabilities our Neural Network is able to produce. One may think applying our logic to our base MLP could show similar benefits, so let's check:

Figure 4.7: An advantageous restructure of our base MLP in Figure 4.1 (structure 2-2-2-2; #connections = 12; #routes = 16)

As one may have surmised, we can theoretically decrease training time without loss of generality in the number of mappings available. This form of our base MLP in Figure 4.1 gives us the same number of routes but with 4 fewer connections. Figure 4.7 actually yields identical results to Figure 4.5, except that Figure 4.7 uses fewer nodes and hence should be even faster in training, because there is one fewer node to undertake calculations.
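Equations 4.1.1 and 4.1.2 are easy to mechanise; the following sketch reproduces the counts quoted for each of the architectures above.

    from functools import reduce

    def connections(structure):
        """Equation 4.1.1: sum over adjacent layers of |l| * |l+1|
        (an MLP is fully connected)."""
        return sum(a * b for a, b in zip(structure, structure[1:]))

    def routes(structure):
        """Equation 4.1.2: product of all layer sizes."""
        return reduce(lambda a, b: a * b, structure)

    for s in [(2, 4, 2), (2, 5, 2), (2, 4, 1, 2), (2, 3, 2, 2),
              (2, 2, 2, 1, 2), (2, 2, 1, 1, 1, 2), (2, 2, 2, 2)]:
        print(s, connections(s), routes(s))
    # e.g. (2, 3, 2, 2) -> 16 connections, 24 routes: the 3:2 ratio of Figure 4.4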
In conclusion, the advantages of adding more hidden layers include:

• A simplified architecture without loss of mapping ability

• Shorter training times due to a decrease in calculations within the learning algorithm

These advantages are significantly amplified if we need millions of hidden nodes to satisfy an approximation function ε-close to an underlying function, as in the Universal Approximation Theorem. Breaking a large single hidden layer into multiple layers will bring benefits. For an example, assume we have n input nodes, n output nodes and 10^6 hidden nodes. Then for a single layer MLP we would have:

    #connections = 2n · 10^6
    #routes = n^2 · 10^6

If we wanted an MLP with 2 hidden layers with the same mapping ability, then we would need the sizes of our two hidden layers, l = 1 and l = 2, to satisfy |l = 1| · |l = 2| = 10^6. Assuming they are the same size for simplicity, we can let each hidden layer contain just 1000 nodes (because #routes = n · 10^3 · 10^3 · n = n^2 · 10^6 as before). We have already decreased the number of nodes in the network by 10^6 - 2000 = 998,000. This is a significant decrease in the number of calculations required before an output of the network is given. Further:

    #connections = n · 10^3 + 10^3 · 10^3 + 10^3 · n = 2000n + 10^6 < 2n · 10^6  ⟺  n > 500/999 ≈ 0.5005

and hence, for any chosen n ≥ 1, there are fewer connections as well as fewer nodes in the network. The greater n gets, the greater the decrease in connections. This describes a network whose training time would significantly decrease if a second hidden layer were implemented instead of a single layer MLP.

One may ask, "Why doesn't everyone just employ large multiple hidden layer MLPs?". As one may have suspected, training an MLP with a large number of hidden layers also has its problems. We will investigate these problems after first considering some example problems in which two or more hidden layers truly are useful.

4.2 Example Problems for Multiple Hidden Layer MLPs

Generally, we can rely on 2 hidden layer MLPs to solve most problems we can expect to come across for which Neural Networks would be an ideal instrument, for example finding obscure patterns in huge data sets, such as a pattern between how likely someone is to get a job and the number of cups of coffee they drink a week. However, it sometimes pays to include a larger number of layers.

4.2.1 Image Recognition

Let's consider simple image recognition. If we consider a U.K. passport photo, its size is 45mm by 35mm [31], which equals 1575mm². The U.K. demands professional printing, which could be variable
in pixels per millimetre, so we base this on the U.S.A. requirements of picture quality, which demand a minimum of 12 pixels per millimetre [32]. This gives 144 pixels per mm², and therefore a standard UK passport photo contains at least 226,800 pixels. If an image recognition system were using the passport photo database to recognise a headshot of a criminal being searched for, then the Neural Network would need 226,800 input nodes, one for each pixel, to teach the system how to do so. It would then require enough hidden nodes and layers to recognise the colours and facial features etc. to accurately identify the person. With this construction, the advantage of applying multiple layers to allow this feature detection is clear to see. Reducing the number of connections without losing mapping complexity would be extremely useful in allowing quicker training and running times.

4.2.2 Facial Recognition

Figure 4.8: A picture demonstrating the use of multiple hidden layers in a feedforward Neural Network

Facial recognition is a further step up from image recognition because it jumps from 2 dimensions to 3 dimensions. Facial recognition requires a Neural Network to learn extremely complex shapes, such as the contours and structure of your face, as well as eye, skin and lip colour. To teach a Neural Network to recognise such complex shapes, it must be taught to recognise simple features like colour and lines before moving on to shapes and contours, which can be done layer by layer.

Figure 4.8 is an image constructed by Nicola Jones in an article called "Computer science: The learning machines" from 2014 [33]. The images within are courtesy of an article by Honglak Lee et al. [34]. The image from Jones briefly details how each layer of this Neural Network can be seen as a feature detector, with the features to be detected getting more complex the further into the network we travel, until eventually the Neural Network is able to construct faces. As can be seen from the final image, the constructed faces still lack a significant amount of detail. This is mainly due to the lack of understanding of how to teach a network such a task, but more layers could again be included to allow even greater refinement.
4.3 Vanishing and Exploding Gradient Problem

In 1963, Arthur Earl Bryson et al. published an article regarding "Optimal Programming Problems with Inequality Constraints", to which the first arguable claim to the invention of the Backpropagation algorithm can be assigned [35]. The theories made in this paper were applied solely to programming, and not until 1974 did Paul J. Werbos apply them to Neural Networks, in his PhD thesis named "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences" [10]. Unfortunately, the lack of computing power needed to make the application efficient enough for use meant the Backpropagation algorithm went quiet until 1986, the year of the publication by Rumelhart, Hinton and Williams in which a breakthrough was made in the efficiency of the algorithm, alongside an upturn in the computing power available to run it [11]. Then, 28 years on from the Bryson article, Sepp Hochreiter finished his Diploma thesis on the fundamental problem of deep learning [36]. This thesis described the underlying issues with training large, deep Neural Networks. These issues defined significant research routes through the '90s and '00s and, although usually discussed for what are called recurrent Neural Networks (which can be described as feedforward Neural Networks that also include the concept of time and therefore allow loops within layers), we can apply them to our MLPs, as they are a form of deep Neural Network. These underlying issues are now famously known as the Vanishing and Exploding Gradient Problems.

The Backpropagation algorithm utilises gradient descent as the crux of its operation. As the number of hidden layers increases, the weight updates early in the network become products of more and more gradient terms, and such products tend to shrink or grow exponentially with depth. This leads either to a vanishing gradient, in which the weight updates become negligible towards the start of the network, or to an exploding gradient, in which the weight updates become exponentially bigger as we approach the start of the network. The former causes issues with extracting critical features from the input data, which leads to Underfitting of the training data; the latter leaves the learning algorithm entirely unstable and reduces the chances of finding a suitable set of weights.

Let's investigate why these two problems occur within the Backpropagation algorithm and understand that they are a consequence of using this learning algorithm. The backbone of this algorithm boils down to iteratively making small adjustments to our weights in an attempt to find a minimum. Ideally we would like to converge on the global minimum, but there is always the chance we will converge on a local minimum. Considering the complexity of multiple hidden layer MLPs with hundreds of nodes, chances are a local minimum will be found.

First of all, let's remind ourselves of the weight update equations from Chapter 1:

    \Delta w^{(l)}_{ij} = -\eta \frac{\partial E_i}{\partial w^{(l)}_{ij}}    (4.3.1)

represents our actual weight update equation. This is determined by the Error function derivative term:

    \frac{\partial E}{\partial w^{(l)}_{ij}} = \frac{\partial E}{\partial a^{(l)}_j} \frac{\partial a^{(l)}_j}{\partial in^{(l)}_j} a^{(l-1)}_i = \frac{\partial E}{\partial a^{(l)}_j} \, a'_j(in^{(l)}_j) \, a^{(l-1)}_i

with:

    \frac{\partial E}{\partial a^{(l)}_j} = \begin{cases} (a^{(L)}_j - t_j) & \text{if } j \text{ is a node in the output layer} \\ \sum_{k \in K} \frac{\partial E}{\partial a^{(l+1)}_k} a'_k(in^{(l+1)}_k) w^{(l+1)}_{jk} & \text{if } j \text{ is a node in any other layer} \end{cases}    (4.3.2)
recalling that a^{(l)}_j is the activation of the j-th node in layer l, and a'_j(in^{(l)}_j) is the derivative of the activation function of the j-th node in layer l with respect to the input to that node. For a node not in the output layer, we can simplify this equation by defining the error of the j-th node in layer l to be δ^{(l)}_j:

    \delta^{(l)}_j = \frac{\partial E}{\partial a^{(l)}_j} a'_j(in^{(l)}_j) = a'_j(in^{(l)}_j) \sum_{k \in K} \frac{\partial E}{\partial a^{(l+1)}_k} a'_k(in^{(l+1)}_k) w^{(l+1)}_{jk} = a'_j(in^{(l)}_j) \sum_k \delta^{(l+1)}_k w^{(l+1)}_{jk}    (4.3.3)

This is an especially important way of phrasing the weight adjustments with respect to these gradients δ^{(l)}_j. We notice that the gradients of layer l depend on all the gradients of the layers ahead of it (i.e. layers l + 1, l + 2, ..., L). This is the cause of the Vanishing/Exploding Gradient Problem.

In his (at the time of writing, soon to be published) book "Neural Networks and Deep Learning" [37], Michael Nielsen offers a simple example to demonstrate these problems. We will now adapt his example here by considering an MLP consisting of a chain of 3 hidden sigmoid nodes, one per layer:

    x --w1--> σ --w2--> σ --w3--> σ --w4--> f(x)

Let's calculate the change in error with respect to the first weight using our equations from above:

    \frac{\partial E}{\partial w_1} = x \cdot a'(in^{(1)}_1) \cdot w_2 \cdot a'(in^{(2)}_2) \cdot w_3 \cdot a'(in^{(3)}_3) \cdot w_4 \cdot a'(in^{(4)}_4) \cdot E'    (4.3.4)

where E' is the derivative of the error with respect to the output node's activation, and in this case a^{(0)} = x represents the output from the input layer. From this we can see two very distinctive dependencies of the changes to w_1: the weights, which will have been randomly initialised, and the choice of our activation function.

The weights can ultimately decide which problem we have, Vanishing or Exploding. If our weights are small then the Vanishing Gradient Problem may occur, but if we choose large weights then we can end up with an Exploding Gradient Problem. Generally, the Vanishing Gradient is more common, due to our desire to minimise our parameters: smaller weights, normalised input data and as small an output error as possible.
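The geometric shrinkage described here is easy to see numerically. The sketch below multiplies together one σ'(z) · w factor per layer, as in Equation 4.3.4; the weight and pre-activation values are illustrative assumptions, with |w| ≤ 1 as in the discussion above.

    import math

    def sigma_prime(z):
        s = 1.0 / (1.0 + math.exp(-z))
        return s * (1.0 - s)        # logistic derivative, at most 0.25

    w, z = 0.8, 0.5                 # illustrative weight and pre-activation
    for depth in (1, 3, 5, 10, 20):
        grad_factor = (sigma_prime(z) * w) ** depth
        print(depth, grad_factor)   # shrinks geometrically with depth

With these values each layer contributes a factor of roughly 0.19, so by 20 layers the early-layer gradient factor is of the order 10^-15: effectively no learning signal reaches the start of the network.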
However, let's also consider the impact of the choice of activation function. In Chapter 1 we looked at two in particular: the logistic function, as used in Chapter 2 for our Example, and the hyperbolic tangent, which we have largely ignored until now. The impact of the activation function's derivative on the change to w_1 could be significant, so let's consider the following figure:

(a) A graph representing the derivative of the logistic function. (b) A graph representing the derivative of tanh(x).

Figure 4.9: Graphs representing the derivatives of our two previously seen activation functions

Figure 4.9 shows the form of the derivatives of the logistic function and tanh(x). The logistic function's derivative satisfies σ'(x) ≤ 0.25, with its maximum at x = 0. In comparison, the hyperbolic tangent's derivative has a maximum value of 1, also at x = 0. If we consider weights all defined such that w_i ≤ 1 for i ∈ {1, 2, 3, 4}, then with the logistic function as the activation function:

    \frac{\partial E}{\partial w_1} \leq \frac{1}{4^4} \cdot E' \cdot x    (4.3.5)

which, considering we normalise our input data x ∈ [0, 1], is going to be a negligible change to the weight unless our learning rate is huge. Recalling that we don't want to choose a high learning rate, because this causes serious problems with convergence on a minimum, it is simple to see that this problem is very much unavoidable. The larger maximum value of its derivative is one of the reasons supporting the use of the hyperbolic tangent function, as well as the fact that it satisfies the qualifying conditions of the Universal Approximation Theorem.

Although this was a very simple example to demonstrate the exponential problem in the Backpropagation algorithm, it holds for much more complex networks in which the number of hidden nodes and layers is increased. The greater the number of layers, the greater the impact of the exponential weight change from the repeating pattern of the gradients. Adding in more nodes and layers increases the complexity, and this can actually exacerbate the problem. With increased complexity there is a chance of increasing the frequency and severity of local minima, as noted by James Martens in a presentation of his from 2010 [38]. If we have a greater number of local minima, then we have a greater chance of slipping into one. The error may be large at such a minimum, but the gradient there will cause only a small change to weights near the end of the network, which means even less of a change to weights in the early layers of the network. Therefore, as stipulated by Hochreiter originally in his thesis [36], the fundamental problem with gradient descent based learning algorithms like the Backpropagation algorithm is its charac-