Although conventional computers have been shown to be effective at many demanding
tasks, they still seem unable to perform certain tasks that our brains do
so easily, such as pattern recognition and various kinds of forecasting.
That we do these tasks so easily has a lot to do with our learning
capabilities. Conventional computers do not seem to learn very well.
In January 1997, the NRC Handelsblad, in its weekly science subsection, published
a series of four columns on neural networks, a technique that overcomes some of the
above-mentioned problems. These columns aroused my interest in neural networks,
of which I knew practically nothing at the time. As I was just looking for a subject
for a paper, I decided to find out more about neural networks.
In this paper, I will start by giving a brief introduction to the theory of neural
networks. Section 2 discusses associative memory, which is a simple application of
neural networks. It is a flexible way of information storage, allowing retrieval in an
associative way.
In sections 3 to 5, general neural networks are discussed. Section 3 shows the behaviour
of elementary nets and in sections 4 and 5 this theory is extended to larger
nets. The back propagation rule is introduced and a general training algorithm is
derived from this rule.
Sections 6 to 9 deal with three applications of the back propagation network. Using
this type of net, we solve the XOR-problem and we use this technique for curve
fitting. Time series forecasting also deals with predicting function values, but is
shown to be a more general technique than the introduced technique of curve fitting.
Using these applications, I demonstrate several interesting phenomena and criteria
concerning implementing and training networks, such as stopping criteria, overtraining
and forgetting.
Finally, I'd like to thank Rob Bisseling for his supervision during the process and
Els Vermij for her numerous suggestions for improving this text.
Jeroen van Grondelle
Utrecht, July 1997
1 An introduction to neural networks
In this section a brief introduction is offered to the theory of neural networks. This
theory is based on the actual physiology of the human brain and shows a great
resemblance to the way our brains work.
The building blocks of neural networks are neurons. These neurons are nodes in
the network and they have a state that acts as output to other neurons. This state
depends on the input the neuron is given by other neurons.
Figure 1: A neuron
A neural network is a set of connected neurons. The connections are called synapses.
If two neurons are connected, one neuron takes the output of the other neuron as
input, according to the direction of the connection.
Neurons are grouped in layers. Neurons in one layer only take input from the previous
layer and give output to the next layer¹.
Every synapse is associated with a weight. This weight indicates the impact of the
output on the receiving neuron. The state of neuron $i$ is defined as:
$$s_i = f\Big(\sum_k w_{ik} r_k - \theta\Big) \tag{1}$$
where $r_k$ are the states of the neurons that give input to neuron $i$ and $w_{ik}$ represents
the weight associated with the connection. $f(x)$ is the activation function. This
function is often linear, or a sign function when we require binary output. The sign
function is generally replaced by a continuous representation of this function. The
value $\theta$ is called the threshold.
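As a small illustration of equation (1), the state of a single neuron can be sketched in Python. This is not the paper's Matlab code; the function names `neuron_state` and `sign`, and the convention that `sign(0)` returns 1, are assumptions made for this example.

```python
import math

def sign(x):
    """Binary activation; mapping 0 to +1 is an arbitrary convention chosen here."""
    return 1 if x >= 0 else -1

def neuron_state(weights, inputs, theta=0.0, f=math.tanh):
    """State of one neuron: f(sum_k w_k * r_k - theta), following equation (1)."""
    h = sum(w * r for w, r in zip(weights, inputs)) - theta
    return f(h)

# Two input neurons with states 1 and 1, weights 0.5 and -1.0, threshold 0.2.
binary = neuron_state([0.5, -1.0], [1, 1], theta=0.2, f=sign)   # sign(-0.7)
smooth = neuron_state([0.5, -1.0], [1, 1], theta=0.2)           # tanh(-0.7)
print(binary, smooth)
```

The default `f=math.tanh` stands in for the "continuous representation" of the sign function mentioned above.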
Figure 2: A single and multi-layered network
¹ Networks with connections skipping layers are possible, but we will not discuss them in this
paper.
A neural net is based on layers of neurons. Because the number of neurons is finite,
there is always an input layer and an output layer, which only give output or take
input respectively. All other layers are called hidden layers. A two-layered net is
called a simple perceptron and the other nets multi-layered perceptrons. Examples
are given in figure 2.
2 Associative memory
2.1 What is associative memory?
In general, a memory is a system that both stores information and allows us to
recall this information. In computers, a memory will usually look like an array. An
array consists of pairs $(i, \mu)$, where $\mu$ is the information we are storing, and $i$ is the
index assigned to it by the memory on storage. We can recall the information by
giving the index as input to the memory:
Figure 3: Memory recall
This is not a very flexible technique: we have to know exactly the right index to
recall the stored information.
Associative memory works much more like our mind does. If we are for instance
looking for someone's name, it will help to know where we met this person or what
he looks like. With this information as input, our memory will usually come up
with the right name. A memory is called an associative memory if it permits the
recall of information based on partial knowledge of its contents.
2.2 Implementing Associative Memory using Neural Networks
Neural networks are very well suited to create an associative memory. Say we wish
to store $p$ bitwords² of length $N$. We want to recall in an associative way, so we
want to give as input a bitword and want as output the stored bitword that most
resembles the input.
So it seems the obvious thing to do is to take an N-neuron layer as both input and
output layer and find a set of weights so that the system behaves like a memory for
the bitwords $\xi^1, \ldots, \xi^p$:
Figure 4: An associative memory configuration
If now a pattern $s$ is given as input to the system, we want $\xi^\mu$ to be the output, so
that $s$ and $\xi^\mu$ differ in as few places as possible. So we want the error $H_j$
$$H_j = \sum_{i=1}^{N} (s_i - \xi_i^j)^2 \tag{2}$$
to be minimal if $j = \mu$. This $H_j$ is called the Hamming distance³.
² For later convenience, we will work with binary numbers that consist of 1's and $-1$'s, where
$-1$ replaces the usual zero.
We will have a look at a simple case first. Say we want to store one pattern $\xi$. We
will give an expression for $w$ and check that it suits our purposes:
$$w_{ij} = \frac{1}{N}\,\xi_i \xi_j \tag{3}$$
If we give an arbitrary pattern $s$ as input, where $s$ differs in $n$ places from the stored
pattern $\xi$, we get:
$$S_i = \mathrm{sign}\Big(\sum_{j=1}^{N} w_{ij} s_j\Big) = \mathrm{sign}\Big(\frac{1}{N}\,\xi_i \sum_{j=1}^{N} \xi_j s_j\Big) \tag{4}$$
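Now examine $\sum_{j=1}^{N} \xi_j s_j$. If $s_j = \xi_j$, then $\xi_j s_j = 1$, otherwise it is $-1$. Therefore,
the sum equals $(N - n) + (-n)$, and:
$$\mathrm{sign}\Big(\frac{1}{N}\,\xi_i \sum_{j=1}^{N} \xi_j s_j\Big) = \mathrm{sign}\Big(\frac{1}{N}\,\xi_i (N - 2n)\Big) = \mathrm{sign}\Big(\Big(1 - \frac{2n}{N}\Big)\xi_i\Big) \tag{5}$$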
There are two important features to check. First, we can see that if we choose
$s = \xi$, the output will be $\xi$. This is obvious, because $\xi$ and $\xi$ differ in 0 places.
We call this stability of the stored pattern. Secondly, we want to check that if we
give an input reasonably close to $\xi$, we get $\xi$ as output. Obviously, if $n < N/2$, the
output will equal $\xi$. Then $1 - 2n/N$ is positive and does not affect the sign of $\xi_i$.
This is called convergence to a stored pattern.
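The single-pattern argument can be checked numerically. The sketch below is in Python, not the paper's Matlab; the helper names `store_one` and `recall` and the example pattern are assumptions made for illustration. It stores one pattern with (3), applies (4), and shows stability, convergence for $n < N/2$, and what happens when more than half of the bits are wrong.

```python
def sign(x):
    """Sign function; the value at 0 is an arbitrary convention (never hit here)."""
    return 1 if x >= 0 else -1

def store_one(xi):
    """Weights w_ij = xi_i * xi_j / N for a single stored pattern, equation (3)."""
    n = len(xi)
    return [[xi[i] * xi[j] / n for j in range(n)] for i in range(n)]

def recall(w, s):
    """Output S_i = sign(sum_j w_ij * s_j), equation (4)."""
    return [sign(sum(wij * sj for wij, sj in zip(row, s))) for row in w]

def flip(pattern, positions):
    """Copy of pattern with the bits at the given positions negated."""
    return [-b if k in positions else b for k, b in enumerate(pattern)]

xi = [1, -1, 1, 1, -1, -1, 1, -1]          # N = 8, arbitrary example pattern
w = store_one(xi)

print(recall(w, xi) == xi)                                   # stability (n = 0)
print(recall(w, flip(xi, {0, 3, 5})) == xi)                  # n = 3 < N/2
print(recall(w, flip(xi, {0, 1, 2, 3, 5})) == flip(xi, set(range(8))))  # n = 5 > N/2
```

With $n = 5$ the factor $1 - 2n/N$ is negative, so the net settles on the inverted pattern instead of the stored one.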
We now want to store all the words $\xi^1, \ldots, \xi^p$. And again we will give an expression
and prove that it serves our purpose. Define
$$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu \tag{6}$$
The method will be roughly the same. We will not give a full proof here. This would
be too complex and is not of great importance to our argument. What is important
is that we are proving stability of stored patterns and the convergence to a stored
pattern of other input patterns. We did this for the case of one stored pattern. The
method for multiple stored patterns is similar. Only, proving the error terms to be
small enough will take some advanced statistics. Therefore, we will prove up to the
error terms here and then quote [Muller]:
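Because the problem is becoming a little more complex now, we will discuss the
activation value for an arbitrary output neuron $i$, usually referred to as $h_i$. First
we will look at the output when a stored pattern (say $\xi^\nu$) is given as input:
$$h_i = \sum_{j=1}^{N} w_{ij} \xi_j^\nu = \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu = \frac{1}{N} \Big( \sum_{j=1}^{N} \xi_i^\nu \xi_j^\nu \xi_j^\nu + \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu \Big) \tag{7}$$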
³ Actually, this is the Hamming distance when bits are represented by 0's and 1's. The square
then acts as absolute-value operator. With bits represented by 1's and $-1$'s, every differing
position contributes 4, so we should scale results by a constant factor 0.25 to obtain
the Hamming distance.
The first part of the last expression is equal to $\xi_i^\nu$ due to similar arguments as in
the previous one-pattern case. The second expression is dealt with using laws of
statistics, see [Muller].
Now we give the system an input $s$ where $n$ neurons start out in the wrong state.
Then generalizing (7) similar to (5) gives:
$$h_i = \Big(1 - \frac{2n}{N}\Big)\xi_i^\nu + \frac{1}{N} \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu s_j \tag{8}$$
The first term is equal to that of the single-pattern storage case. And the second is
again proven small by [Muller]. Moreover, it is proven that
$$h_i = \Big(1 - \frac{2n}{N}\Big)\xi_i^\nu + O\Big(\sqrt{\frac{p-1}{N}}\Big) \tag{9}$$
So if $p \ll N$ the system will still function as a memory for the $p$ patterns. In
[Muller], it is proven that as long as $p < 0.14N$ the system will function well.
2.3 Matlab functions implementing associative memory
In Appendix A.1 two Matlab functions are given for both storing and recalling
information in an associative memory as described above. Here we will make some
short remarks on how this is done.
2.3.1 Storing information
The assostore-function works as follows:
The function gets a binary matrix S as input, where the rows of S are the patterns
to store. After determining its size, the program fills a matrix w with zeros. The
values of S are transformed from (0,1) to (-1,1) notation. Now all values of w are
computed using (6). This formula is implemented using the inner product of two
columns in S . The division by N is delayed until the end of the routine.
2.3.2 Recalling information
assorecall.m is also a straightforward implementation of the procedure described
above. After transforming from (0,1) to (-1,1) notation, s is computed as w times
the transposed input. The sign of this s is transformed back to (0,1) notation.
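The two routines described above can be sketched as follows. This is a Python approximation written for illustration, not the Matlab source of Appendix A.1; the names `asso_store` and `asso_recall`, the list-of-lists representation, and the two example patterns are assumptions.

```python
def asso_store(patterns):
    """Weights from equation (6). patterns: rows of 0/1 bits, one pattern per row."""
    n = len(patterns[0])
    xis = [[2 * b - 1 for b in row] for row in patterns]     # (0,1) -> (-1,1)
    w = [[0.0] * n for _ in range(n)]
    for xi in xis:
        for i in range(n):
            for j in range(n):
                w[i][j] += xi[i] * xi[j]
    return [[v / n for v in row] for row in w]                # divide by N at the end

def asso_recall(w, bits):
    """Recall: s = w times the input, then the sign mapped back to (0,1) notation."""
    s = [2 * b - 1 for b in bits]
    h = [sum(wij * sj for wij, sj in zip(row, s)) for row in w]
    return [1 if hi >= 0 else 0 for hi in h]

# Two example patterns of length N = 16 (chosen to be orthogonal in -1/1 notation).
a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
w = asso_store([a, b])

noisy_a = [0] + a[1:]                                         # first bit corrupted
print(asso_recall(w, a) == a, asso_recall(w, b) == b)         # stored patterns are stable
print(asso_recall(w, noisy_a) == a)                           # corrupted input is repaired
```

Here $p = 2$ and $N = 16$, which stays under the $p < 0.14N$ bound quoted from [Muller].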
3 The perceptron model
3.1 Simple perceptrons
In the previous section, we have been looking at two-layered networks, which are
also known as simple perceptrons. We did not really go into the details. An expres-
sion for w was given and we simply checked that it worked for us. In this section
we will look closer at what these simple perceptrons really do.
Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in figure 5.
Figure 5: A (2,1) simple perceptron
This net has only two synapses, with weights $w_1$ and $w_2$, and we assume $S_1$ has
threshold $\theta$. We allow inputs from the reals and take as activation function the
sign-function. Then the output is given by:
$$S_1 = \mathrm{sign}(w_1 s_1 + w_2 s_2 - \theta) \tag{10}$$
There is also another way of looking at $S_1$. The inner product of $w$ and $s$ actually
defines the direction of a line in the input space. $\theta$ determines the location of this
line and taking the sign over this expression determines whether the input is on one
side of the line or at the other side. This can be seen more easily if we rewrite (10) as:
$$S_1 = \begin{cases} 1 & \text{if } w_1 s_1 + w_2 s_2 > \theta \\ -1 & \text{if } w_1 s_1 + w_2 s_2 < \theta \end{cases} \tag{11}$$
So (2,1) simple perceptrons just divide the input space in two and return 1 at one
half and -1 at the other. We visualize this in figure 6:
Figure 6: A simple perceptron dividing the input space
We can of course generalize this to (n,1) simple perceptrons, in which case the
perceptron defines an (n-1)-dimensional hyperplane in the n-dimensional input space.
The hyperplane view of simple perceptrons also allows looking at not too complex
multi-layered nets. As we saw before, every neuron in the first hidden layer is an
indicator of a hyperplane. But the next hidden layer again consists of indicators of
hyperplanes, defined this time on the output of the first hidden layer. Multi-layered
nets soon become far too complex to study in such a concrete way. In the literature
we see that multi-layered nets are often regarded as black boxes. You know what
goes in, you train until the output is right and you do not bother about the exact
actions inside the box. But for relatively small nets, it can be very interesting to
study the exact mechanism, as it can show whether or not a net is able to do the
required job. This is exactly what we will do in the next subsection.
3.2 The XOR-problem
As we have seen, simple perceptrons are quite easy to understand and their be-
haviour is very well modelled. We can visualize their input-output relation through
the hyperplane method.
But simple perceptrons are very limited in the sort of problems they can solve.
If we look for instance at logical operators, we can instantly see one of their limits.
Although a simple perceptron is able to adopt the input-output relation of both
the OR and AND operator, it is unable to do the same for the Exclusive-Or gate, the
XOR-gate:
s1 s2 S
-1 -1 -1
1 -1 1
-1 1 1
1 1 -1
Table 1: The truth table of the XOR-function
We examine first the AND-implementation on a simple perceptron. The input-output
relation would be:
Figure 7: Input-output relation for the AND-gate
Here the input is on the axes, and a black dot means output 1 and a white dot
means output $-1$. As we have seen in section 3.1, a simple perceptron will define
a hyperplane, returning 1 at one side and -1 at the other. In figure 8, we choose
a hyperplane for both the AND and the OR-gate input space. We immediately see
why a simple perceptron will never simulate an XOR-gate, as this would take two
hyperplanes, which a simple perceptron cannot define.
Figure 8: A hyperplane choice for all three gates (panels: AND, OR, XOR)
It is now almost trivial to find the simple perceptron solution to the first two gates.
Obviously, $(w_1, w_2) = (1, 1)$ defines the direction of the chosen line. It follows that
for the AND-gate $\theta = 1$ works well. In the same way we compute values for the
OR-gate: $w_1 = 1$, $w_2 = 1$ and $\theta = -1$.
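The two gate solutions can be verified by evaluating (10) on all four inputs. The following is a short Python sketch; the function name `perceptron` is an assumption, and the weights and thresholds are the ones given above.

```python
def perceptron(w, theta, s):
    """Simple perceptron output sign(w . s - theta), with values in {-1, 1}."""
    h = sum(wi * si for wi, si in zip(w, s)) - theta
    return 1 if h > 0 else -1

inputs = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

and_out = [perceptron((1, 1), 1, s) for s in inputs]    # AND-gate: theta = 1
or_out = [perceptron((1, 1), -1, s) for s in inputs]    # OR-gate: theta = -1

print(and_out)   # [-1, -1, -1, 1]
print(or_out)    # [-1, 1, 1, 1]
```

Both gates share the same weight vector; only the threshold, and with it the location of the dividing line, differs.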
When neural nets were only just invented and these obvious limits were discovered,
most scientists regarded neural nets as a dead end. If problems this simple could
not be solved, neural nets were never going to be very useful. The answer to these
limits was multi-layered nets.
3.3 Solving the XOR-problem using multi-layered perceptrons
Although the XOR-problem cannot be solved by simple perceptrons, it is easy to
show that it can be solved by a (2,2,1) perceptron. We could prove this by giving
a set of suitable synapses and prove its functioning. We could also go deeper into
the hyperplane method. Instead of these options, we will use some logical rules and
express the XOR operator in terms of OR and AND operators, which we have seen we
can handle. It can be easily shown that:
$$(s_1 \;\mathrm{XOR}\; s_2) \Leftrightarrow (s_1 \wedge \neg s_2) \vee (\neg s_1 \wedge s_2) \tag{12}$$
We have neural net implementations of the OR and AND operator. Because we are
using 1 and -1 as logical values, $\neg s_1$ is equal to $-s_1$. This makes it easy to put $s_1$
and $\neg s_2$ in a neural AND-gate. We will just negate the synapse that leads from $\neg s_2$ to
$S_1$ and use $s_2$ as input instead of $\neg s_2$. This suggests the following (2,2,1)-solution:
The input layer is used as usual and feeds the hidden layer, consisting of hs1 and
hs2. These function as AND-gates as indicated in (12). S, the only element in the
output layer, implements the OR-symbol in (12).
By writing down the truth table for the system, it can easily be shown that the
given net is correct.
Figure 9: A (2,2,1) solution of the XOR-gate
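The truth table can also be generated mechanically. The Python sketch below builds the net from (12): two hidden AND-gates, each with one negated synapse and threshold 1, feeding an OR-gate with threshold $-1$. The function names and the exact weight layout are assumptions derived from the gate solutions of section 3.2, not read off figure 9.

```python
def perceptron(w, theta, s):
    """Simple perceptron output sign(w . s - theta), with values in {-1, 1}."""
    h = sum(wi * si for wi, si in zip(w, s)) - theta
    return 1 if h > 0 else -1

def xor_net(s1, s2):
    """(2,2,1) net for XOR, following equation (12)."""
    hs1 = perceptron((1, -1), 1, (s1, s2))     # s1 AND (NOT s2)
    hs2 = perceptron((-1, 1), 1, (s1, s2))     # (NOT s1) AND s2
    return perceptron((1, 1), -1, (hs1, hs2))  # hs1 OR hs2

inputs = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
xor_out = [xor_net(s1, s2) for s1, s2 in inputs]
print(xor_out)   # [-1, 1, 1, -1], the S column of table 1
```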
4 Multi-layered networks
In the previous section, we studied a very specific case of multi-layered networks. We
could determine its synaptic strengths because it was a combination of several simple
perceptrons, which we studied quite thoroughly before, and because we could reduce
the original problem to several subproblems that we already solved using neural nets.
In the preface, several tasks were mentioned such as character recognition, time
series forecasting, etc. These are all very demanding tasks, which need considerably
larger nets. These tasks are also problems we do not understand so well. So we are
not able to define subproblems, which we could solve first. The strong feature of
neural nets that we are going to use here is that, by training, the net will learn the
input-output relation we are looking for. We are not concerned with the individual
function of neurons in this section; we will consider the net as the earlier mentioned
black box.
Let us discuss a concrete example here. A widely used application of neural nets is
that of character recognition. The input of our black box could then be for instance
an 8 x 8 matrix of ones and zeros, representing a scan of a character. The output
could consist of 26 neurons, representing the 26 characters of the alphabet.
Since we do not have a concrete solution in for instance hyperplane or logical terms
to implement in a net, we choose more or less at random a net configuration and
synaptic strengths for this net. Not all net configurations are able to learn all problems
(we have seen a very obvious example of that before) but there are guidelines
and rough estimations on how large a net has to be. We will not go into that right
now.
Given our net, every scan given as input will result in an output. It is not very likely
that this net will do what we want from the start, since we initiated it randomly. It
all depends on finding the right values for the synaptic strengths. We need to train
the net. We give it an input and compare the output with the result we wanted
to get. And then we will adjust the synaptic strengths. This is done by learning
algorithms, of which the earlier mentioned Back-Propagation rule is an example.
We will discuss the BPN-rule later.
By repeating this procedure often with different examples, the net will learn to give
the right output for a given input.
We have mentioned the word training several times now. It refers to the situation
where we show the system several inputs and provide the required output as well.
The net is then adjusted. By doing this the net learns.
The content of the training set is of crucial importance. First of all, it has to be
large enough. In order to get the system to generalize, a large set of examples has
to be available. Probably, a network trained with a small set will behave like a
memory, but a limited training set will never evoke the behaviour we are looking
for: adopting an error-tolerant, generalizing input-output relation.
The set also has to be sufficiently rich. The notion we want the neural net to recognize
has to be the only notion that is present everywhere in our training set. As
this may sound a bit vague, an example might be necessary. If we have a set of
pictures of blond men and dark women, we could teach a neural net to determine
the sex of a person. But it might very well be that on showing this trained system
a blond girl, the net would say it's a boy. There are obviously two notions present
here: someone's sex and the colour of his or her hair.
In the theory of neural nets, one comes across more of these rather vague problems.
The non-deterministic nature of training means that trained systems can get overtrained
and can even forget. We will not pay too much attention to these phenomena
now. We will discuss them later, when we have practical examples to illustrate them.
There is an aspect of learning that we have not yet discussed. We defined training
as adjusting a neural net to the right input-output relation. This relation is then
defined by the training set. This suggests that we train the network to give the
right output at every input from the training set.
If this is all that the system can achieve, it would be nothing more than a memory,
which we discussed in section 2. We also want the system to give output on input
that is not in the training set. And we want this output to be correct. By giving
the system a training set, we want the system to learn about other inputs as well.
Of course these will have to be close enough to the ones in the training set.
The right network configuration is crucial for the system to learn to generalize. If
the network is too large, it will be able to memorize the training set. If it is too
small, it simply will not be able to master the problem.
So configuring a net is very important. There are basically two ways of achieving
the right size. One is to begin with a rather big net. After some training, the
non-active neurons and synapses are removed, thus leaving a smaller net, which can
be further trained. This technique is called pruning. The other way is rather the
opposite. Start with a small net and enlarge it if it does not succeed in solving the
problem. This guarantees that you get the smallest net that does the job. But you
will have to train a whole new net every time you add some neurons.
5 The Back-Propagation Network
5.1 The idea of a BPN
In the previous section we mentioned a learning algorithm. This algorithm updated
the synaptic strengths after comparing an expected output with an actual output.
The algorithm should alter the weights to minimize the error next time.
One of the algorithms developed is the Error Back-Propagation algorithm. This
is the algorithm we will describe here and implement in the next section. We will
discuss a specific case in detail. We will derive and implement this rule for a three-layer
network.
Figure 10: The network configuration we will solve (input neurons $x_1, \ldots, x_N$; hidden
neurons $i_1, \ldots, i_L$ with $i_L = f_L(h_L)$; output neurons $o_1, \ldots, o_M$ with $o_M = f_M(h_M)$)
We want to minimize the error between expected output $y$ and actual output $o$.
From now on we will be looking at a fixed training-set pair: an input vector $x$ and
an expected output $y$. The actual output $o$ is the output that the net gives for the
input $x$.
We define the total error:
$$E = \frac{1}{2} \sum_{k=1}^{M} \delta_k^2 \tag{13}$$
where $\delta_k$ is the difference between the expected and actual output of output neuron
$k$: $\delta_k = (y_k - o_k)$.
Since all the information of the net is in its weights, we could look at $E$ as a function
of all its weights. We could regard the error to be a surface in $W \times \mathbb{R}$, where $W$ is
the weights space. This weights space has as dimension the number of synapses in
the entire network. Every possible state of this network is represented by a point
$(w^h, w^o)$ in $W$.
Now we can look at the derivative of $E$ with respect to $W$. This gives us the gradient
of $E$, which always points in the direction of steepest ascent of the surface. So
$-\mathrm{grad}(E)$ points in the direction of steepest descent. Adjusting the net to a point
$(w^h, w^o)$ in the direction of $-\mathrm{grad}(E)$ ensures that the net will perform better next
time. This procedure is visualized in figure 11.
Figure 11: The error as a function of the weights
5.2 Updating the output-layer weights
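We will calculate the gradient of $E$ in two parts and start with the output-layer
weights:
$$\frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k)\,\frac{\partial f^o_k}{\partial (h^o_k)}\,\frac{\partial (h^o_k)}{\partial w^o_{kj}} \tag{14}$$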
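Because we have not yet chosen an activation function $f$, we cannot yet evaluate
$\partial f^o_k / \partial (h^o_k)$. We will refer to it as $f^{o\prime}_k(h^o_k)$. What we do know is:
$$\frac{\partial (h^o_k)}{\partial w^o_{kj}} = \frac{\partial}{\partial w^o_{kj}} \Big( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \Big) = i_j \tag{15}$$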
Combining the previous equations gives:
$$\frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k)\, f^{o\prime}_k(h^o_k)\, i_j \tag{16}$$
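Now we want to change $w^o_{kj}$ in the direction of $-\partial E / \partial w^o_{kj}$. We define:
$$\delta^o_k = (y_k - o_k)\, f^{o\prime}_k(h^o_k) \tag{17}$$
Then we can update $w^o$ according to:
$$w^o_{kj}(t + 1) = w^o_{kj}(t) + \eta\, \delta^o_k\, i_j \tag{18}$$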
where $\eta$ is called the learning-rate parameter. $\eta$ determines the learning speed, the
extent to which $w$ is adjusted in the gradient's direction. If it is too small, the
system will learn very slowly. If it is too big, the algorithm will adjust $w$ too strongly
and the optimal situation will not be reached. The effects of different values of $\eta$
are discussed further in section 7.
5.3 Updating the hidden-layer weights
To update the hidden-layer weights we will follow a procedure roughly the same as
in section 5.2. In section 5.2 we looked at E as a function of the output-neuron
values. Now we will look at $E$ as a function of the hidden-neuron values $i_j$.
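$$E = \frac{1}{2} \sum_k (y_k - o_k)^2 = \frac{1}{2} \sum_k \big(y_k - f^o_k(h^o_k)\big)^2 = \frac{1}{2} \sum_k \Big(y_k - f^o_k\Big(\sum_j w^o_{kj} i_j + \theta^o_k\Big)\Big)^2$$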
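And now we examine $\partial E / \partial w^h_{ji}$:
$$\frac{\partial E}{\partial w^h_{ji}} = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_k - o_k)^2 = -\sum_k (y_k - o_k)\, \frac{\partial o_k}{\partial h^o_k}\, \frac{\partial h^o_k}{\partial i_j}\, \frac{\partial i_j}{\partial h^h_j}\, \frac{\partial h^h_j}{\partial w^h_{ji}}$$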
We can deal with these four derivatives the same way as in section 5.2. The first and
the third are clearly equal to the unknown derivatives of $f$. The second is equal to:
$$\frac{\partial (h^o_k)}{\partial i_j} = \frac{\partial}{\partial i_j} \Big( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \Big) = w^o_{kj}$$
For the same reason, the last derivative is $x_i$. So we have:
$$\frac{\partial E}{\partial w^h_{ji}} = -\sum_k (y_k - o_k)\, f^{o\prime}_k(h^o_k)\, w^o_{kj}\, f^{h\prime}_j(h^h_j)\, x_i$$
We define a $\delta^h$ similar to the one in (17):
$$\delta^h_j = f^{h\prime}_j(h^h_j) \sum_k (y_k - o_k)\, f^{o\prime}_k(h^o_k)\, w^o_{kj} = f^{h\prime}_j(h^h_j) \sum_k \delta^o_k\, w^o_{kj} \tag{21}$$
Looking at the definition of $\delta^h$, we can see that updating $w^h_{ji}$ in the direction of
$-\partial E / \partial w^h_{ji}$ is equal to:
$$w^h_{ji}(t + 1) = w^h_{ji}(t) + \eta\, \delta^h_j\, x_i \tag{22}$$
where $\eta$ is again the learning parameter.
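The update rules (17), (18), (21) and (22) can be put together in a few lines. The sketch below is an illustration in Python, not the paper's Matlab implementation. It assumes tanh as activation function in both layers (so that $f' = 1 - f^2$), keeps the thresholds fixed, and uses arbitrary starting weights; the function names `forward`, `total_error` and `bpn_step` are hypothetical.

```python
import math

def forward(wh, th_h, wo, th_o, x):
    """Forward pass; returns hidden states i and outputs o (tanh assumed)."""
    i = [math.tanh(sum(w * v for w, v in zip(row, x)) + t) for row, t in zip(wh, th_h)]
    o = [math.tanh(sum(w * v for w, v in zip(row, i)) + t) for row, t in zip(wo, th_o)]
    return i, o

def total_error(wh, th_h, wo, th_o, x, y):
    """E = 1/2 * sum_k (y_k - o_k)^2, as in equation (13)."""
    _, o = forward(wh, th_h, wo, th_o, x)
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))

def bpn_step(wh, th_h, wo, th_o, x, y, eta):
    """One weight update following (17), (18), (21), (22); thresholds stay fixed."""
    i, o = forward(wh, th_h, wo, th_o, x)
    # (17): delta_k^o = (y_k - o_k) * f'(h_k^o); for tanh, f' = 1 - o_k^2
    d_o = [(yk - ok) * (1 - ok ** 2) for yk, ok in zip(y, o)]
    # (21): delta_j^h = f'(h_j^h) * sum_k delta_k^o * w_kj^o
    d_h = [(1 - i[j] ** 2) * sum(d_o[k] * wo[k][j] for k in range(len(wo)))
           for j in range(len(i))]
    new_wo = [[wo[k][j] + eta * d_o[k] * i[j] for j in range(len(i))]
              for k in range(len(wo))]                         # (18)
    new_wh = [[wh[j][m] + eta * d_h[j] * x[m] for m in range(len(x))]
              for j in range(len(wh))]                         # (22)
    return new_wh, new_wo, d_o, d_h, i

wh, th_h = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]   # arbitrary example weights
wo, th_o = [[0.5, -0.6]], [0.2]
x, y = [1.0, 0.5], [1.0]

before = total_error(wh, th_h, wo, th_o, x, y)
new_wh, new_wo, d_o, d_h, i = bpn_step(wh, th_h, wo, th_o, x, y, eta=0.1)
after = total_error(new_wh, th_h, new_wo, th_o, x, y)
print(before, after)   # the error on this pair is smaller after the step
```

Note that the hidden-layer deltas are computed with the old output weights, before (18) is applied.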
6 A BPN algorithm
In the next sections we will demonstrate a few phenomena as described in section 4,
using an application of a (2,2,1) back-propagation network. We have seen this relatively
simple network before in subsection 3.3. The XOR-gate described there will
be the first problem we solve with an application of the BPN. In this section we will
formulate a general (2,2,1)-BPN training algorithm.
6.1 Choice of the activation function
Since we will be simulating the XOR-gate, which has outputs $-1$ and 1 only, it is an
obvious choice to use a sigmoidal activation function. We will use $f(x) = \tanh(x)$.
Figure 12: A sigmoidal activation function (curves: f(x) = tanh(x) and df/dx = 1 − tanh^2(x))
We will also need its derivative. Since $\tanh(x) = \sinh(x)/\cosh(x)$, we have:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
Differentiating this expression yields:
$$\tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)$$
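The identity $\tanh'(x) = 1 - \tanh^2(x)$ is easy to check numerically. A minimal Python sketch, where `tanh_prime` and `numeric_derivative` are names chosen for this example and the step size is an arbitrary choice:

```python
import math

def tanh_prime(x):
    """Analytic derivative of tanh: 1 - tanh^2(x)."""
    return 1.0 - math.tanh(x) ** 2

def numeric_derivative(f, x, eps=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_prime(x), numeric_derivative(math.tanh, x))
```

Because the derivative can be written in terms of the function value itself, a training routine can reuse the neuron states from the forward pass instead of evaluating tanh again.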
6.2 Configuring the network
We are going to use a three-layer net, with two input neurons, two hidden neurons
and one output neuron. As we have already chosen the activation function, we
now only have to decide how to implement the thresholds. In section 5 we did not
mention them. This was not necessary, since we will show here that they are easily
treated as ordinary weights.
We add a special neuron to both the input and the hidden layer and we define the
state of this neuron equal to 1. This neuron therefore takes no input from previous
layers, since they would have no impact anyway. The weight of a synapse between
this special neuron and one in the next layer then acts as the threshold for this
neuron. When the activation value for a neuron is computed, it now looks like:
$$h_j = \sum_{i=1}^{k} w_{ij} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{ij} s_i$$
Neuron $k + 1$ is the special neuron that always has a state equal to 1.
In figure 13, we give an example of such a net.
Figure 13: A (2,2,1) neural net with weights as thresholds
This approach enables us to implement the network by using the techniques from
section 5, without paying special attention to the thresholds.
6.3 An algorithm: train221.m
Given the above-mentioned choices and the explicit method described in section 5,
we can now implement a training function for the given situation. Appendix A.3
gives the source of train221.m. This function is used as follows:
[WH,WO,E] = train221(Wh,Wo,x,y,eta)
where the inputs Wh and Wo represent the current weights in the network, (x,y) is
a training input-output pair and eta is the learning parameter.
The outputs WH and WO are the updated weight matrices and E is the error, as
computed before the update.
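To make the interface concrete, here is a rough Python analogue of such a training function. It is not a transcription of the Matlab source in Appendix A.3: the weight layout (the last weight in each row belongs to the special neuron with state 1), the use of $E = \frac{1}{2}(y - o)^2$ for the returned error, and the starting weights are assumptions made for this sketch.

```python
import math

def train221(wh, wo, x, y, eta):
    """One back-propagation step for a (2,2,1) tanh net; returns (WH, WO, E).

    wh: two rows of three weights (x1, x2, special neuron) for the hidden layer.
    wo: three weights (i1, i2, special neuron) for the output neuron.
    """
    xa = list(x) + [1.0]                                   # special neuron, state 1
    i = [math.tanh(sum(w * v for w, v in zip(row, xa))) for row in wh]
    ia = i + [1.0]
    o = math.tanh(sum(w * v for w, v in zip(wo, ia)))
    err = 0.5 * (y - o) ** 2                               # error before the update
    d_o = (y - o) * (1 - o * o)                            # output delta
    d_h = [(1 - i[j] ** 2) * d_o * wo[j] for j in range(2)]  # hidden deltas
    new_wo = [wo[j] + eta * d_o * ia[j] for j in range(3)]
    new_wh = [[wh[j][m] + eta * d_h[j] * xa[m] for m in range(3)] for j in range(2)]
    return new_wh, new_wo, err

# Usage: repeatedly show all four XOR pairs (hypothetical fixed starting weights).
xor_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
wh = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]
wo = [0.7, -0.5, 0.2]
for lap in range(200):
    errors = []
    for x, y in xor_set:
        wh, wo, e = train221(wh, wo, x, y, eta=0.2)
        errors.append(e)
print(sum(errors))   # summed error over the last lap
```

Treating the thresholds as weights of the special neuron means they are trained by the same update rule as every other synapse.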
7 Application I: the XOR-gate
7.1 Results and performance
We will now use the algorithm to solve the XOR-gate problem. First, we define our
training set:
$$S_T = \{(0, 0, 0),\ (0, 1, 1),\ (1, 0, 1),\ (1, 1, 0)\}$$
The elements of this set are given as input to the training algorithm introduced in
the previous section. This is done by a special m-file, which also stores the error
terms. These error terms enable us to analyse the training behaviour of the net. In
the rest of the section, we will describe several phenomena, using the information
the error terms give us.
When looking at the performance of the net, we can look, for instance, at the error
of the net on an input-output pair of the training set, $(x_i, y_i)$:
$$E_i = (y_i - o_i)^2$$
with $y_i$ as the expected output and $o_i$ as the output of the net with $x_i$ as input.
A measure of performance on the entire training set is the Sum of Squared
Errors (SSE):
$$SSE = \sum_i E_i$$
Clearly, the SSE is an upper bound for every $E_i$. We will use this frequently when
examining the net's performance. If we want the error on every training set element
to converge to zero, we just compute the SSE and check that it does this.
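As a small worked example of these two measures, the Python sketch below computes the $E_i$ and the SSE for four made-up net outputs (the numbers are invented for illustration and are not results from the paper) and checks the upper-bound property.

```python
def pair_error(y, o):
    """Error on one training pair: E_i = (y_i - o_i)^2."""
    return (y - o) ** 2

def sse(targets, outputs):
    """Sum of squared errors over the whole training set."""
    return sum(pair_error(y, o) for y, o in zip(targets, outputs))

targets = [0, 1, 1, 0]              # expected XOR outputs
outputs = [0.1, 0.8, 0.9, 0.2]      # hypothetical outputs of a partly trained net

errors = [pair_error(y, o) for y, o in zip(targets, outputs)]
total = sse(targets, outputs)
print(errors, total)
print(all(e <= total for e in errors))   # the SSE bounds every E_i from above
```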
Now we will have a first look at the results of training the net on $S_T$. Figure 14
shows some of the results:
eta   #iters   E1            E2       E3       E4            SSE
0.2   100      0.0117        0.1694   0.1077   0.4728        0.7615
      200      0.0003        0.0105   0.0110   0.0009        0.0226
      300      2.6 x 10^-5   0.0032   0.0032   0.0001        0.0065
      400      8.4 x 10^-6   0.0018   0.0018   2.3 x 10^-5   0.0036
0.4   100      0.0037        0.0373   0.0507   0.0138        0.1055
      200      0.0004        0.0031   0.0032   0.0082        0.0149
      300      2.3 x 10^-6   0.0013   0.0013   0.0029        0.0055
      400      0.0001        0.0008   0.0008   0.0019        0.0036
Figure 14: Some training results
As we see in figure 14, both training sessions are successful, as the SSE becomes
very small. We see that with larger $\eta$, SSE converges to zero faster. This suggests
taking large values for $\eta$. To see if this strategy would be successful, we repeat the
experiment with various values of $\eta$.
In figure 15, the SSE is plotted versus the number of training laps for various $\eta$.
We can see that, for $\eta = 0.2$, the SSE converges to zero. For $\eta = 0.4$, SSE converges
faster, but less smoothly. After 150 trainings, the SSE has a little peak. Taking
larger $\eta$, as suggested above, does not seem very profitable. When $\eta$ is 0.6, SSE has
strong oscillations and with $\eta = 0.8$, SSE does not even converge to zero.
Figure 15: SSE vs. number of training laps, for various $\eta$ (curves for $\eta$ = 0.1, 0.2, 0.4, 0.6, 0.8)
This non-convergence for large $\eta$ can be explained by the error-surface view as used
in section 5. We regard the error as function of all the weights. This leads to an error
surface on the weights space. We used the gradient of $E$ in this space to minimize
the error. $\eta$ expresses the extent to which we change the weights in the direction of
the opposite of the gradient. In this way we hope to find a minimum in this space.
If $\eta$ is too large, we can jump over this minimum, approach it from the other side
and jump over it again. Thus, we will never reach it and the error will not converge.
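The jumping behaviour can be reproduced on a toy error surface. The Python sketch below does not use the XOR net; it assumes the one-dimensional error $E(w) = w^2$ with gradient $2w$, chosen only because its minimum at $w = 0$ is known, and the learning rates are picked to show the three regimes.

```python
def descend(eta, w0=1.0, steps=20):
    """Gradient descent on the toy error E(w) = w^2; returns the visited weights."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - eta * 2 * w        # step against the gradient dE/dw = 2w
        path.append(w)
    return path

for eta in (0.1, 0.9, 1.1):
    path = descend(eta)
    print(eta, path[1], abs(path[-1]))
```

For $\eta = 0.1$ the weight approaches the minimum from one side. For $\eta = 0.9$ every step lands on the other side of the minimum, although the distance still shrinks. For $\eta = 1.1$ every jump overshoots further than the previous one and the error grows.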
The conclusion seems to be that the choice of $\eta$ is important. If it is too small, the
network will learn very slowly. Larger $\eta$ lead to faster learning, but the network
might not reach the optimal solution.
We have now trained a network to give the outputs at inputs from the training set.
And in this specific case, these are the only inputs we are interested in. But the
net does give outputs on other inputs. Figure 16 shows the output on inputs in
the square between the training-set inputs. The graph clearly shows the XOR-gate's
outputs on the four corners of the surface.
In this case, we were only interested in the training-set elements. What the net does
by representing these four states right, is actually only remembering by learning.
Later, we will be looking at cases where we are interested in the outputs on inputs
outside the training set. Then we are investigating the generalizing capabilities of
the net.
Figure 16: The output of the XOR-net
7.2 Some criteria for stopping training
When using neural networks in applications, we will in general not be interested
in all the SSE curves etc. In these cases, training is just a way to get a well-performing
network, which, after stopping training⁴, can be used for the required
purposes. There are several criteria to stop training.
7.2.1 Train until SSE $\leq a$
A very obvious method is to choose a value $a$ and stop training as soon as the SSE
gets below this value. In the examples of possible SSE-curves we have seen so far,
the SSE, for suitable $\eta$, converges more or less monotonically to zero. So it is bound
to decrease below any value required.
Choosing this value depends on the accuracy you demand. As we saw before, the
SSE is an upper bound for the $E_i$, which was the square of $y - o$. So if we tolerate
a difference of $c$ between the expected output and the net's output, we want:
$$E_i \leq c^2 \quad \forall i$$
Since SSE is an upper bound, we could use SSE $\leq c^2$ as a stopping criterion.
The advantage of this criterion is that you know a lot about the performance of
the net if training is stopped by it. A disadvantage is that training might not stop
in some situations. Some situations are too complex for a net to reach the given
accuracy.
Figure 17: Stopping training after 158 laps, when SSE $\leq$ 0.1
⁴ Unless the input-output relation is changing through time and we will have to continue training
on the new situations.
7.2.2 Finding an SSE-minimum
The disadvantage of the previous method suggests another method. If SSE does not
converge to zero, we want to at least stop training at its minimum. We might train
a net for a very long time, plot the SSE and look for its global minimum. Then we
retrain the net under the same circumstances and stop at this optimal point. This
is not realistic however, since training in complex situations can take a considerable
amount of time and complete retraining would take too long.
Another approach is to stop training as soon as the SSE starts growing. For small
$\eta$, this might work, since we noticed before that choosing a small $\eta$ leads to very
smooth and monotonic SSE-curves. But there is still a big risk of ending up in a
local SSE-minimum. Training would stop just before a little peak in the SSE-curve,
although continued training would soon lead to even better results.
The advantages are obvious: given a complex situation with non-convergent SSE, we
still reach a relatively optimal situation. The disadvantages are obvious too. This
method might very well lead to suboptimal stopping, although we can limit this
risk by choosing a small $\eta$ and maybe combining the two previous techniques: train
the network through the first phase with the first criterion and then find a minimum
in the second, smoother phase with the second criterion.
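The combination of the two criteria can be sketched as follows. This is only one possible reading of the procedure: the names `stop_at_first_rise` and `combined_stop`, the fallback used when the threshold is never reached, and the SSE values in `curve` are assumptions of this sketch.

```python
def stop_at_first_rise(sse_curve, start=0):
    """Last epoch before the SSE first increases, searching from `start` onwards."""
    for epoch in range(start + 1, len(sse_curve)):
        if sse_curve[epoch] > sse_curve[epoch - 1]:
            return epoch - 1
    return len(sse_curve) - 1


def combined_stop(sse_curve, a):
    """Phase 1: wait until SSE <= a. Phase 2: stop at the first minimum after that."""
    for epoch, value in enumerate(sse_curve):
        if value <= a:
            return stop_at_first_rise(sse_curve, start=epoch)
    # Assumption: if the threshold is never reached, use the second criterion alone.
    return stop_at_first_rise(sse_curve)


curve = [2.0, 1.5, 1.7, 0.9, 0.5, 0.3, 0.25, 0.27, 0.2]  # made-up SSE values
early = stop_at_first_rise(curve)    # fooled by the bump in the rough first phase
later = combined_stop(curve, 1.0)    # skips that bump, stops at the next minimum
```

In this made-up curve the combined criterion gets past the early bump, but it still stops at a local minimum although the SSE drops further afterwards, which is the risk described above.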
7.3 Forgetting
Forgetting is another phenomenon we will demonstrate here. So far our training
has consisted of successively showing the net all the training-set elements an equal
number of times. We will show that this is very important.
Figure 19 shows the error during training on all the individual training-set elements.
Figure 18: Finding an SSE-minimum
It is clear that these functions $E_i$ do not converge monotonically. While the error
on some elements decreases, the error on others increases. This suggests that training
the net on element a might negatively influence the performance of the net on
another element b.
This is the basis for the process of forgetting. If we stop training an element, training
the other elements influences the performance on this element negatively and causes
the net to forget the element.
In figure 20 we see the results of the following experiment. We start training the net
on element 1. We can see that the performance on the elements 3 and 4 becomes
worse. Surprisingly, the performance on element 2 improves along with element 1.
After 50 rounds of training, we stop training element 1 and start training the other
three elements. Clearly, the error on element 1, $E_1$, increases dramatically and the
net ends up performing well on the other three. The net forgot element 1.
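The training schedule of this experiment can be sketched in Python as below. The network code mirrors a (2,2,1) net with tanh neurons, but the ordering of the XOR elements, the starting weights and all function names are assumptions of this sketch, so the curves it produces need not match figure 20; in particular, how strongly $E_1$ grows again in the second phase depends on the starting weights.

```python
import math


def forward(Wh, Wo, x):
    """Hidden outputs and net output of a (2,2,1) net with tanh neurons."""
    hidden = [math.tanh(Wh[k][0] * x[0] + Wh[k][1] * x[1] + Wh[k][2]) for k in range(2)]
    out = math.tanh(Wo[0] * hidden[0] + Wo[1] * hidden[1] + Wo[2])
    return hidden, out


def train_step(Wh, Wo, x, y, eta):
    """One back-propagation step on a single element; updates Wh and Wo in place."""
    hidden, out = forward(Wh, Wo, x)
    delta_o = (1 - out ** 2) * (y - out)
    # Hidden deltas use the output weights from before this step's update.
    delta_h = [(1 - hidden[k] ** 2) * delta_o * Wo[k] for k in range(2)]
    for k in range(2):
        Wo[k] += eta * delta_o * hidden[k]
        Wh[k][0] += eta * delta_h[k] * x[0]
        Wh[k][1] += eta * delta_h[k] * x[1]
        Wh[k][2] += eta * delta_h[k]
    Wo[2] += eta * delta_o


def errors(Wh, Wo, elements):
    """Squared error E_i on every training-set element."""
    return [(y - forward(Wh, Wo, x)[1]) ** 2 for x, y in elements]


# Assumed XOR training set (element 1 first) and arbitrary fixed starting weights.
elements = [((0, 1), 1), ((1, 0), 1), ((0, 0), 0), ((1, 1), 0)]
Wh = [[0.3, -0.2, 0.1], [-0.4, 0.5, -0.1]]
Wo = [0.2, -0.3, 0.1]
eta = 0.2

history = [errors(Wh, Wo, elements)]
x1, y1 = elements[0]
for _ in range(50):                  # phase 1: train element 1 only
    train_step(Wh, Wo, x1, y1, eta)
    history.append(errors(Wh, Wo, elements))
for _ in range(50):                  # phase 2: train elements 2, 3 and 4 only
    for x, y in elements[1:]:
        train_step(Wh, Wo, x, y, eta)
    history.append(errors(Wh, Wo, elements))
```

Each row of `history` holds the four errors $E_1, \ldots, E_4$ after one round, so plotting its columns gives a graph of the same kind as figure 20.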
Figure 19: Training per element, $\eta = 0.2$
Figure 20: This net forgets element 1
8 Application II: Curve fitting
In this section we will look at another application of three-layered networks. We
will try to use a network to represent a function $f: \mathbb{R} \to \mathbb{R}$. We use a network
with one input and one output neuron. We will take five sigmoidal hidden neurons.
The output neuron will have a linear activation function, because we want it to
have outputs not just in the $[-1, 1]$-interval. The rest of the network is similar to
that used in the previous section. Also, the training algorithm is analogous and
therefore not printed in the appendices. The matter of choosing $\eta$ was discussed
in the previous section and we will let it rest now. For the rest of the section, we
will use $\eta = 0.2$, which will turn out to give just as smooth and convergent training
results as in the previous section.
Figure 21: A (1,5,1)-neural network
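The forward pass of such a (1,5,1) network is short enough to write out. The sketch below is in Python; the function name `predict_151`, the parameter names and the example weights are assumptions chosen for illustration, not the weights of a trained net.

```python
import math


def predict_151(w_h, theta_h, w_o, theta_o, x):
    """Output of a (1,5,1) net: five tanh hidden neurons, one linear output neuron."""
    hidden = [math.tanh(w * x + t) for w, t in zip(w_h, theta_h)]
    return sum(w * i for w, i in zip(w_o, hidden)) + theta_o


# Made-up weights: only the first hidden neuron is connected to the output.
w_h = [1.0, 0.0, 0.0, 0.0, 0.0]
theta_h = [0.0, 0.0, 0.0, 0.0, 0.0]
w_o = [2.0, 0.0, 0.0, 0.0, 0.0]
theta_o = 3.0

at_zero = predict_151(w_h, theta_h, w_o, theta_o, 0.0)
at_one = predict_151(w_h, theta_h, w_o, theta_o, 1.0)
```

Both values lie well outside the $[-1, 1]$-interval, which a tanh output neuron could never produce; this is the reason for the linear output neuron.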
8.1 A parabola
We will try to fit the parabola $y = x^2$ and train the network with several inputs
from the $[0, 1]$-interval. The training set we use is:
$$S_T = \{(0, 0), (0.1, 0.01), (0.5, 0.25), (0.7, 0.49), (1, 1)\}$$
Training the network shows that the SSE converges to zero smoothly. In this section
we will focus less on the SSE and more on the behaviour of the trained network.
In the previous section, we wanted the network to perform well on the training set.
In this section we want the network to give accurate predictions of the value of $x^2$,
with $x$ the input value, and not just on the five training pairs. So we will not show
the SSE graph here. We will plot the network's prediction of the parabola.
As we can see, the network predicts the function really well. After 400 training
runs we have a fairly accurate prediction of the parabola. It is interesting to see whether
the network also has any knowledge of what happens outside the $[0, 1]$-interval, that is,
whether it can predict the value outside that interval. Figure 24 shows that the
network fails to do this. Outside the range of its training set, its performance is bad.
Figure 22: The network's prediction after 100 training runs
Figure 23: The network's prediction after 400 training runs
Figure 24: The network does not extrapolate
8.2 The sine function
In this subsection we will repeat the experiment from the previous subsection for
the sine function. Our training set is:
$$S_T = \{(0, 0), (0.8, 0.71), (1.57, 1), (2, 0.9), (3.14, 0)\}$$
These are the results of training a net on $S_T$:
Figure 25: The network's prediction after 400 runs
Figure 26: The network's prediction after 1200 runs
Obviously, this problem is a lot harder to solve for the network. After 400 runs, the
performance is not good yet and even after 1200 runs there is a noticeable difference
between the prediction and the actual value of the sine function.
8.3 Overtraining
An interesting phenomenon is that of overtraining. So far, the only measure of
performance has been the SSE on the training set, on which the two suggested
stopping criteria were based. In this section, we abandon the SSE-approach because
we are interested in the performance on sets larger than just the training set. SSE-
stopping criteria combined with this new objective of performance on larger sets
can lead to overtraining. We give an example. We trained two networks on:
$$S_T = \{(0, 0), (1, 1), (1.5, 2.25), (2, 4), (7, 49)\}$$
Here are the training results:
Figure 27: Network A predicting the parabola
Figure 28: Network B predicting the parabola
The question is which of the above networks functions best. With the SSE on $S_T$
in mind, the answer is obvious: network B has a very small SSE on the training
set. But we mentioned before that we wanted the network to perform on a wider
set. So maybe we should prefer network A after all.
In fact, network B is just a longer-trained version of network A. We call network B
overtrained. Using the discussed methods of stopping training can lead to situations
like this, so these criteria might not be satisfactory.
8.4 Some new criteria for stopping training
We are looking for a criterion to stop training which avoids the illustrated problems.
But the SSE is the only measure of performance we have so far. We will therefore
use a combination of the two.
As we are interested in the performance of the net on a wider set than just $S_T$,
we introduce a reference set $S_R$ with input-output elements that are not in $S_T$ but
represent the area on which we want the network to perform well. Now we define
the performance of the net as the SSE on $S_R$. When we start training a net with
$S_T$, the SSE on $S_R$ is likely to decrease, due to the generalizing capabilities of neural
networks. As soon as the network becomes overtrained, the SSE on $S_R$ increases.
Now we can use the stopping criteria from subsection 7.2 with the SSE on $S_R$.
We illustrate this technique in the case of the previous subsection. We define:
$$S_R = \{(2.5, 6.25), (3, 9), (4, 16), (5, 25), (6, 36)\}$$
and we calculate the SSE on both $S_T$ and $S_R$.
Figure 29: The SSE on the training set and the reference set (curves: SSE on $S_T$, SSE on $S_R$)
Using the old stopping criteria would obviously lead to network B. A stopping criterion
that would terminate training somewhere close to the minimum of the SSE
on $S_R$ would lead to network A.
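The reference-set idea can be sketched in Python as follows. The two prediction functions `net_a` and `net_b` are hypothetical stand-ins, not the trained networks A and B: `net_a` is slightly off everywhere, `net_b` reproduces the training set exactly but simply connects $(2, 4)$ and $(7, 49)$ by a straight line. The helper names and the values passed to `best_epoch` are likewise assumptions of this sketch.

```python
def sse_on(pairs, predict):
    """SSE of a prediction function on a set of (input, expected output) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in pairs)


def best_epoch(reference_curve):
    """Training run with the smallest SSE on the reference set."""
    return min(range(len(reference_curve)), key=reference_curve.__getitem__)


S_T = [(0, 0), (1, 1), (1.5, 2.25), (2, 4), (7, 49)]
S_R = [(2.5, 6.25), (3, 9), (4, 16), (5, 25), (6, 36)]


def net_a(x):
    """Stand-in for a moderately trained net: close to x^2 everywhere."""
    return x ** 2 + 0.3


def net_b(x):
    """Stand-in for an overtrained net: exact on S_T, poor in between."""
    return x ** 2 if x <= 2 else 4 + 9 * (x - 2)


train_a, train_b = sse_on(S_T, net_a), sse_on(S_T, net_b)  # B wins on S_T
ref_a, ref_b = sse_on(S_R, net_a), sse_on(S_R, net_b)      # A wins on S_R
stop = best_epoch([9.0, 4.0, 2.5, 3.1, 6.0])               # made-up SSE on S_R per run
```

Judged on $S_T$ alone the second stand-in looks perfect, while on $S_R$ it is far worse than the first; stopping at the run returned by `best_epoch` corresponds to stopping near the minimum of the SSE on $S_R$.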
In this case, the overtraining is caused by a bad training set $S_T$. Four of its
training pairs lie in the $[0, 2]$ interval and one lies quite far from that interval. Training
the net on $S_R$ would have given a much better result.
What we wanted to show, however, was what happens if we keep training too long on
a too limited training set: the net does indeed memorize the entries of the training
set, but its performance on the neighbourhood of this training set eventually gets worse.
8.5 Evaluating the curve fitting results
In the last few sections, we have not been interested in the individual neurons.
Instead, we just looked at the entire network and its performance. We did this
because we wanted the network to solve the problem. The strong feature of neural
networks is that we do not have to divide the problem into subproblems for the
individual neurons.
It can be interesting though, to look back. We will now analyze the role of every
neuron in the two trained curve-fitting networks.
We start with the 5 hidden neurons. Their output was the tanh over their activation:
$$i_k = \tanh(w^h_k x + \theta^h_k)$$
The output neuron takes a linear combination of these 5 tanh-curves:
$$o = \sum_{l=1}^{5} w^o_l i_l + \theta^o = \sum_{l=1}^{5} w^o_l \tanh(w^h_l x + \theta^h_l) + \theta^o$$
So the network is trying to fit 5 tanh-curves to the target curve as accurately as
possible. We can plot the 5 curves for both the fitted parabola and the sine.
In the case of the parabola, only one neuron has a non-trivial output; the other four
are more or less constant, a role $\theta^o$ could have fulfilled easily. This leads us to
assume that the parabola could have been fitted by a (1,1,1) network.
The sine function is more complex. Fitting a non-monotonic function with monotonic
ones obviously takes more functions. Neuron 2 has a strongly increasing function
as output. Because of the symmetry of the sine, we would expect another neuron
to have an equally decreasing output function. It appears that this task has been
divided between neurons 3 and 4: they both have a decreasing output function and
they would probably add up to the symmetric function we expected. The two other
neurons have a more or less constant value.
For the same reasons as we mentioned with the parabola, we might expect that this
problem could have been solved by a (1,3,1) or even a (1,2,1) network.
Analyzing the output of the neurons after training can give a good idea of the minimum
size of the network required to solve the problem. And we saw in section 4
that over-dimensioned networks can lose their generalizing capabilities fast. Analyzing
the neurons could lead to removing neurons from the network and improving
its generalizing capabilities.
Figure 30: The tanh-curves used to fit the parabola
Figure 31: The tanh-curves used to fit the sine
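One way to carry out such an analysis automatically is to measure how much each hidden neuron's contribution $w^o_l \tanh(w^h_l x + \theta^h_l)$ varies over the interval of interest. The Python sketch below does this; the function names, the tolerance and the weights are assumptions made up for illustration and are not the weights of the trained networks.

```python
import math


def contribution_ranges(w_h, theta_h, w_o, xs):
    """Spread (max - min) of each neuron's term w_o * tanh(w_h * x + theta_h) over xs."""
    ranges = []
    for wh, th, wo in zip(w_h, theta_h, w_o):
        values = [wo * math.tanh(wh * x + th) for x in xs]
        ranges.append(max(values) - min(values))
    return ranges


def nearly_constant(ranges, tol):
    """Indices of hidden neurons whose contribution varies by less than tol."""
    return [k for k, r in enumerate(ranges) if r < tol]


# Made-up weights for a (1,5,1) net, evaluated on the [0, 1]-interval.
w_h = [1.5, 0.01, 0.0, 4.0, 0.02]
theta_h = [0.0, 0.5, 1.0, 10.0, -2.0]
w_o = [1.0, 1.0, 1.0, 1.0, 1.0]
xs = [k / 10 for k in range(11)]

ranges = contribution_ranges(w_h, theta_h, w_o, xs)
candidates = nearly_constant(ranges, tol=0.05)
```

With these weights only the first neuron (index 0) does real work; the other four are candidates for removal, since their nearly constant contribution could be absorbed by the threshold of the output neuron.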
There is another interesting evaluation method. We could replace the outputs of the
hidden neurons with their Taylor polynomials. This would lead to a polynomial as output
function. The question is whether this polynomial would be identical to the Taylor
polynomial of the required output function. Since the functions coincide on an interval,
the polynomials would probably be identical in their first few coefficients.
This could lead to a theory on how big a network needs to be in order to fit a
function with a given Taylor polynomial. But this would take further research.
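To make the idea slightly more concrete, we can write down the first step. The Taylor series of tanh around 0 is standard; substituting it into the output of a (1,5,1) network, with $w^h_l, \theta^h_l$ the weights and thresholds of the hidden neurons and $w^o_l, \theta^o$ those of the output neuron, gives a polynomial in $x$. Truncating after the cubic term is an arbitrary choice made for this sketch, and the approximation is only meaningful where the activations $w^h_l x + \theta^h_l$ are small.

```latex
\tanh(z) = z - \frac{z^3}{3} + \frac{2z^5}{15} - \cdots
\qquad \left( |z| < \tfrac{\pi}{2} \right)

o(x) \approx \theta^o + \sum_{l=1}^{5} w^o_l
\left[ (w^h_l x + \theta^h_l) - \frac{(w^h_l x + \theta^h_l)^3}{3} \right]
```

Expanding the brackets and collecting powers of $x$ yields the coefficients that would have to be compared with those of the Taylor polynomial of the target function.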
9 Application III: Time Series Forecasting
In the previous section, we trained a neural network to adopt the input-output relation
of two familiar functions. We used training pairs $(x, f(x))$. And although
performance was acceptable after a small number of training runs, this application had
one shortcoming: it did not extrapolate at all. Neural networks will in general
perform weakly outside their training set, but a smart input-output choice can
overcome these limits.
In this section, we will look at time series. A time series is a vector $\vec{t}$, with fixed
distances between subsequent $t_i$. Examples of time series are the daily stock prices,
rainfall in the last twenty years and actually every measured quantity over discrete
time.
Predicting a future value of $y$, say $y_t$, is now done based on for instance $y_{t-1}, \ldots, y_{t-n}$
but not on $t$. In this application we will take $n = 2$ and try to train a network to
give valuable predictions of $y_t$.
Figure 32: The network used for TSF
We take a network similar to the network we used in the previous section. Only
we now have 2 input neurons. The 5 hidden neurons still have sigmoidal activation
functions and the output neuron has a linear activation function.
Of course we can look at any function $f(x)$ as a time series. We associate with
every entry $t_i$ of a vector $\vec{t}$ the value of $f(t_i)$. We will first try to train the network
on the sine function again.
We take $\vec{t} = \{0, 0.1, 0.2, \ldots, 6.3\}$ and $y_t = \sin(t)$. Training this network enables us to
predict the sine of $t$ given the sines of the two previous values of $t$: $t - 0.1$ and $t - 0.2$.
But we could also predict the sine of $t$ based on the sines of $t - 0.3$ and $t - 0.2$: these
two values give us a prediction of $\sin(t - 0.1)$ and thus we can predict $\sin(t)$. Of
course, basing a prediction on a prediction is less accurate than a prediction based
on two actual sine values. The results of the network are plotted in figure 33.
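The construction of the training pairs and the repeated "prediction on a prediction" can be sketched in Python. The function names are assumptions of this sketch, and `stand_in` is not the trained network: it is the exact recurrence $\sin(t) = 2\cos(h)\sin(t - h) - \sin(t - 2h)$ with $h = 0.1$, used here only so that the example runs without training anything.

```python
import math


def make_training_pairs(series, n=2):
    """Pairs ((y[t-n], ..., y[t-1]), y[t]) built from a time series."""
    return [(tuple(series[t - n:t]), series[t]) for t in range(n, len(series))]


def predict_deep(predict, window, depth):
    """Predict `depth` steps ahead by feeding each prediction back in as an input."""
    window = list(window)
    for _ in range(depth):
        window = window[1:] + [predict(window)]
    return window[-1]


def stand_in(window):
    """Stand-in for the trained net: exact two-step recurrence of the sine, h = 0.1."""
    return 2 * math.cos(0.1) * window[-1] - window[-2]


ts = [0.1 * k for k in range(64)]            # 0, 0.1, ..., 6.3
series = [math.sin(t) for t in ts]
pairs = make_training_pairs(series, n=2)     # what the network would be trained on

# "Predicting 3 deep": from sin(0) and sin(0.1), predict sin(0.4).
three_deep = predict_deep(stand_in, series[0:2], depth=3)
```

A trained network takes the place of `stand_in`; its errors then accumulate with every extra level of depth, which is why predicting 3 deep is less accurate than predicting 1 deep.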
Because we trained the net to predict based on previous behaviour, this network
will extrapolate, since the sine-curve's behaviour is periodical.
Figure 33: The network's performance after 400 training runs (curves: predicting 1 deep, 2 deep, 3 deep)
Figure 34: This network does extrapolate
10 Conclusions
In this paper we introduced a technique that in theory should lead to good training
behaviour for three-layered neural networks. In order to achieve this behaviour, a
number of important choices have to be made:
1. the choice of the learning parameter $\eta$
2. the choice of the training set $S_T$
3. the configuration of the network
4. the choice of a stopping criterion
In application I, we focussed on measuring the SSE and saw that its behaviour was
strongly dependent on the choice of $\eta$. A small $\eta$ leads to smooth and convergent
SSE-curves and therefore to satisfying training results. In our example, $\eta = 0.2$
was small enough, but the maximum value of $\eta$ may vary. If an SSE curve is not
convergent or is not smooth, one should always try a smaller $\eta$.
Also, choosing $S_T$ is crucial. In application II we saw that with a non-representative
training set, a trained network will not generalize well. And if you are not only interested
in performance on $S_T$, just getting the SSE small is not enough. The
reference-set-SSE method is a good way to reach a compromise: acceptable performance
on $S_T$ combined with a reasonable performance on its neighbourhood.
Neural networks seem to be a useful technique to learn the relation between data
sets in cases where we have no knowledge of what the characteristics of the relation
will be. The parameters determining the network's success are not always clear,
but there are enough techniques to make these choices.
A Source of the used M-files
A.1 Associative memory: assostore.m, assorecall.m
function w = assostore(S)
% ASSOSTORE(S) has as output the synaptic strength
% matrix w for the associative memory with contents
% the rowvectors of S.
for i=1 : N
for j=1 : n
function s= assorecall(sigma,w)
% ASSORECALL(g,w) returns the closest contents of
% memory w, stored by ASSOSTORE.
A.2 An example session
< M A T L A B (R) >
(c) Copyright 1984-94 The MathWorks, Inc.
All Rights Reserved
Dec 31 1994
>> S = [1,1,1,1,0,0,0,0; 0,0,0,0,1,1,1,1]
1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1
>> assorecall([1,1,0,0,0,0,0,0],w)
1 1 1 1 0 0 0 0
>> assorecall([0,0,0,0,0,0,1,1],w)
0 0 0 0 1 1 1 1
A.3 BPN: train221.m
function [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)
% train221 trains a (2,2,1) neural net with sigmoidal
% activation functions. It updates the weights Wh
% and Wo for input x and expected output y. eta is
% the learning parameter.
% Wh is 2x3 and Wo is 1x3; the last column holds the thresholds.
% Returns the updated matrices and the error E
% Usage: [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)
%% Computing the network's output %%
hi = Wh * [x(:); 1];   % activations of the two hidden neurons
i = tanh(hi);
ho = Wo * [i; 1];      % activation of the output neuron
o = tanh(ho);
E = y - o;
%% Back Propagation %%
% Computing deltas (tanh'(h) = 1 - tanh(h)^2)
deltao = (1 - o^2) * E;
deltah1 = (1 - (i(1))^2) * deltao * Wo(1);
deltah2 = (1 - (i(2))^2) * deltao * Wo(2);
% Updating Outputlayer weights
% With E = y - o, gradient descent on E^2/2 adds eta * delta * input.
Wo(1) = Wo(1) + eta * deltao * i(1);
Wo(2) = Wo(2) + eta * deltao * i(2);
Wo(3) = Wo(3) + eta * deltao;
% Updating Hiddenlayer weights
Wh(1,1) = Wh(1,1) + eta * deltah1 * x(1);
Wh(1,2) = Wh(1,2) + eta * deltah1 * x(2);
Wh(1,3) = Wh(1,3) + eta * deltah1;
Wh(2,1) = Wh(2,1) + eta * deltah2 * x(1);
Wh(2,2) = Wh(2,2) + eta * deltah2 * x(2);
Wh(2,3) = Wh(2,3) + eta * deltah2;