A Matlab Implementation Of Nn



  1. 1. A Matlab-implementation of neural networks. Jeroen van Grondelle, July 1997
  2. 2. Contents

     Preface
     1 An introduction to neural networks
     2 Associative memory
       2.1 What is associative memory?
       2.2 Implementing Associative Memory using Neural Networks
       2.3 Matlab-functions implementing associative memory
         2.3.1 Storing information
         2.3.2 Recalling information
     3 The perceptron model
       3.1 Simple perceptrons
       3.2 The XOR-problem
       3.3 Solving the XOR-problem using multi-layered perceptrons
     4 Multi-layered networks
       4.1 Learning
       4.2 Training
       4.3 Generalizing
     5 The Back-Propagation Network
       5.1 The idea of a BPN
       5.2 Updating the output-layer weights
       5.3 Updating the hidden-layer weights
     6 A BPN algorithm
       6.1 Choice of the activation function
       6.2 Configuring the network
       6.3 An algorithm: train221.m
     7 Application I: the XOR-gate
       7.1 Results and performance
       7.2 Some criteria for stopping training
         7.2.1 Train until SSE $\le a$
         7.2.2 Finding an SSE-minimum
       7.3 Forgetting
     8 Application II: Curve fitting
       8.1 A parabola
       8.2 The sine function
       8.3 Overtraining
       8.4 Some new criteria for stopping training
       8.5 Evaluating the curve fitting results
     9 Application III: Time Series Forecasting
       9.1 Results
     Conclusions
  3. 3. A Source of the used M-files
       A.1 Associative memory: assostore.m, assorecall.m
       A.2 An example session
       A.3 BPN: train221.m
     Bibliography
  4. 4. Preface

     Although conventional computers have been shown to be effective at a lot of demanding tasks, they still seem unable to perform certain tasks that our brains do so easily. These are tasks like, for instance, pattern recognition and various kinds of forecasting. That we do these tasks so easily has a lot to do with our learning capabilities. Conventional computers do not seem to learn very well.

     In January 1997, the NRC Handelsblad, in its weekly science subsection, published a series of four columns on neural networks, a technique that overcomes some of the above-mentioned problems. These columns aroused my interest in neural networks, of which I knew practically nothing at the time. As I was just looking for a subject for a paper, I decided to find out more about neural networks.

     In this paper, I will start by giving a brief introduction to the theory of neural networks. Section 2 discusses associative memory, which is a simple application of neural networks. It is a flexible way of storing information, allowing retrieval in an associative way.

     In sections 3 to 5, general neural networks are discussed. Section 3 shows the behaviour of elementary nets, and in sections 4 and 5 this theory is extended to larger nets. The back-propagation rule is introduced and a general training algorithm is derived from this rule.

     Sections 6 to 9 deal with three applications of the back-propagation network. Using this type of net, we solve the XOR-problem and we use this technique for curve fitting. Time series forecasting also deals with predicting function values, but is shown to be a more general technique than the introduced technique of curve fitting.

     Using these applications, I demonstrate several interesting phenomena and criteria concerning implementing and training networks, such as stopping criteria, overtraining and forgetting.

     Finally, I'd like to thank Rob Bisseling for his supervision during the process and Els Vermij for her numerous suggestions for improving this text.

     Jeroen van Grondelle
     Utrecht, July 1997
  5. 5. 1 An introduction to neural networks

     In this section a brief introduction is offered to the theory of neural networks. This theory is based on the actual physiology of the human brain and shows a great resemblance to the way our brains work.

     The building blocks of neural networks are neurons. These neurons are nodes in the network and they have a state that acts as output to other neurons. This state depends on the input the neuron is given by other neurons.

     [Figure 1: A neuron, with its input, activation function, threshold and output.]

     A neural network is a set of connected neurons. The connections are called synapses. If two neurons are connected, one neuron takes the output of the other neuron as input, according to the direction of the connection. Neurons are grouped in layers. Neurons in one layer only take input from the previous layer and give output to the next layer [1]. Every synapse is associated with a weight. This weight indicates the impact of the output on the receiving neuron. The state of neuron $i$ is defined as:

     $$s_i = f\left(\sum_k w_{ik} r_k - \theta\right) \qquad (1)$$

     where $r_k$ are the states of the neurons that give input to neuron $i$ and $w_{ik}$ represents the weight associated with the connection. $f(x)$ is the activation function. This function is often linear, or a sign function when we require binary output. The sign function is generally replaced by a continuous representation of this function. The value $\theta$ is called the threshold.

     [Figure 2: A single and multi-layered network.]

     [1] Networks with connections skipping layers are possible, but we will not discuss them in this paper.
  6. 6. A neural net is based on layers of neurons. Because the number of neurons is finite, there is always an input layer and an output layer, which only give output or take input respectively. All other layers are called hidden layers. A two-layered net is called a simple perceptron and the other nets multi-layered perceptrons. Examples are given in figure 2.
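The state formula (1) can be tried out numerically. The following is a minimal sketch in Python rather than Matlab, and it is not taken from the paper's M-files: the function names `neuron_state` and `sign`, the example weights and the convention that `sign(0)` returns 1 are all assumptions made for illustration.

```python
import math

def sign(x):
    """Sign activation with binary output in {-1, 1}; sign(0) is taken as 1 here."""
    return 1 if x >= 0 else -1

def neuron_state(weights, inputs, theta=0.0, f=math.tanh):
    """State s_i = f(sum_k w_ik * r_k - theta) of a single neuron, as in (1)."""
    activation = sum(w * r for w, r in zip(weights, inputs)) - theta
    return f(activation)

# Two inputs with weights 1 and threshold 1: the activation is 1 - 1 - 1 = -1.
s_binary = neuron_state([1.0, 1.0], [1, -1], theta=1.0, f=sign)
# The same neuron with a continuous replacement of the sign function.
s_smooth = neuron_state([1.0, 1.0], [1, -1], theta=1.0)
```

Passing the activation function as an argument keeps the sketch close to (1), where $f$ is left unspecified.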
  7. 7. 2 Associative memory

     2.1 What is associative memory?

     In general, a memory is a system that both stores information and allows us to recall this information. In computers, a memory will usually look like an array. An array consists of pairs $(i, \mu)$, where $\mu$ is the information we are storing, and $i$ is the index assigned to it by the memory on storage. We can recall the information by giving the index as input to the memory:

     [Figure 3: Memory recall: the memory M takes an index as input and gives the information as output.]

     This is not a very flexible technique: we have to know exactly the right index to recall the stored information. Associative memory works much more like our mind does. If we are, for instance, looking for someone's name, it will help to know where we met this person or what he looks like. With this information as input, our memory will usually come up with the right name. A memory is called an associative memory if it permits the recall of information based on partial knowledge of its contents.

     2.2 Implementing Associative Memory using Neural Networks

     Neural networks are very well suited to create an associative memory. Say we wish to store $p$ bitwords [2] of length $N$. We want to recall in an associative way, so we want to give as input a bitword and want as output the stored bitword that most resembles the input. So it seems the obvious thing to do is to take an $N$-neuron layer as both input and output layer and find a set of weights so that the system behaves like a memory for the bitwords $\xi^1, \ldots, \xi^p$:

     [Figure 4: An associative memory configuration: an input layer and an output layer of N neurons each.]

     [2] For later convenience, we will work with binary numbers that consist of 1's and $-1$'s, where $-1$ replaces the usual zero.

     If now a pattern $s$ is given as input to the system, we want $\xi^\mu$ to be the output, so
  8. 8. that $s$ and $\xi^\mu$ differ in as few places as possible. So we want the error $H_j$,

     $$H_j = \sum_{i=1}^{N} (s_i - \xi_i^j)^2 \qquad (2)$$

     to be minimal if $j = \mu$. This $H_j$ is called the Hamming distance [3].

     We will have a look at a simple case first. Say we want to store one pattern $\xi$. We will give an expression for $w$ and check that it suits our purposes:

     $$w_{ij} = \frac{1}{N}\,\xi_i \xi_j \qquad (3)$$

     If we give an arbitrary pattern $s$ as input, where $s$ differs in $n$ places from the stored pattern $\xi$, we get:

     $$S_i = \mathrm{sign}\left(\sum_{j=1}^{N} w_{ij} s_j\right) = \mathrm{sign}\left(\xi_i \frac{1}{N} \sum_{j=1}^{N} \xi_j s_j\right) \qquad (4)$$

     Now examine $\sum_{j=1}^{N} \xi_j s_j$. If $s_j = \xi_j$, then $\xi_j s_j = 1$, otherwise it is $-1$. Therefore, the sum equals $(N - n) + (-n)$, and:

     $$\mathrm{sign}\left(\xi_i \frac{1}{N} \sum_{j=1}^{N} \xi_j s_j\right) = \mathrm{sign}\left(\frac{1}{N}\,\xi_i (N - 2n)\right) = \mathrm{sign}\left(\left(1 - \frac{2n}{N}\right) \xi_i\right) \qquad (5)$$

     There are two important features to check. First, we can see that if we choose $s = \xi$, the output will be $\xi$. This is obvious, because $s$ and $\xi$ then differ in 0 places. We call this stability of the stored pattern. Secondly, we want to check that if we give an input reasonably close to $\xi$, we get $\xi$ as output. Obviously, if $n < N/2$, the output will equal $\xi$, because then $1 - 2n/N$ is positive and does not affect the sign of $\xi_i$. This is called convergence to a stored pattern.

     We now want to store all the words $\xi^1, \ldots, \xi^p$. Again we will give an expression and argue that it serves our purpose. Define

     $$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu \qquad (6)$$

     The method will be roughly the same. We will not give a full proof here. This would be too complex and is not of great importance to our argument. What is important is that we are proving stability of stored patterns and the convergence to a stored pattern of other input patterns. We did this for the case of one stored pattern. The method for multiple stored patterns is similar, except that proving the error terms to be small enough takes some advanced statistics. Therefore, we will prove up to the error terms here and then quote [Muller].

     Because the problem is becoming a little more complex now, we will discuss the activation value for an arbitrary output neuron $i$, usually referred to as $h_i$. First we will look at the output when a stored pattern (say $\xi^\nu$) is given as input:

     $$h_i = \sum_{j=1}^{N} w_{ij}\,\xi_j^\nu = \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu = \frac{1}{N} \left( \sum_{j=1}^{N} \xi_i^\nu \xi_j^\nu \xi_j^\nu + \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu \right) \qquad (7)$$

     [3] Actually, this is the Hamming distance when bits are represented by 0's and 1's. The square then acts as an absolute-value operator. With 1's and $-1$'s, we should scale the results by a constant factor 0.25 to obtain the Hamming distance.
  9. 9. The first part of the last expression is equal to $\xi_i^\nu$, by the same argument as in the one-pattern case. The second part is dealt with using laws of statistics, see [Muller].

     Now we give the system an input $s$ where $n$ neurons start out in the wrong state. Then generalizing (7) in the same way as (5) gives:

     $$h_i = \left(1 - \frac{2n}{N}\right) \xi_i^\nu + \frac{1}{N} \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu s_j \qquad (8)$$

     The first term is equal to that of the single-pattern storage case. The second is again proven small by [Muller]. Moreover, it is proven that

     $$h_i = \left(1 - \frac{2n}{N}\right) \xi_i^\nu + O\left(\sqrt{\frac{p-1}{N}}\right) \qquad (9)$$

     So if $p \ll N$, the system will still function as a memory for the $p$ patterns. In [Muller], it is proven that as long as $p < 0.14N$ the system will function well.

     2.3 Matlab-functions implementing associative memory

     In Appendix A.1 two Matlab functions are given, one for storing and one for recalling information in an associative memory as described above. Here we will make some short remarks on how this is done.

     2.3.1 Storing information

     The assostore-function works as follows. The function gets a binary matrix S as input, where the rows of S are the patterns to store. After determining its size, the program fills a matrix w with zeros. The values of S are transformed from (0,1) to (-1,1) notation. Now all values of w are computed using (6). This formula is implemented using the inner product of two columns in S. The division by N is delayed until the end of the routine.

     2.3.2 Recalling information

     assorecall.m is also a straightforward implementation of the procedure described above. After transforming from (0,1) to (-1,1) notation, s is computed as w times the transposed input. The sign of this s is transformed back to (0,1) notation.
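The storing and recalling steps described in 2.3.1 and 2.3.2 can be sketched as follows. This is a Python illustration written for this text, not the assostore.m and assorecall.m source from Appendix A.1; the names `asso_store` and `asso_recall`, the example patterns and the choice to map an activation of exactly zero to bit 1 are assumptions.

```python
def asso_store(patterns):
    """Build the weight matrix of (6) from a list of 0/1 patterns of equal length N."""
    n = len(patterns[0])
    xi = [[2 * b - 1 for b in p] for p in patterns]    # (0,1) -> (-1,1) notation
    w = [[0.0] * n for _ in range(n)]
    for p in xi:                                       # sum over the stored patterns
        for i in range(n):
            for j in range(n):
                w[i][j] += p[i] * p[j]
    return [[v / n for v in row] for row in w]         # division by N done last

def asso_recall(w, bits):
    """Recall the stored pattern closest to the 0/1 input `bits`."""
    s = [2 * b - 1 for b in bits]                      # (0,1) -> (-1,1) notation
    h = [sum(wij * sj for wij, sj in zip(row, s)) for row in w]
    return [1 if hi >= 0 else 0 for hi in h]           # sign, back to (0,1) notation

# Two stored patterns that are orthogonal in (-1,1) notation (illustrative choice).
stored = [[1, 1, 1, 1, 0, 0, 0, 0],
          [1, 0, 1, 0, 1, 0, 1, 0]]
w = asso_store(stored)
noisy = [0, 1, 1, 1, 0, 0, 0, 0]                       # first pattern, one bit flipped
recalled = asso_recall(w, noisy)
```

With these two patterns the cross term in (8) stays smaller than the first term, so the flipped bit is corrected and `recalled` equals the first stored pattern.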
  10. 10. 3 The perceptron model

     3.1 Simple perceptrons

     In the previous section, we have been looking at two-layered networks, which are also known as simple perceptrons. We did not really go into the details. An expression for $w$ was given and we simply checked that it worked for us. In this section we will look closer at what these simple perceptrons really do. Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in figure 5.

     [Figure 5: A (2,1) simple perceptron with inputs $s_1$, $s_2$ and output $S_1$.]

     This net has only two synapses, with weights $w_1$ and $w_2$, and we assume $S_1$ has threshold $\theta$. We allow inputs from the reals and take as activation function the sign function. Then the output is given by:

     $$S_1 = \mathrm{sign}(w_1 s_1 + w_2 s_2 - \theta) \qquad (10)$$

     There is also another way of looking at $S_1$. The inner product of $w$ and $s$ actually defines the direction of a line in the input space. $\theta$ determines the location of this line, and taking the sign over this expression determines whether the input is on one side of the line or on the other side. This can be seen more easily if we rewrite (10) as:

     $$S_1 = \begin{cases} 1 & \text{if } w_1 s_1 + w_2 s_2 > \theta \\ -1 & \text{if } w_1 s_1 + w_2 s_2 < \theta \end{cases} \qquad (11)$$

     So (2,1) simple perceptrons just divide the input space in two and return 1 on one half and $-1$ on the other. We visualize this in figure 6:

     [Figure 6: A simple perceptron dividing the input space.]

     We can of course generalize this to (n,1) simple perceptrons, in which case the perceptron defines an (n-1)-dimensional hyperplane in the n-dimensional input space. The hyperplane view of simple perceptrons also allows looking at not too complex multi-layered nets. As we saw before, every neuron in the first hidden layer is an
  11. 11. indicator of a hyperplane. But the next hidden layer again consists of indicators of hyperplanes, defined this time on the output of the first hidden layer. Multi-layered nets soon become far too complex to study in such a concrete way. In the literature we see that multi-layered nets are often regarded as black boxes. You know what goes in, you train until the output is right and you do not bother about the exact actions inside the box. But for relatively small nets, it can be very interesting to study the exact mechanism, as it can show whether or not a net is able to do the required job. This is exactly what we will do in the next subsection.

     3.2 The XOR-problem

     As we have seen, simple perceptrons are quite easy to understand and their behaviour is very well modelled. We can visualize their input-output relation through the hyperplane method. But simple perceptrons are very limited in the sort of problems they can solve. If we look, for instance, at logical operators, we can instantly see one of their limits. Although a simple perceptron is able to adopt the input-output relation of both the OR and the AND operator, it is unable to do the same for the Exclusive-Or gate, the XOR-operator.

          s1    s2    S
          -1    -1    -1
           1    -1     1
          -1     1     1
           1     1    -1

     Table 1: The truth table of the XOR-function

     We examine first the AND-implementation on a simple perceptron. The input-output relation would be:

     [Figure 7: Input-output relation for the AND-gate.]

     Here the input is on the axes, and a black dot means output 1 and a white dot means output $-1$. As we have seen in section 3.1, a simple perceptron will define a hyperplane, returning 1 on one side and $-1$ on the other. In figure 8, we choose a hyperplane for both the AND and the OR-gate input space. We immediately see why a simple perceptron will never simulate an XOR-gate, as this would take two hyperplanes, which a simple perceptron cannot define.
  12. 12. [Figure 8: A hyperplane choice for all three gates (AND, OR, XOR).]

     It is now almost trivial to find the simple perceptron solution to the first two gates. Obviously, $(w_1, w_2) = (1, 1)$ defines the direction of the chosen line. It follows that for the AND-gate $\theta = 1$ works well. In the same way we compute values for the OR-gate: $w_1 = 1$, $w_2 = 1$ and $\theta = -1$.

     When neural nets were only just invented and these obvious limits were discovered, most scientists regarded neural nets as a dead end. If problems this simple could not be solved, neural nets were never going to be very useful. The answer to these limits was multi-layered nets.

     3.3 Solving the XOR-problem using multi-layered perceptrons

     Although the XOR-problem cannot be solved by simple perceptrons, it is easy to show that it can be solved by a (2,2,1) perceptron. We could prove this by giving a set of suitable synapses and proving that it functions. We could also go deeper into the hyperplane method. Instead of these options, we will use some logical rules and express the XOR operator in terms of OR and AND operators, which we have seen we can handle. It can easily be shown that:

     $$(s_1 \text{ XOR } s_2) \Leftrightarrow (s_1 \wedge \neg s_2) \vee (\neg s_1 \wedge s_2) \qquad (12)$$

     We have neural net implementations of the OR and AND operator. Because we are using 1 and $-1$ as logical values, $\neg s_1$ is equal to $-s_1$. This makes it easy to put $s_1$ and $\neg s_2$ in a neural AND-gate. We will just negate the synapse that leads from $\neg s_2$ to the AND-gate and use $s_2$ as input instead of $\neg s_2$. This suggests the following (2,2,1)-solution: the input layer is used as usual and feeds the hidden layer, consisting of $hs_1$ and $hs_2$. These function as AND-gates as indicated in (12). $S$, the only element in the output layer, implements the OR-symbol in (12). By writing down the truth table for the system, it can easily be shown that the given net is correct.
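  13. 13. [Figure 9: A (2,2,1) solution of the XOR-gate: inputs $s_1$ and $s_2$ feed two hidden AND-neurons through weights $1$ and $-1$, each with $\theta = 1$; both hidden neurons feed the output neuron $S$ through weights 1, with $\theta = -1$.]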
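The truth-table check mentioned in section 3.3 is easy to carry out mechanically. The sketch below, in Python rather than Matlab, uses the weights and thresholds quoted in the text: $(1, 1)$ with $\theta = 1$ for AND, $(1, 1)$ with $\theta = -1$ for OR, and one negated synapse per hidden neuron. The function names are hypothetical, and which hidden neuron receives which negated input is an assumption that does not affect the result.

```python
def sign(x):
    """Sign function; the activations used below are never exactly zero."""
    return 1 if x > 0 else -1

def perceptron(w, s, theta):
    """Simple perceptron output sign(w . s - theta), as in (10)."""
    return sign(sum(wi * si for wi, si in zip(w, s)) - theta)

def and_gate(s1, s2):
    return perceptron((1, 1), (s1, s2), 1)

def or_gate(s1, s2):
    return perceptron((1, 1), (s1, s2), -1)

def xor_net(s1, s2):
    """(2,2,1) net implementing (12) with two hidden AND-neurons and an OR output."""
    hs1 = perceptron((1, -1), (s1, s2), 1)     # s1 AND (NOT s2)
    hs2 = perceptron((-1, 1), (s1, s2), 1)     # (NOT s1) AND s2
    return perceptron((1, 1), (hs1, hs2), -1)  # OR of the two hidden states

# Reproduce Table 1 row by row.
truth_table = [(s1, s2, xor_net(s1, s2)) for s2 in (-1, 1) for s1 in (-1, 1)]
```

Here `truth_table` lists the rows in the same order as Table 1.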
  14. 14. 4 Multi-layered networks

     In the previous section, we studied a very specific case of multi-layered networks. We could determine its synaptic strengths because it was a combination of several simple perceptrons, which we studied quite thoroughly before, and because we could reduce the original problem to several subproblems that we had already solved using neural nets.

     In the preface, several tasks were mentioned, such as character recognition, time series forecasting, etc. These are all very demanding tasks, which need considerably larger nets. These tasks are also problems we do not understand so well. So we are not able to define subproblems that we could solve first. The strong feature of neural nets that we are going to use here is that, by training, the net will learn the input-output relation we are looking for. We are not concerned with the individual function of neurons in this section; we will consider the net as the earlier mentioned black box.

     4.1 Learning

     Let us discuss a concrete example here. A widely used application of neural nets is that of character recognition. The input of our black box could then be, for instance, an $8 \times 8$ matrix of ones and zeros, representing a scan of a character. The output could consist of 26 neurons, representing the 26 characters of the alphabet.

     Since we do not have a concrete solution, in for instance hyperplane or logical terms, to implement in a net, we choose more or less at random a net configuration and synaptic strengths for this net. Not all net configurations are able to learn all problems (we have seen a very obvious example of that before), but there are guidelines and rough estimates of how large a net has to be. We will not go into that right now.

     Given our net, every scan given as input will result in an output. It is not very likely that this net will do what we want from the start, since we initialized it randomly. It all depends on finding the right values for the synaptic strengths. We need to train the net. We give it an input and compare the output with the result we wanted to get. Then we adjust the synaptic strengths. This is done by learning algorithms, of which the earlier mentioned Back-Propagation rule is an example. We will discuss the BPN-rule later. By repeating this procedure often with different examples, the net will learn to give the right output for a given input.

     4.2 Training

     We have mentioned the word training several times now. It refers to the situation where we show the system several inputs and provide the required output as well. The net is then adjusted. By doing this the net learns.

     The contents of the training set are of crucial importance. First of all, it has to be large enough. In order to get the system to generalize, a large set of examples has to be available. A network trained with a small set will probably behave like a memory, but a limited training set will never evoke the behaviour we are looking for: adopting an error-tolerant, generalizing input-output relation.

     The set also has to be sufficiently rich. The notion we want the neural net to recognize has to be the only notion that is present everywhere in our training set. As this may sound a bit vague, an example might be necessary. If we have a set of pictures of blond men and dark women, we could teach a neural net to determine the sex of a person. But it might very well be that on showing this trained system a blond girl, the net would say it's a boy. There are obviously two notions at play here: someone's sex and the colour of his or her hair.
  15. 15. In the theory of neural nets, one comes across more of these rather vague problems. The non-deterministic nature of training means that trained systems can get overtrained and can even forget. We will not pay too much attention to these phenomena now. We will discuss them later, when we have practical examples to illustrate them.

     4.3 Generalizing

     There is an aspect of learning that we have not yet discussed. We defined training as adjusting a neural net to the right input-output relation. This relation is then defined by the training set. This suggests that we train the network to give the right output at every input from the training set. If this is all that the system can achieve, it would be nothing more than a memory, which we discussed in section 2.

     We also want the system to give output on input that is not in the training set. And we want this output to be correct. By giving the system a training set, we want the system to learn about other inputs as well. Of course these will have to be close enough to the ones in the training set.

     The right network configuration is crucial for the system to learn to generalize. If the network is too large, it will be able to memorize the training set. If it is too small, it simply will not be able to master the problem. So configuring a net is very important. There are basically two ways of achieving the right size. One is to begin with a rather big net. After some training, the non-active neurons and synapses are removed, thus leaving a smaller net, which can be further trained. This technique is called pruning. The other way is rather the opposite. Start with a small net and enlarge it if it does not succeed in solving the problem. This gives you the smallest net that does the job, but you will have to train a whole new net every time you add some neurons.
  16. 16. 5 The Back-Propagation Network

     5.1 The idea of a BPN

     In the previous section we mentioned a learning algorithm. This algorithm updated the synaptic strengths after comparing an expected output with an actual output. The algorithm should alter the weights to minimize the error next time. One of the algorithms developed is the Error Back-Propagation algorithm. This is the algorithm we will describe here and implement in the next section. We will discuss a specific case in detail: we will derive and implement this rule for a three-layered network.

     [Figure 10: The network configuration we will solve: an input layer $x_1, \ldots, x_N$, a hidden layer $i_1, \ldots, i_L$ with $i_L = f_L^h(h_L^h)$, and an output layer $o_1, \ldots, o_M$ with $o_M = f_M^o(h_M^o)$.]

     We want to minimize the error between expected output $y$ and actual output $o$. From now on we will be looking at a fixed training-set pair: an input vector $x$ and an expected output $y$. The actual output $o$ is the output that the net gives for the input vector. We define the total error:

     $$E = \frac{1}{2} \sum_k \delta_k^2 \qquad (13)$$

     where $\delta_k$ is the difference between the expected and actual output of output neuron $k$: $\delta_k = (y_k - o_k)$. Since all the information of the net is in its weights, we could look at $E$ as a function of all its weights. We could regard the error as a surface in $W \times \mathbb{R}$, where $W$ is the weights space. This weights space has as dimension the number of synapses in the entire network. Every possible state of this network is represented by a point $(w^h, w^o)$ in $W$.

     Now we can look at the derivative of $E$ with respect to $W$. This gives us the gradient of $E$, which always points in the direction of steepest ascent of the surface. So $-\mathrm{grad}(E)$ points in the direction of steepest descent. Adjusting the net to a point $(w^h, w^o)$ in the direction of $-\mathrm{grad}(E)$ makes the net perform better next time, provided the step is not too large. This procedure is visualized in figure 11.
  17. 17. [Figure 11: The error as a function of the weights: the error surface $E$ over W-space, with a step $\Delta E$ in the direction of $-\mathrm{grad}(E)$.]

     5.2 Updating the output-layer weights

     We will calculate the gradient of $E$ in two parts and start with the output-layer weights:

     $$\frac{\partial E}{\partial w_{kj}^o} = -(y_k - o_k)\, \frac{\partial f_k^o}{\partial (h_k^o)}\, \frac{\partial (h_k^o)}{\partial w_{kj}^o} \qquad (14)$$

     Because we have not yet chosen an activation function $f$, we cannot yet evaluate $\frac{\partial f_k^o}{\partial (h_k^o)}$. We will refer to it as $f_k^{o\prime}(h_k^o)$. What we do know is:

     $$\frac{\partial (h_k^o)}{\partial w_{kj}^o} = \frac{\partial}{\partial w_{kj}^o} \left( \sum_{l=1}^{L} w_{kl}^o i_l + \theta_k^o \right) = i_j \qquad (15)$$

     Combining the previous equations gives:

     $$\frac{\partial E}{\partial w_{kj}^o} = -(y_k - o_k)\, f_k^{o\prime}(h_k^o)\, i_j \qquad (16)$$

     Now we want to change $w_{kj}^o$ in the direction of $-\frac{\partial E}{\partial w_{kj}^o}$. We define:

     $$\delta_k^o = (y_k - o_k)\, f_k^{o\prime}(h_k^o) \qquad (17)$$

     Then we can update $w^o$ according to:

     $$w_{kj}^o(t+1) = w_{kj}^o(t) + \eta\, \delta_k^o\, i_j \qquad (18)$$

     where $\eta$ is called the learning-rate parameter. $\eta$ determines the learning speed, the extent to which $w$ is adjusted in the gradient's direction. If it is too small, the system will learn very slowly. If it is too big, the algorithm will adjust $w$ too strongly and the optimal situation will not be reached. The effects of different values of $\eta$ are discussed further in section 7.
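  18. 18. 5.3 Updating the hidden-layer weights

     To update the hidden-layer weights we will follow roughly the same procedure as in section 5.2. In section 5.2 we looked at $E$ as a function of the output-neuron values. Now we will look at $E$ as a function of the hidden-neuron values $i_j$:

     $$E = \frac{1}{2} \sum_k (y_k - o_k)^2 = \frac{1}{2} \sum_k \left(y_k - f_k^o(h_k^o)\right)^2 = \frac{1}{2} \sum_k \left(y_k - f_k^o\Big(\sum_j w_{kj}^o i_j + \theta_k^o\Big)\right)^2$$

     And now we examine $\frac{\partial E}{\partial w_{ji}^h}$:

     $$\frac{\partial E}{\partial w_{ji}^h} = \frac{1}{2} \sum_k \frac{\partial}{\partial w_{ji}^h} (y_k - o_k)^2 = -\sum_k (y_k - o_k)\, \frac{\partial o_k}{\partial h_k^o}\, \frac{\partial h_k^o}{\partial i_j}\, \frac{\partial i_j}{\partial h_j^h}\, \frac{\partial h_j^h}{\partial w_{ji}^h}$$

     We can deal with these four derivatives in the same way as in section 5.2. The first and the third are clearly equal to the unknown derivatives of $f$. The second is equal to:

     $$\frac{\partial (h_k^o)}{\partial i_j} = \frac{\partial}{\partial i_j} \left( \sum_{l=1}^{L} w_{kl}^o i_l + \theta_k^o \right) = w_{kj}^o \qquad (19)$$

     For the same reason, the last derivative is $x_i$. So we have:

     $$\frac{\partial E}{\partial w_{ji}^h} = -\sum_k (y_k - o_k)\, f_k^{o\prime}\, w_{kj}^o\, f_j^{h\prime}\, x_i \qquad (20)$$

     We define a $\delta^h$ similar to the one in (17):

     $$\delta_j^h = f_j^{h\prime}(h_j^h) \sum_k (y_k - o_k)\, f_k^{o\prime}(h_k^o)\, w_{kj}^o = f_j^{h\prime}(h_j^h) \sum_k \delta_k^o\, w_{kj}^o \qquad (21)$$

     Looking at the definition of $\delta^h$, we can see that updating $w_{ji}^h$ in the direction of $-\frac{\partial E}{\partial w_{ji}^h}$ amounts to:

     $$w_{ji}^h(t+1) = w_{ji}^h(t) + \eta\, \delta_j^h\, x_i \qquad (22)$$

     where $\eta$ is again the learning parameter.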
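One way to gain confidence in (20) and (21) is to compare them with a finite-difference estimate of the gradient. The sketch below does this in Python for a small net with $f = \tanh$ in both layers, an activation choice borrowed from section 6. The weights, thresholds and training pair are arbitrary illustrative values, not taken from the paper, and all function names are hypothetical.

```python
import math

def forward(wh, wo, th_h, th_o, x):
    """Forward pass with f = tanh; returns hidden states i_j and outputs o_k."""
    i = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + t)
         for row, t in zip(wh, th_h)]
    o = [math.tanh(sum(w * ij for w, ij in zip(row, i)) + t)
         for row, t in zip(wo, th_o)]
    return i, o

def error(wh, wo, th_h, th_o, x, y):
    """Total error E of (13) for one training pair (x, y)."""
    _, o = forward(wh, wo, th_h, th_o, x)
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))

def deltas(wh, wo, th_h, th_o, x, y):
    """Output deltas (17) and hidden deltas (21), using tanh' = 1 - tanh^2."""
    i, o = forward(wh, wo, th_h, th_o, x)
    d_o = [(yk - ok) * (1.0 - ok ** 2) for yk, ok in zip(y, o)]
    d_h = [(1.0 - i[j] ** 2) * sum(d_o[k] * wo[k][j] for k in range(len(wo)))
           for j in range(len(wh))]
    return i, d_o, d_h

def numeric_grad_hidden(wh, wo, th_h, th_o, x, y, j, i_idx, eps=1e-6):
    """Central-difference estimate of dE/dw^h_ji."""
    up = [row[:] for row in wh]
    dn = [row[:] for row in wh]
    up[j][i_idx] += eps
    dn[j][i_idx] -= eps
    return (error(up, wo, th_h, th_o, x, y)
            - error(dn, wo, th_h, th_o, x, y)) / (2 * eps)

# Arbitrary illustrative values (assumptions, not from the paper).
wh = [[0.3, -0.2], [0.1, 0.4]]      # hidden-layer weights w^h_ji
wo = [[0.5, -0.6]]                  # output-layer weights w^o_kj
th_h, th_o = [0.1, -0.1], [0.2]     # thresholds, added as in (15)
x, y = [1.0, -1.0], [1.0]

i, d_o, d_h = deltas(wh, wo, th_h, th_o, x, y)
# (20) states dE/dw^h_ji = -delta^h_j * x_i; compare with the numerical estimate.
max_gap = max(abs(-d_h[j] * x[k]
                  - numeric_grad_hidden(wh, wo, th_h, th_o, x, y, j, k))
              for j in range(2) for k in range(2))
```

If the derivation is implemented correctly, `max_gap` is of the order of the finite-difference error, far below the size of the gradient itself.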
  19. 19. 6 A BPN algorithm

     In the next sections we will demonstrate a few phenomena as described in section 4, using an application of a (2,2,1) back-propagation network. We have seen this relatively simple network before in subsection 3.3. The XOR-gate described there will be the first problem we solve with an application of the BPN. In this section we will formulate a general (2,2,1)-BPN training algorithm.

     6.1 Choice of the activation function

     Since we will be simulating the XOR-gate, which has outputs $-1$ and 1 only, it is an obvious choice to use a sigmoidal activation function. We will use $f(x) = \tanh(x)$.

     [Figure 12: A sigmoidal activation function: $f(x) = \tanh(x)$ and its derivative $df/dx = 1 - \tanh^2(x)$, plotted for $-5 \le x \le 5$.]

     We will also need its derivative. Since $\tanh(x) = \frac{\sinh(x)}{\cosh(x)}$, we have:

     $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

     Differentiating this expression yields:

     $$\tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)$$

     6.2 Configuring the network

     We are going to use a three-layer net, with two input neurons, two hidden neurons and one output neuron. As we have already chosen the activation function, we now only have to decide how to implement the thresholds. In section 5 we did not treat them separately. This was not necessary, since we will show here that they are easily treated as ordinary weights.

     We add a special neuron to both the input and the hidden layer and we define the state of this neuron to be equal to 1. This neuron therefore takes no input from previous layers, since it would have no impact anyway. The weight of a synapse between
  20. 20. this special neuron and one in the next layer then acts as the threshold for this neuron. When the activation value for a neuron is computed, it now looks like:

     $$h_j = \sum_{i=1}^{k} w_{ij} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{ij} s_i$$

     Neuron $k+1$ is the special neuron that always has a state equal to 1. In figure 13, we give an example of such a net.

     [Figure 13: A (2,2,1) neural net with weights as thresholds: the input layer is $(x_1, x_2, 1)$, the hidden layer is $(i_1, i_2, 1)$ and the output neuron is $o$.]

     This approach enables us to implement the network by using the techniques from section 5, without paying special attention to the thresholds.

     6.3 An algorithm: train221.m

     Given the above-mentioned choices and the explicit method described in section 5, we can now implement a training function for the given situation. Appendix A.3 gives the source of train221.m. This function is used as follows:

     [WH,WO,E] = train221(Wh,Wo,x,y,eta)

     where the inputs Wh and Wo represent the current weights in the network, (x,y) is a training input-output pair and eta is the learning parameter. The outputs WH and WO are the updated weight matrices and E is the error, as computed before the update.
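A training step with the same call shape as train221 could look as follows. This Python sketch is not the Matlab source from Appendix A.3: the name `train221_sketch`, the layout of `Wh` (2 rows of 3 weights) and `Wo` (3 weights), the use of (13) for the returned error, and the example values are all assumptions. The thresholds are handled as the weights of an extra neuron with state 1, as described in section 6.2.

```python
import math

def train221_sketch(Wh, Wo, x, y, eta):
    """One back-propagation step for a (2,2,1) tanh net.

    Wh: 2x3 list; row j holds the weights from (x1, x2, 1) to hidden neuron j.
    Wo: list of 3 weights from (i1, i2, 1) to the output neuron.
    Returns updated copies (WH, WO) and the error E computed before the update.
    """
    xb = [x[0], x[1], 1.0]                                      # inputs plus constant neuron
    i = [math.tanh(sum(w * s for w, s in zip(row, xb))) for row in Wh]
    ib = i + [1.0]                                              # hidden states plus constant neuron
    o = math.tanh(sum(w * s for w, s in zip(Wo, ib)))
    E = 0.5 * (y - o) ** 2                                      # (13) with a single output
    d_o = (y - o) * (1.0 - o ** 2)                              # (17), tanh' = 1 - tanh^2
    d_h = [(1.0 - i[j] ** 2) * d_o * Wo[j] for j in range(2)]   # (21)
    WO = [w + eta * d_o * s for w, s in zip(Wo, ib)]            # (18)
    WH = [[w + eta * d_h[j] * s for w, s in zip(Wh[j], xb)]     # (22)
          for j in range(2)]
    return WH, WO, E

# Arbitrary starting weights and one training pair (illustrative values).
Wh0 = [[0.3, -0.2, 0.1], [0.1, 0.4, -0.1]]
Wo0 = [0.5, -0.6, 0.2]
WH1, WO1, E_before = train221_sketch(Wh0, Wo0, (1.0, -1.0), 1.0, 0.1)
_, _, E_after = train221_sketch(WH1, WO1, (1.0, -1.0), 1.0, 0.1)
```

Because the function returns new lists instead of modifying its arguments, repeated calls can be chained in the same way as the Matlab call `[WH,WO,E] = train221(Wh,Wo,x,y,eta)`.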
  21. 21. 7 Application I: the XOR-gate 7.1 Results and performance We will now use the algorithm to solve the XOR-gate problem. First, we de ne our training set: ST = (0 0 0) (0 1 1) (1 0 1) (1 1 0) The elements of this set are given as input to the training algorithm introduced in the previous section. This is done by a special m- le, which also stores the error terms. These error terms enable us to analyse the training behaviour of the net. In the rest of the section, we will describe several phenomena, using the information the error terms give us. When looking at the performance of the net, we can look, for instance, at the error of the net on an input-ouput pair of the training set, (xi yi ): Ei = (yi ; oi )2 with yi as the expected output and oi as the output of the net with xi as in- put. A measure of performance on the entire training set is the Sum of Squared Errors(SSE): X SSE = Ei i Clearly, the SSE is an upper bound for every Ei. We will use this frequently when examining the net's performance. If we want the error on every training set element to converge to zero, we just compute the SSE and check that it does this. Now we will have a rst look at the results of training the net on ST . Figure 14 shows some of the results: #iters E1 E2 E3 E4 SSE 0.2 100 0.0117 0.1694 0.1077 0.4728 0.7615 200 0.0003 0.0105 0.0110 0.0009 0.0226 300 2:6 10;5 0.0032 0.0032 0.0001 0.0065 400 8:4 10;6 0.0018 0.0018 2:3 10;5 0.0036 0.4 100 0.0037 0.0373 0.0507 0.0138 0.1055 200 0.0004 0.0031 0.0032 0.0082 0.0149 300 2:3 10;6 0.0013 0.0013 0.0029 0.0055 400 0.0001 0.0008 0.0008 0.0019 0.0036 Figure 14: Some training results As we see in gure 14, both training sessions are succesful, as the SSE becomes very small. We see that with larger, SSE converges to zero faster. This suggests taking large values for . To see if this stategy would be succesful, we repeat the experiment with various values of . In gure 15, the SSE is plotted versus the number of training laps for various . 
We can see that, for $\eta = 0.2$, the SSE converges to zero. For $\eta = 0.4$, the SSE converges faster, but less smoothly: after 150 training laps, it shows a small peak. Taking larger $\eta$, as suggested above, does not seem very profitable. When $\eta$ is 0.6, the SSE has strong oscillations, and with $\eta = 0.8$ the SSE does not even converge to zero.

Figure 15: SSE vs. number of training laps, for various $\eta$ (curves for eta = .1, .2, .4, .6 and .8)

This non-convergence for large $\eta$ can be explained by the error-surface view used in section 5. We regard the error as a function of all the weights. This leads to an error surface on the weight space. We used the gradient of E in this space to minimize the error. $\eta$ expresses the extent to which we change the weights in the direction opposite to the gradient. In this way we hope to find a minimum in this space. If $\eta$ is too large, we can jump over this minimum, approach it from the other side and jump over it again. Thus, we will never reach it and the error will not converge.

The conclusion seems to be that the choice of $\eta$ is important. If it is too small, the network will learn very slowly. Larger $\eta$ lead to faster learning, but the network might not reach the optimal solution.

We have now trained a network to give the right outputs for inputs from the training set. In this specific case, these are the only inputs we are interested in. But the net does give outputs for other inputs as well. Figure 16 shows the output for inputs in the square between the training-set inputs. The graph clearly shows the XOR-gate's outputs on the four corners of the surface.

In this case, we were only interested in the training-set elements. By reproducing these four states correctly, the net is really only memorizing what it was taught. Later, we will look at cases where we are interested in the outputs for inputs outside the training set. Then we are investigating the generalizing capabilities of neural networks.
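The jumping-over effect can be reproduced on the simplest possible error surface. The sketch below assumes a one-weight error $E(w) = w^2$, which is not the XOR net's error surface, so the step sizes in `etas` are not comparable to the values of $\eta$ used in the experiments. It only illustrates how the step size alone decides between smooth convergence, oscillation and divergence.

```matlab
% Gradient descent on the toy error E(w) = w^2, with gradient dE/dw = 2*w.
% Each step is w <- w - eta*2*w = (1 - 2*eta)*w.
etas   = [0.1 0.4 0.6 1.1];   % illustrative step sizes for this toy surface
nSteps = 20;
w      = zeros(numel(etas), nSteps + 1);
w(:,1) = 1;                   % same starting weight for every eta

for k = 1:nSteps
    grad     = 2 * w(:,k);
    w(:,k+1) = w(:,k) - etas(:) .* grad;
end

E = w.^2;                     % error after each step, one row per eta
% eta = 0.1 and 0.4: w keeps its sign and shrinks (smooth convergence)
% eta = 0.6: w changes sign every step but still shrinks (oscillation)
% eta = 1.1: |w| grows every step (no convergence)
```

On this surface the minimum is jumped over as soon as the factor `1 - 2*eta` becomes negative, and the iteration diverges once its magnitude exceeds one.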
Figure 16: The output of the XOR-net

7.2 Some criteria for stopping training

When using neural networks in applications, we will in general not be interested in all the SSE curves etc. In these cases, training is just a way to get a well-performing network, which, after stopping training (4), can be used for the required purposes. There are several criteria to stop training.

(4) Unless the input-output relation changes over time, in which case we have to continue training on the new situations.

7.2.1 Train until $SSE \le a$

A very obvious method is to choose a value $a$ and stop training as soon as the SSE gets below this value. In the examples of possible SSE-curves we have seen so far, the SSE, for suitable $\eta$, converges more or less monotonically to zero. So it is bound to decrease below any required value.

Choosing this value depends on the accuracy you demand. As we saw before, the SSE is an upper bound for the $E_i$, which was the square of $y_i - o_i$. So if we tolerate a difference of $c$ between the expected output and the net's output, we want:

$$E_i \le c^2 \quad \forall i$$

Since the SSE is an upper bound, we could use $SSE \le c^2$ as a stopping criterion. The advantage of this criterion is that you know a lot about the performance of the net if training is stopped by it. A disadvantage is that training might not stop in some situations. Some situations are too complex for a net to reach the given accuracy.

Figure 17: Stopping training after 158 laps, when $SSE \le 0.1$

7.2.2 Finding an SSE-minimum

The disadvantage of the previous method suggests another method. If the SSE does not converge to zero, we want to at least stop training at its minimum. We might train a net for a very long time, plot the SSE and look for its global minimum. Then we retrain the net under the same circumstances and stop at this optimal point. This is not realistic however, since training in complex situations can take a considerable amount of time and complete retraining would take too long.

Another approach is to stop training as soon as the SSE starts growing. For small $\eta$, this might work, since we noticed before that choosing a small $\eta$ leads to very smooth and monotonic SSE-curves. But there is still a big risk of ending up in a local SSE-minimum. Training would stop just before a little peak in the SSE-curve, although training on would soon lead to even better results.

The advantage is obvious: given a complex situation with non-convergent SSE, we still reach a relatively optimal situation. The disadvantage is obvious too: this method might very well lead to suboptimal stopping. We can limit this risk by choosing a small $\eta$ and maybe by combining the two previous techniques: train the network through the first phase with the first criterion and then find a minimum in the second, smoother phase with the second criterion.

7.3 Forgetting

Forgetting is another phenomenon we will demonstrate here. So far our training has consisted of successively showing the net all the training-set elements an equal number of times. We will show that this is very important. Figure 19 shows the error during training on all the individual training-set elements.
Figure 18: Finding an SSE-minimum

It is clear that these functions $E_i$ do not converge monotonically. While the error on some elements decreases, the error on others increases. This suggests that training the net on element a might negatively influence the performance of the net on another element b.

This is the basis for the process of forgetting. If we stop training an element, training the other elements influences the performance on this element negatively and causes the net to forget the element.

In figure 20 we see the results of the following experiment. We start by training the net on element 1 only. We can see that the performance on elements 3 and 4 becomes worse. Surprisingly, the performance on element 2 improves along with element 1. After 50 rounds of training, we stop training element 1 and start training the other three elements. Clearly, the error on element 1, $E_1$, increases dramatically and the net ends up performing well on the other three. The net forgot element 1.
Figure 19: Training per element (curves: E1, E2, E3, E4 and SSE)

Figure 20: This net forgets element 1 (curves: E1, E2, E3, E4 and SSE; plot title: eta = .2)
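The stopping criteria of section 7.2 are easy to state as code once the SSE is recorded after every lap. The sketch below runs them on a synthetic SSE curve with a small peak; the curve, the thresholds and all variable names are made up for illustration.

```matlab
% Synthetic SSE history: decreasing, with a small bump at laps 6-7.
sse = [1.5 1.1 0.8 0.5 0.3 0.35 0.4 0.2 0.08 0.05];

% Criterion 1: stop at the first lap where SSE <= a.
a       = 0.1;
stopThr = find(sse <= a, 1);           % empty if the level is never reached

% Criterion 2: stop as soon as SSE starts growing.
stopGrow = find(diff(sse) > 0, 1);     % last lap before the first increase

% Combination: apply criterion 2 only after SSE has dropped below a
% coarser level a1 (the "first phase").
a1     = 0.25;
phase2 = find(sse <= a1, 1);
rel    = find(diff(sse(phase2:end)) > 0, 1);
if isempty(rel)
    stopBoth = numel(sse);             % no increase seen: train to the end
else
    stopBoth = phase2 + rel - 1;
end
```

On this curve the second criterion alone stops at lap 5, just before the peak, although training on reaches a much lower SSE. That is the local-minimum risk described in section 7.2.2.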
8 Application II: Curve fitting

In this section we will look at another application of three-layered networks. We will try to use a network to represent a function $f : \mathbb{R} \to \mathbb{R}$. We use a network with one input and one output neuron. We will take five sigmoidal hidden neurons. The output neuron will have a linear activation function, because we want it to have outputs not just in the $[-1, 1]$-interval. The rest of the network is similar to that used in the previous section. Also, the training algorithm is analogous and therefore not printed in the appendices.

The matter of choosing $\eta$ was discussed in the previous section and we will let it rest now. For the rest of the section, we will use $\eta = 0.2$, which will turn out to give just as smooth and convergent training results as in the previous section.

Figure 21: A (1,5,1)-neural network (input x, output f(x))

8.1 A parabola

We will try to fit the parabola $y = x^2$ and train the network with several inputs from the $[0, 1]$-interval. The training set we use is:

$$S_T = \{(0, 0),\ (0.1, 0.01),\ (0.5, 0.25),\ (0.7, 0.49),\ (1, 1)\}$$

Training the network shows that the SSE converges to zero smoothly. In this section we will focus less on the SSE and more on the behaviour of the trained network. In the previous section, we wanted the network to perform well on the training set. In this section we want the network to give accurate predictions of the value of $x^2$, with $x$ the input value, and not just on the five training pairs. So we will not show the SSE graph here. Instead, we will plot the network's prediction of the parabola.

As we can see, the network predicts the function really well. After 400 training runs we have a fairly accurate prediction of the parabola. It is interesting to see whether the network also has any knowledge of what happens outside the $[0, 1]$-interval, that is, whether it can predict the function value outside that interval. Figure 24 shows that the network fails to do this. Outside the range of its training set, its performance is bad.
Figure 22: The network's prediction after 100 training runs (curves: prediction, actual value)

Figure 23: The network's prediction after 400 training runs (curves: prediction, actual value)
Figure 24: The network does not extrapolate (curves: prediction, actual value)
8.2 The sine function

In this subsection we will repeat the experiment from the previous subsection for the sine function. Our training set is:

$$S_T = \{(0, 0),\ (0.8, 0.71),\ (1.57, 1),\ (2, 0.9),\ (3.14, 0)\}$$

These are the results of training a net on $S_T$:

Figure 25: The network's prediction after 400 runs (curves: prediction, actual value)

Figure 26: The network's prediction after 1200 runs (curves: prediction, actual value)

Obviously, this problem is a lot harder for the network to solve. After 400 runs, the performance is not good yet, and even after 1200 runs there is a noticeable difference between the prediction and the actual value of the sine function.
8.3 Overtraining

An interesting phenomenon is that of overtraining. So far, the only measure of performance has been the SSE on the training set, on which the two suggested stopping criteria were based. In this section, we abandon the SSE-approach because we are interested in the performance on sets larger than just the training set. SSE stopping criteria combined with this new objective of performance on larger sets can lead to overtraining.

We give an example. We trained two networks on:

$$S_T = \{(0, 0),\ (1, 1),\ (1.5, 2.25),\ (2, 4),\ (7, 49)\}$$

Here are the training results:

Figure 27: Network A predicting the parabola (curves: prediction, actual value)

Figure 28: Network B predicting the parabola (curves: prediction, actual value)

The question is which of the above networks functions best. With the SSE on $S_T$ in mind, the answer is obvious: network B has a very small SSE on the training set. But we mentioned before that we wanted the network to perform on a wider set. So maybe we should prefer network A after all. In fact, network B is just a longer-trained version of network A. We call network B overtrained.

Using the discussed methods of stopping training can lead to situations like this, so these criteria might not be satisfactory.

8.4 Some new criteria for stopping training

We are looking for a criterion to stop training which avoids the illustrated problems. But the SSE is the only measure of performance we have so far. We will therefore keep the SSE, but evaluate it on a different set.

As we are interested in the performance of the net on a wider set than just $S_T$, we introduce a reference set $S_R$ with input-output elements that are not in $S_T$ but represent the area on which we want the network to perform well. Now we define the performance of the net as the SSE on $S_R$. When we start training a net with $S_T$, the SSE on $S_R$ is likely to decrease, due to the generalizing capabilities of neural networks. As soon as the network becomes overtrained, the SSE on $S_R$ increases. Now we can use the stopping criteria from subsection 7.2 with the SSE on $S_R$.

We illustrate this technique in the case of the previous subsection. We define:

$$S_R = \{(2.5, 6.25),\ (3, 9),\ (4, 16),\ (5, 25),\ (6, 36)\}$$

and we calculate the SSE on both $S_T$ and $S_R$.

Figure 29: The SSE on the training set and the reference set (curves: SSE on $S_T$, SSE on $S_R$)

Using the old stopping criteria would obviously lead to network B. A stopping criterion that terminates training somewhere close to the minimum of the SSE on $S_R$ would lead to network A.
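The difference between the two measures can be made concrete with a small Matlab sketch. It uses the sets $S_T$ and $S_R$ from the text, but `predA` and `predB` are stand-in functions chosen for illustration, not the paper's networks A and B: `predA` is smooth but slightly off everywhere, while `predB` reproduces the training pairs exactly by joining them with straight lines.

```matlab
% Training set S_T and reference set S_R from the text (columns: x, x^2).
ST = [0 0; 1 1; 1.5 2.25; 2 4; 7 49];
SR = [2.5 6.25; 3 9; 4 16; 5 25; 6 36];

% Stand-ins for two trained nets (assumptions, not the paper's networks).
predA = @(x) 0.95 * x.^2;
predB = @(x) interp1(ST(:,1), ST(:,2), x, 'linear');

% SSE of a predictor f on a set S.
sse = @(f, S) sum((S(:,2) - f(S(:,1))).^2);

sseT = [sse(predA, ST), sse(predB, ST)];   % performance on S_T
sseR = [sse(predA, SR), sse(predB, SR)];   % performance on S_R
```

Judged by `sseT`, the exact interpolant wins; judged by `sseR`, it loses clearly. This is the same reversal as between networks B and A.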
In this case, the overtraining is caused by a bad training set $S_T$. Four of its five training pairs lie in the $[0, 2]$ interval and one lies quite far from that interval. Training the net on $S_R$ would have given a much better result. What we wanted to show, however, is what happens if we keep training too long on a too limited training set: the net does indeed memorize the entries of the training set, but its performance in the neighbourhood of this training set gets worse after longer training.

8.5 Evaluating the curve fitting results

In the last few sections, we have not been interested in the individual neurons. Instead, we just looked at the entire network and its performance. We did this because we wanted the network to solve the problem. The strong feature of neural networks is that we do not have to divide the problem into subproblems for the individual neurons.

It can be interesting, though, to look back. We will now analyze the role of every neuron in the two trained curve-fitting networks.

We start with the 5 hidden neurons. Their output is the tanh of their activation value:

$$i_k = \tanh(w^h_k x + \theta^h_k)$$

The output neuron takes a linear combination of these 5 tanh-curves:

$$o = \sum_{l=1}^{5} w^o_l i_l + \theta^o = \sum_{l=1}^{5} w^o_l \tanh(w^h_l x + \theta^h_l) + \theta^o \qquad (23)$$

So the network is trying to fit 5 tanh-curves to the target curve as accurately as possible. We can plot the 5 curves for both the fitted parabola (figure 30) and the sine (figure 31).

For the parabola, only one neuron has a non-trivial output. The other four are more or less constant, a role the output bias $\theta^o$ could have fulfilled easily. This leads us to assume that the parabola could have been fitted by a (1,1,1) network.

The sine function is more complex. Fitting a non-monotonic function with monotonic functions obviously takes more functions. Neuron 2 has a strongly increasing function as output. Because of the symmetry of the sine, we would expect another neuron to have an equally decreasing output function.
It appears that this task has been divided between neurons 3 and 4: they both have a decreasing output function and they would probably add up to the symmetric function we expected. The two other neurons have a more or less constant value. For the same reasons as we mentioned with the parabola, we might expect that this problem could have been solved by a (1,3,1) or even a (1,2,1) network.

Figure 30: The tanh-curves used to fit the parabola (curves labelled 1 to 6)

Figure 31: The tanh-curves used to fit the sine (curves labelled 1 to 6)

Analyzing the output of the neurons after training can give a good idea of the minimum size of the network required to solve the problem. And we saw in section 4 that over-dimensioned networks can lose their generalizing capabilities fast. Analyzing the neurons could lead to removing neurons from the network and improving its generalizing capabilities.

There is another interesting evaluation method. We could replace the outputs of the hidden neurons with their Taylor polynomials. This would lead to a polynomial as output function. The question is whether this polynomial would be identical to the Taylor polynomial of the required output function. Since the two functions nearly coincide on an interval, the polynomials would probably agree in their first few coefficients. This could lead to a theory on how big a network needs to be in order to fit a function with a given Taylor polynomial. But this would take further research.
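The per-neuron analysis of this subsection can be sketched in Matlab as follows. All weights below are hypothetical values picked to mimic the parabola case (one active neuron, four nearly constant ones); they are not the trained weights from the paper, and the 0.05 threshold is an arbitrary choice.

```matlab
% Hypothetical weights of a trained (1,5,1) net (illustrative values only).
wh     = [ 1.2;  0.01; -0.02;  0.005; 0.0];   % input-to-hidden weights
thetah = [-0.3;  0.8;  -1.1;   0.4;   2.0];   % hidden biases
wo     = [ 1.5   0.3   -0.2    0.4    0.1];   % hidden-to-output weights
thetao = -0.1;                                % output bias

x = linspace(0, 1, 11);                       % inputs, 1-by-11

% Row l holds the contribution wo(l)*tanh(wh(l)*x + thetah(l)) of neuron l.
hidden  = tanh(wh * x + repmat(thetah, 1, numel(x)));    % 5-by-11
contrib = repmat(wo(:), 1, numel(x)) .* hidden;          % 5-by-11

% Network output: sum of the five curves plus the output bias.
o = sum(contrib, 1) + thetao;

% A neuron whose curve barely varies over the input range acts like a bias.
spread       = max(contrib, [], 2) - min(contrib, [], 2);
nearConstant = spread < 0.05;                 % threshold is an arbitrary choice
```

Neurons flagged in `nearConstant` are candidates for removal, since their contribution could be absorbed into the output bias.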
9 Application III: Time Series Forecasting

In the previous section, we trained a neural network to adopt the input-output relation of two familiar functions. We used training pairs $(x, f(x))$. And although performance was acceptable after small numbers of training runs, this application had one shortcoming: it did not extrapolate at all. Neural networks will in general perform weakly outside their training set, but a smart choice of inputs and outputs can overcome these limits.

In this section, we will look at time series. A time series is a vector $\vec{y}_t$ of values measured at times $t_i$, with fixed distances between subsequent $t_i$. Examples of time series are daily stock prices, rainfall in the last twenty years and in fact every quantity measured over discrete time intervals. Predicting a future value of $y$, say $y_t$, is now done based on, for instance, $y_{t-1}, \ldots, y_{t-n}$, but not on $t$. In this application we will take $n = 2$ and try to train a network to give valuable predictions of $y_t$.

Figure 32: The network used for TSF (inputs $y_{t-2}$ and $y_{t-1}$, output $y_t$)

We take a network similar to the network we used in the previous section, only now with 2 input neurons. The 5 hidden neurons still have sigmoidal activation functions and the output neuron has a linear activation function.

9.1 Results

Of course we can look at any function $f(x)$ as a time series. We associate with every entry $t_i$ of a vector $\vec{t}$ the value $f(t_i)$. We will first try to train the network on the sine function again. We take $\vec{t} = \{0, 0.1, 0.2, \ldots, 6.3\}$ and $y_t = \sin(t)$.

Training this network enables us to predict the sine of $t$ given the sines of the two previous values of $t$: $t - 0.1$ and $t - 0.2$. But we could also predict the sine of $t$ based on the sines of $t - 0.3$ and $t - 0.2$: these two values give us a prediction of $\sin(t - 0.1)$ and thus we can predict $\sin(t)$. Of course, basing a prediction on a prediction is less accurate than a prediction based on two actual sine values. The results of the network are plotted in figure 33.
Figure 33: The network's performance after 400 training runs (curves: predicting 3 deep, predicting 2 deep, predicting 1 deep, actual value)

Because we trained the net to predict based on previous behaviour, this network will extrapolate, since the sine curve's behaviour is periodic.
Figure 34: This network does extrapolate (curves: prediction, actual value)
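The construction of the training pairs and the idea of predicting "2 deep" can be sketched in Matlab. Instead of a trained network, the sketch uses a stand-in predictor based on the identity $\sin(t) = 2\cos(h)\sin(t-h) - \sin(t-2h)$, which holds for any equally spaced samples. This is an assumption made for illustration, not the paper's network. It does show that a fixed mapping from the two previous values to the next one exists and is the same everywhere on the curve, which is why this choice of inputs can extrapolate.

```matlab
% Sampled sine as a time series, as in the text.
t = 0:0.1:6.3;
y = sin(t);

% Training pairs: inputs (y_{t-2}, y_{t-1}), target y_t.
X = [y(1:end-2); y(2:end-1)]';    % one row per pair
T = y(3:end)';

% Stand-in for the trained net (an assumption, not the paper's network):
% for equally spaced samples, sin(t) = 2*cos(h)*sin(t-h) - sin(t-2h).
h       = 0.1;
predict = @(y2, y1) 2*cos(h)*y1 - y2;

% "1 deep": predict y_t from two actual values.
p1 = predict(X(:,1), X(:,2));

% "2 deep": first predict y_{t-1} from the actual y_{t-3} and y_{t-2},
% then feed that prediction back in together with the actual y_{t-2}.
yPrev = predict(y(1:end-3), y(2:end-2));   % predicted y_{t-1}
p2    = predict(y(2:end-2), yPrev);        % predicted y_t, from the 4th sample on
```

With an exact predictor both depths reproduce the sine up to rounding. A trained network only approximates this mapping, so its error grows with the prediction depth, as the text notes.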
Conclusions

In this paper we introduced a technique that in theory should lead to good training behaviour for three-layered neural networks. In order to achieve this behaviour, a number of important choices have to be made:

  1. the choice of the learning parameter $\eta$
  2. the choice of the training set $S_T$
  3. the configuration of the network
  4. the choice of a stopping criterion

In application I, we focussed on measuring the SSE and saw that its behaviour was strongly dependent on the choice of $\eta$. A small $\eta$ leads to smooth and convergent SSE-curves and therefore to satisfying training results. In our example, $\eta = 0.2$ was small enough, but the maximum usable value of $\eta$ may vary. If an SSE curve is not convergent or is not smooth, one should always try a smaller $\eta$.

Also, choosing $S_T$ is crucial. In application II we saw that with a non-representative training set, a trained network will not generalize well. And if you are not only interested in performance on $S_T$, just getting the SSE small is not enough. The reference-set-SSE method is a good way to reach a compromise: acceptable performance on $S_T$ combined with a reasonable performance on its neighbourhood.

Neural networks seem to be a useful technique to learn the relation between data sets in cases where we have no knowledge of what the characteristics of the relation will be. The parameters determining the network's success are not always clear, but there are enough techniques to make these choices.
A Source of the used M-files

A.1 Associative memory: assostore.m, assorecall.m

    function w = assostore(S)
    % ASSOSTORE(S) has as output the synaptic strength
    % matrix w for the associative memory with contents
    % the row vectors of S.
    [p,N] = size(S);
    w = zeros(N);
    S = 2*S - 1;                  % map {0,1} patterns to {-1,+1}
    for i = 1:N
      for j = 1:N
        w(i,j) = S(1:p,i)' * S(1:p,j);
      end
    end
    w = w/N;

    function s = assorecall(sigma,w)
    % ASSORECALL(sigma,w) returns the closest contents of
    % memory w, stored by ASSOSTORE.
    N = size(w,1);
    s = zeros(1,N);
    sigma = 2*sigma - 1;          % map {0,1} input to {-1,+1}
    s = w * sigma';
    s = sign(s);
    s = ((s + 1)/2)';             % map back to {0,1}, as a row vector

A.2 An example session

    < M A T L A B (R) >
    (c) Copyright 1984-94 The MathWorks, Inc.
    All Rights Reserved
    Version 4.2c Dec 31 1994

    >> S = [1,1,1,1,0,0,0,0; 0,0,0,0,1,1,1,1]
    S =
         1 1 1 1 0 0 0 0
         0 0 0 0 1 1 1 1
    >> w = assostore(S);
    >> assorecall([1,1,0,0,0,0,0,0],w)
    ans =
         1 1 1 1 0 0 0 0
    >> assorecall([0,0,0,0,0,0,1,1],w)
    ans =
         0 0 0 0 1 1 1 1
A.3 BPN: train221.m

    function [Wh,Wo,E] = train221(Wh,Wo,x,y,eta)
    % TRAIN221 trains a (2,2,1) neural net with sigmoidal
    % activation functions. It updates the weights Wh
    % and Wo for input x and expected output y. eta is
    % the learning parameter.
    % Returns the updated matrices and the error E
    %
    % Usage: [Wh,Wo,E] = train221(Wh,Wo,x,y,eta)

    %% Computing the network's output %%
    hi = Wh * [x(:); 1];          % hidden activations (last column of Wh: bias)
    i  = tanh(hi);
    ho = Wo * [i; 1];             % output activation (last entry of Wo: bias)
    o  = tanh(ho);
    E  = y - o;

    %% Back Propagation %%
    % Computing deltas (before any weight is changed)
    deltao  = (1 - o^2) * E;
    deltah1 = (1 - (i(1))^2) * deltao * Wo(1);
    deltah2 = (1 - (i(2))^2) * deltao * Wo(2);

    % Updating output-layer weights.
    % With E = y - o, the descent direction on E^2/2 is +delta*input.
    Wo(1) = Wo(1) + eta * deltao * i(1);
    Wo(2) = Wo(2) + eta * deltao * i(2);
    Wo(3) = Wo(3) + eta * deltao;

    % Updating hidden-layer weights
    Wh(1,1) = Wh(1,1) + eta * deltah1 * x(1);
    Wh(1,2) = Wh(1,2) + eta * deltah1 * x(2);
    Wh(1,3) = Wh(1,3) + eta * deltah1;
    Wh(2,1) = Wh(2,1) + eta * deltah2 * x(1);
    Wh(2,2) = Wh(2,2) + eta * deltah2 * x(2);
    Wh(2,3) = Wh(2,3) + eta * deltah2;
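The m-file that feeds the training set to the algorithm and stores the error terms (section 7.1) is not listed in this appendix. The following self-contained script is a guess at what such a driver could look like, not the author's file. It repeats the (2,2,1) forward pass and weight step inline, written as a descent step on $(y - o)^2/2$, so that it runs on its own. The initial weights, the number of laps and all variable names are arbitrary choices. Whether and how fast the SSE falls depends on the initial weights and on `eta`.

```matlab
% XOR training set S_T: columns are x1, x2, expected output y.
ST    = [0 0 0; 0 1 1; 1 0 1; 1 1 0];
eta   = 0.2;
nLaps = 400;

% Fixed small initial weights (arbitrary choice, for reproducibility).
Wh = [0.5 -0.4 0.1; -0.3 0.6 -0.2];   % 2 hidden neurons x (2 inputs + bias)
Wo = [0.4 -0.5 0.2];                  % 1 output neuron x (2 hidden + bias)

Ehist = zeros(nLaps, 4);              % squared error per element, per lap
for lap = 1:nLaps
    for k = 1:4
        x = ST(k,1:2)';
        y = ST(k,3);
        % Forward pass.
        hid = tanh(Wh * [x; 1]);
        o   = tanh(Wo * [hid; 1]);
        % Deltas (computed before any weight is changed).
        deltao = (1 - o^2) * (y - o);
        deltah = (1 - hid.^2) .* (deltao * Wo(1:2)');
        % Gradient-descent step on (y - o)^2 / 2.
        Wo = Wo + eta * deltao * [hid; 1]';
        Wh = Wh + eta * deltah * [x; 1]';
    end
    % Record the errors on all four elements after this lap.
    for k = 1:4
        ok = tanh(Wo * [tanh(Wh * [ST(k,1:2)'; 1]); 1]);
        Ehist(lap,k) = (ST(k,3) - ok)^2;
    end
end
SSE = sum(Ehist, 2);                  % one SSE value per lap
```

Plotting `SSE` and the columns of `Ehist` against the lap number gives curves of the kind discussed in section 7.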
Bibliography

[Freeman] James A. Freeman and David M. Skapura, Neural Networks, Algorithms, Applications and Programming Techniques, Addison-Wesley, 1991.

[Müller] B. Müller, J. Reinhardt, M.T. Strickland, Neural Networks, An Introduction, Berlin, Springer Verlag, 1995.

[Nørgaard] Magnus Nørgaard, The NNSYSID Toolbox, http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html