# A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function


Presentation at IWANN 2003

• Thank you very much. I am going to present a new learning method for single-layer neural networks based on a regularized cost function.
• Let me first outline the main points of this presentation. I will start with a brief introduction to single-layer neural networks. Next, I will explain supervised learning with regularization in this kind of network and show an alternative loss function that allows an analytical solution to be obtained. Finally, I will present the experimental results, the conclusions, and future work.
• Our supervised learning algorithm is applied to a single-layer neural network with I inputs and J outputs. To train the network we have S examples. Generally, the activation functions used are non-linear. Finally, note that in this kind of network the outputs are independent of one another, because the set of weights associated with each output is independent of the others.
• In order to simplify the explanation, we will work with a single output. The real outputs of the network are obtained through a non-linear function whose argument is the sum of the inputs multiplied by the weights, plus the bias. If the error function used is the MSE, as in our case, the goal is to obtain the values of the weights and the bias that minimize the MSE between the real and the desired outputs.
• Adding a regularization term, our goal is to minimize a cost function with two terms. The first term is the loss function, here the MSE: the squared difference between the desired output and the real output. The second term is the regularization term, weighted by the regularization parameter alpha; in our case it is weight decay, which tries to smooth the obtained curve. To minimize this cost function, we can differentiate both terms with respect to the weights and the bias and equate the derivatives to zero. The problem is that, in the first term, the weights are inside the non-linear function, so a unique minimum is not guaranteed, and the minima cannot be obtained with an analytical method, only with an iterative one.
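In symbols, the cost just described can be sketched for a single output as follows (notation assumed from the slides: inputs x_is, desired outputs d_s, weights w_i, bias b, S samples):

```latex
C(\mathbf{w}, b) \;=\; \sum_{s=1}^{S}\Big(d_s - f\big(\textstyle\sum_{i=1}^{I} w_i x_{is} + b\big)\Big)^2 \;+\; \alpha \sum_{i=1}^{I} w_i^2
```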
• In order to solve this problem, we present an alternative loss function that is based on the following theorem.
• Roughly speaking, the idea is that minimizing the error difference at the output is equivalent to minimizing the error difference before the non-linear function, weighted by a certain factor.
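Paraphrasing the theorem in this notation (a sketch of our reading; the exact statement appears on the theorem slide), with z_s = Σ_i w_i x_is + b and d̄_s = f⁻¹(d_s):

```latex
\min_{\mathbf{w},\,b} \sum_{s=1}^{S}\big(d_s - f(z_s)\big)^2 \;\approx\; \min_{\mathbf{w},\,b} \sum_{s=1}^{S} f'(\bar d_s)^2\,\big(\bar d_s - z_s\big)^2
```

up to the first order of the Taylor series expansion; the factor f′(d̄_s)² is the weighting just mentioned.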
• Applying the theorem, we obtain a new cost function in which the alternative loss function is the MSE measured before the non-linear functions, while the regularization term is the same as in the previous cost function. Notice that now the weights and the bias are outside the non-linear function.
• To minimize the new cost function, we differentiate both terms with respect to the weights and the bias and equate the partial derivatives to zero, obtaining the equations shown on the slide.
• We can rewrite the previous system to obtain a new system of (I+1)×(I+1) linear equations, whose variables are the weights and the bias, with the corresponding coefficients and independent terms. We can therefore use an analytical method to solve this system, obtaining the optimal weights and bias. This means that training is very fast, with a low computational cost. Moreover, the system has a unique minimum, except for degenerate systems. Finally, we can perform incremental learning, and even parallel learning, where the training process is divided among several distributed neural networks and the results are merged to obtain the global training. In both cases, only the coefficient matrix and the independent-terms vector must be stored.
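A minimal sketch of this analytical step for one output, assuming the per-sample weighting factor f′(d̄_s)² from the alternative loss and a bias excluded from weight decay (both our reading; names are illustrative, not the paper's code):

```python
import numpy as np

def train_one_output(X, d, alpha, f_inv, f_prime):
    """Solve the (I+1)x(I+1) linear system for one output neuron.

    X: (S, I) inputs; d: (S,) desired outputs inside the range of f;
    alpha: regularization parameter; f_inv, f_prime: inverse and
    derivative of the non-linear neural function.
    """
    S, I = X.shape
    Xb = np.hstack([X, np.ones((S, 1))])   # inputs plus a bias column
    d_bar = f_inv(d)                       # desired values before the non-linearity
    w2 = f_prime(d_bar) ** 2               # per-sample weighting factor
    A = Xb.T @ (w2[:, None] * Xb)          # coefficient matrix
    A[:I, :I] += alpha * np.eye(I)         # weight decay (bias unpenalized, an assumption)
    c = Xb.T @ (w2 * d_bar)                # independent terms
    p = np.linalg.solve(A, c)              # unique solution for non-degenerate systems
    return p[:I], p[I]                     # optimal weights and bias
```

With the logistic function, f_inv is the logit and the derivative is f(z)(1 − f(z)).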
• In order to test our algorithm, we have applied it to a classification problem and to a regression problem. In both cases we have used the logistic function as the neural function, and the parameter alpha has been constrained to the interval [0, 1].
• The first problem, a classification one, has been extracted from the KDD Cup 99 competition. Each sample summarizes a connection between two hosts and is formed by 41 inputs. The goal is to classify each sample into one of two classes: attack or normal connection. We have 30,000 samples for training and almost 5,000 for testing.
• In order to study the influence of the training set size and the regularization parameter, we have generated several training sets. We start from an initial training set of 100 samples; each new training set is formed by adding 100 new samples to the previous one, up to 2,500 samples, giving 25 training sets. For each training set, several neural networks have been trained with different alphas, from 0 to 1 in steps of 0.005. This whole process has been repeated 12 times to obtain a better estimate of the true error. Finally, the regularization parameter that provides the minimum test classification error is chosen.
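The selection step above can be sketched as a simple grid search; here `train_fn` and `test_error_fn` are placeholders standing in for the training and evaluation procedures just described:

```python
import numpy as np

def choose_alpha(train_fn, test_error_fn,
                 alphas=np.arange(0.0, 1.0001, 0.005)):
    """Return the alpha (and its error) with minimum test error.

    train_fn(alpha) trains a model; test_error_fn(model) evaluates it.
    """
    errors = [test_error_fn(train_fn(a)) for a in alphas]
    best = int(np.argmin(errors))
    return float(alphas[best]), float(errors[best])
```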
• As can be seen in the figure, using regularization produces better results than not using it in all cases, mainly for small training set sizes. To check that this difference is really statistically significant, we have applied a statistical test, which confirms it. Moreover, only 400 samples are needed for the error to stabilize when using regularization, while 700 samples are needed without it.
• The other problem is a regression one, specifically the Box-Jenkins problem. It consists of estimating the concentration of CO2 in a gas furnace at a given time instant from the 4 previous concentrations and the 6 previous methane flow rates.
• As in the previous problem, we have generated several training set sizes. Initially we performed a 10-fold cross-validation, using 261 samples for training and 29 for testing. In each validation round, several training sets have been generated, from 9 to 261 examples in steps of 9 samples, using the same process as in the intrusion detection problem. Then, for each training set, several neural networks have been trained and tested, varying alpha from 0 to 1 in steps of 0.001. To obtain a better estimate of the true error, mainly with small training sets, we have repeated the previous process 10 times. Finally, the alpha that produces the minimum normalized MSE has been chosen.
• The results are shown in the figure. Although it may seem that using regularization is worse than not using it, statistically there is no difference, except for small training sets.
• In this case the neural network performs very well, and using regularization does not enhance the results. In order to check the generalization capability of regularization in the presence of noisy data, we have added two kinds of normal random noise: one with a standard deviation equal to half that of the original time series (gamma = 0.5), and the other with the same standard deviation (gamma = 1).
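The noise injection just described can be sketched as follows (function name and seeding are illustrative, not the paper's code):

```python
import numpy as np

def add_noise(series, gamma, seed=None):
    """Add zero-mean normal noise whose standard deviation is gamma
    times the standard deviation of the original series."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, gamma * np.std(series), size=len(series))
    return series + noise
```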
• Here we show these results together with the previous ones. As we can see, using regularization with noisy data improves the results; in fact, in both cases there is a statistically significant difference between using regularization and not using it, although in the case of gamma = 0.5 this difference only exists up to a training set size of 225 samples. If we look for the smallest training set from which the error stabilizes: with gamma = 0.5 this size is 198, whether regularization is used or not; but with gamma = 1, that is, with noisier data, it is 189 with regularization and 207 without it.
• To conclude, we have proposed a new supervised learning method for single-layer neural networks using regularization. Among its features, we can highlight that it obtains the global optimum analytically and, hence, faster than the current iterative methods; it allows incremental and distributed learning; and, thanks to the regularization term, it provides a better generalization capability, mainly with small training sets or noisy data. We have applied it to two kinds of problems, a classification problem and a regression problem, generally obtaining better results. As future work, an analytical method to obtain the regularization parameter is being analyzed.
• Thank you very much
### A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function

1. A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function. Juan A. Suárez-Romero, Óscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos. Laboratory for Research and Development in Artificial Intelligence, Department of Computer Science, University of A Coruña, Spain
2. Outline
   - Introduction
   - Supervised learning + regularization
   - Alternative loss function
   - Experimental results
   - Conclusions and Future Work
3. Single layer neural network
   - I inputs
   - J outputs
   - S samples
4. Single layer neural network
5. Cost function
   - Supervised learning + regularization: MSE + regularization term (weight decay)
   - Non-linear neural functions → not guaranteed to have a unique minimum (local minima)
6. Alternative loss function
   - Theorem: Let x_js be the j-th input of a one-layer neural network, d_js and y_js the j-th desired and actual outputs, w_ij and b_j the weights, and f, f⁻¹, f′ the nonlinear function, its inverse and its derivative. Then minimizing L_j is equivalent to minimizing, up to the first order of the Taylor series expansion, the alternative loss function shown on the slide.
7. Alternative loss function
8. Alternative cost function
   - Supervised learning + regularization: alternative MSE + regularization term (weight decay)
9. Alternative cost function
   - Optimal weights and bias can be obtained by differentiating it with respect to the weights and the bias of the network and equating the partial derivatives to zero
10. Alternative cost function
    - We can rewrite the previous system to obtain a system of (I+1)×(I+1) linear equations (variables: weights and bias; coefficients; independent terms)
    - Advantages:
      - Solved using a system of linear equations → fast training with low computational cost
      - Convex function → unique minimum
      - Incremental + parallel learning → only the coefficient matrix and the independent-terms vector must be stored
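The storage property above (only the coefficient matrix and independent-terms vector are kept) is what enables incremental and parallel learning. A minimal sketch, assuming per-sample weighting factors of one and an unpenalized bias (both illustrative choices, not the paper's code):

```python
import numpy as np

def partial_terms(X, d_bar, weights=None):
    """Sums computed locally on one chunk of data (one distributed node).

    X: (S, I) inputs; d_bar: (S,) desired values before the non-linearity;
    weights: optional per-sample factors (ones here for simplicity).
    """
    S = X.shape[0]
    Xb = np.hstack([X, np.ones((S, 1))])   # inputs plus a bias column
    w2 = np.ones(S) if weights is None else weights
    A = Xb.T @ (w2[:, None] * Xb)          # partial coefficient matrix
    c = Xb.T @ (w2 * d_bar)                # partial independent terms
    return A, c

def merge_and_solve(parts, alpha, n_inputs):
    """Add up every node's (A, c) pair and solve the merged system once."""
    A = sum(p[0] for p in parts)
    c = sum(p[1] for p in parts)
    A = A + alpha * np.diag([1.0] * n_inputs + [0.0])  # weight decay; bias unpenalized (assumption)
    return np.linalg.solve(A, c)           # weights followed by the bias
```

Because the per-chunk sums are additive, training on two chunks and merging gives exactly the same solution as training on the full data set.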
11. Experimental results
    - Two kinds of problems:
      - Intrusion Detection (classification problem)
      - Box-Jenkins time series (regression problem)
12. Intrusion Detection problem
    - KDD'99 Classifier Learning Contest
    - Two-class classification problem: attack and normal connections
    - Each sample is formed by 41 high-level features
    - 30000 samples for training
    - 4996 samples for testing
13. Intrusion Detection problem
    - In order to study the influence of training set size and regularization parameter:
      - Initial training set of 100 samples
      - The next training set is obtained by adding 100 new samples to the previous set, up to 2500 samples
      - For each training set, several neural networks have been trained, with α from 0 (no regularization) to 1, in steps of 5×10⁻³
    - In order to obtain a better estimate of the true error: repeat this process 12 times with different training sets
    - The α with minimum test classification error is chosen
14. Intrusion Detection problem (figure: test error vs. training set size; the error stabilizes at 400 samples with regularization and at 700 without)
15. Box-Jenkins problem
    - Regression problem
    - Estimate CO₂ concentration in a gas furnace from the methane flow rate
    - Predict y(t) from {y(t-1), y(t-2), y(t-3), y(t-4), u(t-1), u(t-2), u(t-3), u(t-4), u(t-5), u(t-6)}
    - 290 samples
16. Box-Jenkins problem
    - In order to study the influence of training set size and regularization parameter:
      - 10-fold cross-validation (261 examples for training and 29 for testing)
      - For each validation round, generate several training sets, from 9 to 261 examples, in steps of 9 examples
      - For each of these data sets, train and test several neural networks varying α from 0 (no regularization) to 1 in steps of 10⁻³
    - In order to obtain a better estimate of the true error, mainly with small training sets: repeat the validation 10 times with different compositions of the training sets
    - The α with minimum NMSE is chosen
17. Box-Jenkins problem
18. Box-Jenkins problem
    - There is no difference when using regularization (except for small training sets)
    - The neural network performs well, and using regularization does not enhance the results
    - Add normal random noise with σ = γσ_t, where σ_t is the standard deviation of the original time series, and γ ∈ {0.5, 1}
19. Box-Jenkins problem (figure: with noise, the error stabilizes at 198 samples for γ = 0.5, and at 189 with regularization vs. 207 without for γ = 1)
20. Conclusions and Future Work
    - A new supervised learning method for single layer neural networks using regularization has been introduced:
      - Global optimum
      - Fast training
      - Incremental and parallel learning
      - Better generalization capability
    - Applied to two problems: classification and regression
      - Regularization generally obtains a better solution, mainly with small training sets or noisy data
    - As future work, an analytical method to obtain the regularization parameter is being analyzed
21. A New Learning Method for Single Layer Neural Networks Based on a Regularized Cost Function. Juan A. Suárez-Romero, Óscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos. Laboratory for Research and Development in Artificial Intelligence, Department of Computer Science, University of A Coruña, Spain. Thank you for your attention!