International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.8, No.1, February 2019
DOI: 10.5121/ijscai.2019.8103
GRADIENT OMISSIVE DESCENT
IS A MINIMIZATION ALGORITHM
Gustavo A. Lado and Enrique C. Segura
Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina
ABSTRACT
This article presents a promising new gradient-based backpropagation algorithm for multi-layer
feedforward networks. The method requires no manual selection of global hyperparameters and is capable
of dynamic local adaptations using only first-order information at a low computational cost. Its semi-
stochastic nature makes it fit for mini-batch training and robust to different architecture choices and data
distributions. Experimental evidence shows that the proposed algorithm improves training in terms of both
convergence rate and speed as compared with other well known techniques.
KEYWORDS
Neural Networks, Backpropagation, Cost Function Minimization
1. INTRODUCTION
1.1. MOTIVATION
When minimizing a highly non-convex cost function by means of some gradient descent strategy,
several problems may arise. Considering a single direction in the search space, there may appear
local minima [5], where it is necessary first to climb up in order to have a chance of reaching a
better or global minimum, or plateaus [16], where partial derivatives have very small values even
when far from a minimum. Of course, in a multidimensional space these problems can combine to
produce others, such as partial minima (ravines) [1], where for some direction (subspace) there is
a local minimum and for another there is a plateau, and saddle points [3], where partial derivatives
are small because a partial minimum appears in one direction and a maximum in another. Even
considering a single search direction, it is easy to see that an algorithm using only gradient
information may struggle to find a good minimum.
1.2. BACKGROUND
Approaches such as Stochastic Gradient Descent (SGD) [15] or Simulated Annealing (SA) [14, 7]
introduce a random factor into the search which permits, to some extent, overcoming the problem of
local minima, whilst other methods such as Gradient Descent with Momentum (GDM) [11] or adaptive
parameters [2] allow the search to be sped up in the case of plateaus. When combined with
incremental or mini-batch training, these techniques can be effective to some degree even with
ravines and saddle points, but they rely on global hyperparameters that are generally
problem-dependent.
Other techniques, such as Conjugate Gradient Descent (CGD) [12] or those which incorporate
second-derivative information, such as the Newton method and the Levenberg-Marquardt algorithm
[17], can speed up the search through a better analysis of the cost function, but they are in
general computationally expensive and complex, which makes their implementation and use more
difficult.
Finally, more recent algorithms such as Quick Back-Propagation (QBP) [17], Resilient Back-
Propagation (RBP) [13] and Adaptive Momentum Estimation (AME) [19, 4] use adaptive parameters that
are independent for each search direction; hence they do not require the selection of global
hyperparameters and can adapt their strategy to cope with different kinds of problems. Even so,
however, they can get stuck in bad solutions (poor minima). In addition, all these techniques
require batch training, which in many cases may be not only inconvenient but impossible or highly
impractical.
The main difficulty in the search for a good minimum (global or not) is that these algorithms try
to make good decisions solely on the basis of local information and values that, in many cases, are
the result of estimations of estimations. In other words, it is difficult to make intelligent
decisions with incomplete and inaccurate information.
Perhaps a better approach is simply to ensure that, by the end of each epoch, the probability that
the cost decreases is greater than the probability that it increases, and that, should it increase,
there is a reasonably high probability of escaping to another basin of the cost surface. Thus,
instead of trying to ensure convergence through analysis of the cost function, the point is to
guarantee a statistical tendency toward a good minimum.
2. THE PROPOSED ALGORITHM
Our method borrows certain ideas from other techniques: from SGD, the introduction of a random
element in the descent; from CGD, the exploration of the cost function, though nonlinear in this
case; and from RBP, the use of the sign of the gradient as the indicator of the search direction,
omitting its magnitude [12, 13, 14, 15, 17]. The main idea is that during training the search moves
across the cost surface with random steps within a search radius, so as to converge statistically
to the minimum; this radius decreases as the solution is approached.
2.1. DESCRIPTION
In each epoch the dataset is divided into mini-batches. A partial estimation of the cost function E
is made for each of those mini-batches, and the gradient is computed by standard error
backpropagation. The weights are then modified by a random magnitude, bounded by a value R, but
using only the direction (sign) of the ΔW previously computed. At the end of each epoch, the cost
is compared to that of earlier epochs and, when appropriate, adjustments are made to R.
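As a concrete reading of this update, the per-layer step could look as follows: a minimal numpy
sketch, not taken from the original implementation; the function and argument names are ours, and
the step is taken along the sign of ΔW, i.e. against the gradient, as described above.

    import numpy as np

    rng = np.random.default_rng()

    def god_step(W, dE_dW, R):
        """One GOD weight update for a single mini-batch.

        W      -- weight array of one layer
        dE_dW  -- gradient of the mini-batch cost E with respect to W
        R      -- current search radius (a scalar, or per-unit values broadcastable to W)
        """
        # Random magnitude in [0, R), scaled by |W|, applied along the direction
        # (sign) of dW = -dE/dW; the magnitude of the gradient is omitted.
        step = rng.uniform(0.0, R, size=W.shape) * np.abs(W)
        return W - step * np.sign(dE_dW)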
If we look at each mini-batch as a sample of the dataset, we can consider each epoch as a Monte
Carlo estimation of the cost function. Similarly, the fact that the search radius gradually
decreases suggests a resemblance to simulated annealing [10]. However, imposing a fixed reduction
of R as a function of the number of epochs or of the cost is impractical due to the variability
between different problems. For this reason a strategy was chosen in which adjustments to R are
applied dynamically, depending on the relative cost [9]. This makes the method more robust, but the
adjustments must be small, on the order of 1%, since by the very nature of the estimation of E,
oscillations may occur.
The basic idea for modifying R can then be expressed as follows: if the cost increased with respect
to the last epoch, reduce the radius of search. This is relatively effective in a variety of
scenarios, but results can still be improved by using local parameters instead of global ones [18].
For example, at the end of each epoch the individual contribution of each unit to the total cost,
i.e. the cost per unit (Eu), can be estimated. Similarly, instead of a global radius of search R,
we can keep an individual radius (Ru) for each unit.
Under these conditions the rule for adjusting the search radii becomes: if the total cost
increased, reduce the search radius for the units whose error increased and increase it for the
others (see Algorithm 1). In this strategy, the radius of search can be seen as the level of
resolution at which the cost function is examined. If the cost cannot decrease any further, the
radius is reduced in order to look at the cost surface in more detail [2]. Increasing the radius
for the remaining units helps to maintain an equilibrium and possibly to escape from a bad
solution.
Algorithm 1. The Gradient Omissive Descent algorithm.
initialize W, R, E, t
while (E > min_error) ∧ (t < max_epoch):
    E[t] ← 0
    for every batch b:
        W ← W + U(0, R)·|W| × sgn(∂E(W, b)/∂W)
        E[t] ← E[t] + E(W, b)
    dE ← E[t] − E[t − 1]
    if dE > 0:
        for every unit u:
            if dEu > 0:
                Ru ← Ru × 0.99
            else:
                Ru ← Ru × 1.01
    t ← t + 1
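As an illustration of Algorithm 1 as a whole, the following is a minimal numpy sketch on a toy
linear-regression problem, in which each output coordinate plays the role of one unit with its own
cost Eu. The toy problem, the variable names and the sizes are ours, not the paper's experimental
setup, and the step is again taken along the sign of ΔW, as described in the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy problem: learn a linear map; each output coordinate is one "unit".
    n_in, n_out, n_samples = 8, 4, 200
    X = rng.uniform(-1.0, 1.0, (n_samples, n_in))
    Y = X @ rng.normal(size=(n_in, n_out))

    W = rng.uniform(-n_in ** -0.5, n_in ** -0.5, (n_in, n_out))  # weight initialization
    R = np.full(n_out, 1.0 / (n_in * n_out))                     # one search radius per unit
    min_error, max_epoch, batch_size = 1e-3, 3000, n_samples // 10

    E_prev = np.full(n_out, np.inf)                  # per-unit cost of the previous epoch
    for t in range(max_epoch):
        E_unit = np.zeros(n_out)                     # accumulated per-unit cost E_u
        for start in range(0, n_samples, batch_size):
            xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
            err = xb @ W - yb                        # forward pass on the mini-batch
            grad = xb.T @ err / len(xb)              # dE/dW estimated on the mini-batch
            # GOD step: random magnitude in [0, R_u), scaled by |W|,
            # along the sign of dW = -dE/dW (the gradient magnitude is omitted).
            W -= rng.uniform(0.0, R, size=W.shape) * np.abs(W) * np.sign(grad)
            E_unit += 0.5 * (err ** 2).mean(axis=0)
        if E_unit.sum() > E_prev.sum():              # total cost increased this epoch
            worse = E_unit > E_prev                  # units whose own cost increased
            R[worse] *= 0.99                         # shrink their search radius
            R[~worse] *= 1.01                        # slightly enlarge the others
        E_prev = E_unit
        if E_unit.mean() < min_error:                # stop condition on the average error
            break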
3. EXPERIMENTS
3.1. METHODS
In all experiments our technique, Gradient Omissive Descent (GOD), was compared with three other
algorithms: SGD, GDM and RBP. These were chosen in order to cover the alternatives of incremental,
mini-batch and batch training, respectively. In future work we will also include other approaches
such as CGD, though the results might not be easy to compare, and AME, for which we do not yet have
a properly tested implementation.
The stop condition in all simulations was either the completion of a maximum number of epochs or
reaching an error below a minimum value. This error was computed as the average of all the errors
over the training dataset. Since data were artificially generated, it was possible to keep the same
dataset size throughout all the tests.
As a measure of the efficacy of the algorithms, and since there exists no unique stop condition,
the quantity (epoch / max_epoch) × (error / min_error) was used, where epoch and error correspond
to final values, and max_epoch and min_error are the values for the stop condition. Thus, lower
values indicate a higher efficacy and, in particular, quantities less than 1 imply that a solution
has been obtained.
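In code, this measure is simply the following; the default values below are the stop condition used
later in Section 3.2, and the example numbers are hypothetical, for illustration only.

    def efficacy(epoch, error, max_epoch=3000, min_error=1e-3):
        """Efficacy measure used in the experiments: lower values are better,
        and values below 1 imply that a solution was reached within the limits."""
        return (epoch / max_epoch) * (error / min_error)

    # A hypothetical run stopping at epoch 1500 with average error 5e-4:
    print(efficacy(1500, 5e-4))   # 0.25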
Experiments were carried out on two kinds of problems typically used to evaluate this class of
algorithms: parity and auto-encoding. In both cases, additional constraints were introduced in
order to have better control and thus to adjust the level of difficulty [8]. Final results were
computed as the average of several simulations under the same constraints but with different
datasets.
3.1.1. PARITY
The parity problem, i.e. multivariate XOR, is defined as the result of verifying whether the number
of 1's in a list of binary variables is even. To generate data for these experiments we use
variables xi ∈ {−1, 1} instead of binary ones, and define the function even(x0, ..., xn) as
indicating whether Σi u(xi) is even, where u(x) = 1 if x = 1 and zero otherwise.
In its simplest form, this problem consists of mapping n inputs into a single output that indicates
the parity. What makes it particularly interesting is that changing the value of just one input
variable changes the output. In our experiments we used 10 input variables [x0..x9] and 10 output
variables [z0..z9], where zi = even(x0..xi); that is, each output corresponds to the parity of a
prefix of the input variables. This makes the problem more difficult, as the single hidden layer
must find a representation that satisfies all outputs simultaneously.
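One possible way to generate this dataset is sketched below in numpy; the ±1 encoding of the
targets and the variable names are our assumptions, not taken from the paper.

    import numpy as np
    from itertools import product

    # All 2^10 input combinations with values in {-1, 1}.
    X = np.array(list(product([-1, 1], repeat=10)))

    # u(x) = 1 if x == 1, else 0; z_i indicates the parity of x_0..x_i, encoded
    # here as 1 when the number of 1's in the prefix is even and -1 otherwise.
    counts = np.cumsum((X == 1).astype(int), axis=1)   # number of 1's in each prefix x_0..x_i
    Z = np.where(counts % 2 == 0, 1, -1)               # one parity target per prefix

    print(X.shape, Z.shape)   # (1024, 10) (1024, 10)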
3.1.2. AUTO-ENCODING WITH DEPENDENT VARIABLES
One of the critical factors in the efficacy of auto-encoding is how independent the data variables
are from each other. If the dataset is built from completely independent variables of the same
variance, it is not possible to make an effective reduction of dimensionality [6].
For this reason, the datasets for these experiments were constructed with different proportions of
independent and dependent variables. Independent variables were drawn from a uniform distribution
between −1 and 1, and the dependent ones were random linear combinations of the former. Final
values result from applying the sign function, so that also in this case all values lie in {−1, 1}.
However, in real situations data are often not equally correlated with each other. This was also
taken into account by splitting the dataset into several classes. Each class not only uses
different correlations for the dependent variables, but also assigns different positions to the
independent ones.
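A sketch of how such a dataset could be constructed follows; the function name, the per-class
sample count and the way positions are shuffled are our assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_class(n_samples, n_vars, prop_independent, rng):
        """One class: a fraction of independent variables drawn uniformly from
        [-1, 1], the rest being random linear combinations of them; final values
        are passed through sign so everything lies in {-1, 1}, and the variables
        are placed at class-specific positions."""
        n_ind = int(round(prop_independent * n_vars))
        ind = rng.uniform(-1.0, 1.0, (n_samples, n_ind))
        mix = rng.normal(size=(n_ind, n_vars - n_ind))       # class-specific correlations
        data = np.sign(np.concatenate([ind, ind @ mix], axis=1))
        return data[:, rng.permutation(n_vars)]               # class-specific variable positions

    # e.g. 5 classes of 200 samples each, 100 variables, 30% independent
    X = np.concatenate([make_class(200, 100, 0.3, rng) for _ in range(5)])
    print(X.shape)   # (1000, 100)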
3.2. RESULTS
This section presents comparisons between the results obtained with the different algorithms under
consideration for the problems described above. For each method, the parameters that produced the
best results were chosen. In all simulations the stop condition was 3000 epochs or an average error
below 10^−3. These settings were intended to ensure that all the algorithms had the possibility to
converge.
The weights were independently initialized at random from a uniform distribution over the interval
(−N^(−1/2), N^(−1/2)), N being the number of neurons in the previous layer. Initial values for R
were computed as 1/(N × M), where N is the number of units in the previous layer and M is the
number of units in the same layer. Each mini-batch, both for GDM and for GOD, consisted of 10% of
the elements of the dataset, chosen at random without repetition.
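In numpy terms this initialization could read as follows; the layer sizes are placeholders taken
from the parity setup below, and the variable names are ours.

    import numpy as np

    rng = np.random.default_rng()

    N, M = 10, 40                  # units in the previous layer and in the current layer
    dataset_size = 1024            # e.g. all 2^10 parity patterns

    W = rng.uniform(-N ** -0.5, N ** -0.5, size=(N, M))   # weights in (-N^(-1/2), N^(-1/2))
    R = np.full(M, 1.0 / (N * M))                          # initial search radius, one per unit
    batch = rng.choice(dataset_size, size=dataset_size // 10, replace=False)  # a 10% mini-batch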
3.2.1. PARITY
For this group of experiments the network had a single hidden layer of 40 units. The training set
included all 2^10 possible input combinations. For SGD the learning rate η was 0.01; for GDM,
η = 0.001 and the momentum factor β = 0.5. For RBP, η+ = 1.2 and η− = 0.5 in all cases, as proposed
in [13].
Figure 1. Training results of several algorithms for multiple parity.
Figure 1 shows the time evolution of the sum of square errors by epoch for different algorithms.
These results correspond to a typical run of training, since the average of several simulations
would not show the differences in behaviour between the methods. Although the inherent
stochasticity of the training process introduces large variations in the descent of each algorithm,
the order of convergence is markedly consistent. It can be noted that the cost decreases much more
rapidly with the proposed method.
3.2.2. AUTO-ENCODING WITH DEPENDENT VARIABLES
A network with 100 input and 100 output neurons was used, while the size of the hidden layer was
varied between 70, 80 and 90 units. The training set had 1000 instances, of which 25% were set
apart for validation. Given the number of simulations under the same conditions, holdout validation
was used, since in these circumstances there is no need for further cross-validation partitions.
The parameters were: for SGD, η = 0.001; for GDM, η = 0.001 and β = 0.5.
Figure 2 shows how the efficacy of the different algorithms varies as the proportion of independent
variables and the number of units in the hidden layer change. The three parts of the figure
correspond to different numbers of hidden units, indicated on the left axis (a smaller value
corresponds to a more difficult problem); the percentages on the horizontal axis show the
proportion of dependent variables (a smaller value also corresponds to a more difficult problem);
and the values on the vertical axes indicate the efficacy of the algorithms, computed as previously
defined and averaged over several runs. In these simulations, 5 classes were included; results for
other numbers of classes are not displayed here since they show a less pronounced variation,
consistent with the problem complexity.
Figure 2. Comparison of the performance of different algorithms
in problems of auto-encoding with correlated data.
It is worth noting not only how the efficacy of the proposed technique varies with problems of
different complexity, but also how the relative behaviour of all the compared methods varies.
Depending on the complexity of the problem, the quality of our solution changes relative to the
other algorithms, but a clear trend toward faster convergence can also be seen.
4. CONCLUSIONS
In the present work an optimization technique was introduced with the ability to find minima or
quasi-minima of cost functions for non-convex, high-dimensional problems. The proposed algorithm
successfully manages some classical difficulties reported in the literature, such as local minima,
plateaus, ravines and saddle points. In our view, the method is an advance over the previous ones
from which it also takes inspiration: it explores the cost function through a semi-random descent,
omitting the magnitude of the gradient.
The proposed algorithm works efficiently on a wide range of problems, featuring a convergence rate
and speed higher than those of other common methods. We also note that in certain cases, for
example when the problem is very simple, less sophisticated algorithms such as SGD can be more
effective, but they in turn have the disadvantage of requiring parameter tuning. Similarly, in
problems which may be intractable, due for example to architecture limitations, our solution is
comparable to RBP but, unlike it, can work with mini-batches.
ACKNOWLEDGEMENTS
The idea for this algorithm was fundamentally conceived while L. G. was working with Professor
Verónica Dahl in Vancouver, BC. Special thanks to her for the invitation and to the Emerging
Leaders in the Americas Program (ELAP) of the Government of Canada for making that visit possible.
REFERENCES
[1] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun. The loss
surfaces of multilayer networks. Artificial Intelligence and Statistics:192-204, 2015.
[2] Christian Darken, Joseph Chang, John Moody. Learning rate schedules for faster stochastic
gradient search. Neural Networks for Signal Processing [1992] II., Proceedings of the 1992 IEEE-
SP Workshop:3-12, 1992.
[3] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua
Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. Advances in neural information processing systems:2933-2941, 2014.
[4] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
[5] Marco Gori, Alberto Tesi. On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.
[6] Geoffrey E. Hinton, Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. science, 313(5786):504-507, 2006.
[7] Scott Kirkpatrick, C. Daniel Gelatt, Mario P. Vecchi, others. Optimization by simulated annealing.
science, 220(4598):671-680, 1983.
[8] Steve Lawrence, C. Lee Giles, Ah Chung Tsoi. What size neural network gives optimal
generalization?: Convergence properties of backpropagation. 1998.
[9] George D. Magoulas, Michael N. Vrahatis, George S. Androulakis. Improving the convergence of
the backpropagation algorithm using learning rate adaptation methods. Neural Computation,
11(7):1769-1796, 1999.
[10] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, Edward
Teller. Equation of state calculations by fast computing machines. The journal of chemical
physics, 21(6):1087-1092, 1953.
[11] Miguel Moreira, Emile Fiesler. Neural networks with adaptive learning rate and momentum terms.
Idiap, 1995.
[12] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised learning.
Neural networks, 6(4):525-533, 1993.
[13] Martin Riedmiller, Heinrich Braun. A direct adaptive method for faster backpropagation learning:
The RPROP algorithm. Neural Networks, 1993., IEEE International Conference on:586-591,
1993.
[14] Herbert Robbins, Sutton Monro. A stochastic approximation method. The annals of mathematical
statistics:400-407, 1951.
[15] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. Learning representations by back-
propagating errors. nature, 323(6088):533, 1986.
[16] Richard S. Sutton. Two problems with backpropagation and other steepest-descent learning
procedures for networks. Proc. 8th annual conf. cognitive science society:823-831, 1986.
[17] Bogdan M. Wilamowski. Neural network architectures and learning. Industrial Technology, 2003
IEEE International Conference on, 1:-1, 2003.
[18] Z. Zainuddin, N. Mahat, Y. Abu Hassan. Improving the Convergence of the Backpropagation
Algorithm Using Local Adaptive Techniques. International Conference on Computational
Intelligence:173-176, 2004.
[19] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
AUTHORS
Gustavo Lado graduated from the University of Buenos Aires in Computer Science. His previous
research interests include the simulation of the human visual system with artificial neural
networks. Part of this work was recently published in the book "La Percepción del Hombre y sus
Máquinas". He is currently working on neural networks applied to natural language processing and
cognitive semantics as part of his Ph.D.
Enrique Carlos Segura was born in Buenos Aires, Argentina. He received the M.Sc. degrees in
Mathematics and Computer Science in 1988 and the Ph.D. degree in Mathematics in 1999, all from the
University of Buenos Aires.
He was a Fellow at CONICET (National Research Council, Argentina) and at CONEA (Atomic Energy
Agency, Argentina). He also held a Thalmann Fellowship to work at the Universitat Politécnica de
Catalunya (Spain) and a Research Fellowship at South Bank University (London). From 2003 to 2005 he
was Director of the Department of Computer Science of the University of Buenos Aires, where he is
at present a Professor and Researcher. His main areas of research are Artificial Neural Networks
(theory and applications) and, in general, Cognitive Models of Learning and Memory.

More Related Content

What's hot

The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
theijes
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_models
YoussefKitane
 
Grasp approach to rcpsp with min max robustness objective
Grasp approach to rcpsp with min max robustness objectiveGrasp approach to rcpsp with min max robustness objective
Grasp approach to rcpsp with min max robustness objective
csandit
 
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING
A HYBRID CLUSTERING ALGORITHM FOR DATA MININGA HYBRID CLUSTERING ALGORITHM FOR DATA MINING
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING
cscpconf
 
IRJET- Novel based Stock Value Prediction Method
IRJET- Novel based Stock Value Prediction MethodIRJET- Novel based Stock Value Prediction Method
IRJET- Novel based Stock Value Prediction Method
IRJET Journal
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
The Statistical and Applied Mathematical Sciences Institute
 
A Non Parametric Estimation Based Underwater Target Classifier
A Non Parametric Estimation Based Underwater Target ClassifierA Non Parametric Estimation Based Underwater Target Classifier
A Non Parametric Estimation Based Underwater Target Classifier
CSCJournals
 
Av33274282
Av33274282Av33274282
Av33274282
IJERA Editor
 
IRJET- Artificial Algorithms Comparative Study
IRJET-  	  Artificial Algorithms Comparative StudyIRJET-  	  Artificial Algorithms Comparative Study
IRJET- Artificial Algorithms Comparative Study
IRJET Journal
 
Statistical database, problems and mitigation
Statistical database, problems and mitigationStatistical database, problems and mitigation
Statistical database, problems and mitigation
Bikrant Gautam
 
Fast and Bootstrap Robust for LTS
Fast and Bootstrap Robust for LTSFast and Bootstrap Robust for LTS
Fast and Bootstrap Robust for LTSdessybudiyanti
 
Quantitative Risk Assessment - Road Development Perspective
Quantitative Risk Assessment - Road Development PerspectiveQuantitative Risk Assessment - Road Development Perspective
Quantitative Risk Assessment - Road Development PerspectiveSUBIR KUMAR PODDER
 
Heuristic based adaptive step size clms algorithms for smart antennas
Heuristic based adaptive step size clms algorithms for smart antennasHeuristic based adaptive step size clms algorithms for smart antennas
Heuristic based adaptive step size clms algorithms for smart antennas
csandit
 
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Editor IJCATR
 
quantitative-risk-analysis
quantitative-risk-analysisquantitative-risk-analysis
quantitative-risk-analysisDuong Duy Nguyen
 
A Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
A Novel Forecasting Based on Automatic-optimized Fuzzy Time SeriesA Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
A Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
TELKOMNIKA JOURNAL
 
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
AM Publications,India
 
Historical Simulation with Component Weight and Ghosted Scenarios
Historical Simulation with Component Weight and Ghosted ScenariosHistorical Simulation with Component Weight and Ghosted Scenarios
Historical Simulation with Component Weight and Ghosted Scenariossimonliuxinyi
 

What's hot (19)

The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_models
 
Grasp approach to rcpsp with min max robustness objective
Grasp approach to rcpsp with min max robustness objectiveGrasp approach to rcpsp with min max robustness objective
Grasp approach to rcpsp with min max robustness objective
 
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING
A HYBRID CLUSTERING ALGORITHM FOR DATA MININGA HYBRID CLUSTERING ALGORITHM FOR DATA MINING
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING
 
IRJET- Novel based Stock Value Prediction Method
IRJET- Novel based Stock Value Prediction MethodIRJET- Novel based Stock Value Prediction Method
IRJET- Novel based Stock Value Prediction Method
 
C070409013
C070409013C070409013
C070409013
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
 
A Non Parametric Estimation Based Underwater Target Classifier
A Non Parametric Estimation Based Underwater Target ClassifierA Non Parametric Estimation Based Underwater Target Classifier
A Non Parametric Estimation Based Underwater Target Classifier
 
Av33274282
Av33274282Av33274282
Av33274282
 
IRJET- Artificial Algorithms Comparative Study
IRJET-  	  Artificial Algorithms Comparative StudyIRJET-  	  Artificial Algorithms Comparative Study
IRJET- Artificial Algorithms Comparative Study
 
Statistical database, problems and mitigation
Statistical database, problems and mitigationStatistical database, problems and mitigation
Statistical database, problems and mitigation
 
Fast and Bootstrap Robust for LTS
Fast and Bootstrap Robust for LTSFast and Bootstrap Robust for LTS
Fast and Bootstrap Robust for LTS
 
Quantitative Risk Assessment - Road Development Perspective
Quantitative Risk Assessment - Road Development PerspectiveQuantitative Risk Assessment - Road Development Perspective
Quantitative Risk Assessment - Road Development Perspective
 
Heuristic based adaptive step size clms algorithms for smart antennas
Heuristic based adaptive step size clms algorithms for smart antennasHeuristic based adaptive step size clms algorithms for smart antennas
Heuristic based adaptive step size clms algorithms for smart antennas
 
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...
 
quantitative-risk-analysis
quantitative-risk-analysisquantitative-risk-analysis
quantitative-risk-analysis
 
A Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
A Novel Forecasting Based on Automatic-optimized Fuzzy Time SeriesA Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
A Novel Forecasting Based on Automatic-optimized Fuzzy Time Series
 
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
Evolution of regenerative Ca-ion wave-packet in Neuronal-firing Fields: Quant...
 
Historical Simulation with Component Weight and Ghosted Scenarios
Historical Simulation with Component Weight and Ghosted ScenariosHistorical Simulation with Component Weight and Ghosted Scenarios
Historical Simulation with Component Weight and Ghosted Scenarios
 

Similar to GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM

Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...
IJECEIAES
 
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
International Journal of Science and Research (IJSR)
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital Predistorter
IJERA Editor
 
A Tabu Search Heuristic For The Generalized Assignment Problem
A Tabu Search Heuristic For The Generalized Assignment ProblemA Tabu Search Heuristic For The Generalized Assignment Problem
A Tabu Search Heuristic For The Generalized Assignment Problem
Sandra Long
 
Lidar Point Cloud Classification Using Expectation Maximization Algorithm
Lidar Point Cloud Classification Using Expectation Maximization AlgorithmLidar Point Cloud Classification Using Expectation Maximization Algorithm
Lidar Point Cloud Classification Using Expectation Maximization Algorithm
AIRCC Publishing Corporation
 
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHMLIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
ijnlc
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
IJECEIAES
 
Linear Regression Paper Review.pptx
Linear Regression Paper Review.pptxLinear Regression Paper Review.pptx
Linear Regression Paper Review.pptx
MurindanyiSudi1
 
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
ijcsit
 
A review of automatic differentiationand its efficient implementation
A review of automatic differentiationand its efficient implementationA review of automatic differentiationand its efficient implementation
A review of automatic differentiationand its efficient implementation
ssuserfa7e73
 
1809.05680.pdf
1809.05680.pdf1809.05680.pdf
1809.05680.pdf
AnkitBiswas31
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
Alexander Decker
 
Data Imputation by Soft Computing
Data Imputation by Soft ComputingData Imputation by Soft Computing
Data Imputation by Soft Computing
ijtsrd
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx
kumarkaushal17
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
csandit
 
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
IJCSEA Journal
 
A046010107
A046010107A046010107
A046010107
IJERA Editor
 
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
ijscai
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
Janani C
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 

Similar to GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM (20)

Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...Comparative study of optimization algorithms on convolutional network for aut...
Comparative study of optimization algorithms on convolutional network for aut...
 
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
Modified Multiphase Level Set Image Segmentation Search for Energy Formulatio...
 
Memory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital PredistorterMemory Polynomial Based Adaptive Digital Predistorter
Memory Polynomial Based Adaptive Digital Predistorter
 
A Tabu Search Heuristic For The Generalized Assignment Problem
A Tabu Search Heuristic For The Generalized Assignment ProblemA Tabu Search Heuristic For The Generalized Assignment Problem
A Tabu Search Heuristic For The Generalized Assignment Problem
 
Lidar Point Cloud Classification Using Expectation Maximization Algorithm
Lidar Point Cloud Classification Using Expectation Maximization AlgorithmLidar Point Cloud Classification Using Expectation Maximization Algorithm
Lidar Point Cloud Classification Using Expectation Maximization Algorithm
 
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHMLIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
LIDAR POINT CLOUD CLASSIFICATION USING EXPECTATION MAXIMIZATION ALGORITHM
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
 
Linear Regression Paper Review.pptx
Linear Regression Paper Review.pptxLinear Regression Paper Review.pptx
Linear Regression Paper Review.pptx
 
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
Validation Study of Dimensionality Reduction Impact on Breast Cancer Classifi...
 
A review of automatic differentiationand its efficient implementation
A review of automatic differentiationand its efficient implementationA review of automatic differentiationand its efficient implementation
A review of automatic differentiationand its efficient implementation
 
1809.05680.pdf
1809.05680.pdf1809.05680.pdf
1809.05680.pdf
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
Data Imputation by Soft Computing
Data Imputation by Soft ComputingData Imputation by Soft Computing
Data Imputation by Soft Computing
 
4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx4. OPTIMIZATION NN AND FL.pptx
4. OPTIMIZATION NN AND FL.pptx
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
 
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
USING LEARNING AUTOMATA AND GENETIC ALGORITHMS TO IMPROVE THE QUALITY OF SERV...
 
A046010107
A046010107A046010107
A046010107
 
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
MARGINAL PERCEPTRON FOR NON-LINEAR AND MULTI CLASS CLASSIFICATION
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 

Recently uploaded

Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 

Recently uploaded (20)

Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 

GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM

  • 1. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.8, No.1, February 2019 DOI :10.5121/ijscai.2019.8103 37 GRADIENT OMISSIVE DESCENT IS A MINIMIZATION ALGORITHM Gustavo A. Lado and Enrique C. Segura Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires Cyprus ABSTRACT This article presents a promising new gradient-based backpropagation algorithm for multi-layer feedforward networks. The method requires no manual selection of global hyperparameters and is capable of dynamic local adaptations using only first-order information at a low computational cost. Its semi- stochastic nature makes it fit for mini-batch training and robust to different architecture choices and data distributions. Experimental evidence shows that the proposed algorithm improves training in terms of both convergence rate and speed as compared with other well known techniques. KEYWORDS Neural Networks, Backpropagation, Cost Function Minimization 1. INTRODUCTION 1.1. MOTIVATION When minimizing a highly non-convex cost function by means of some gradient descent strategy, several problems may arise. If we consider a single direction in the search space, there could appear local minima [5], where it is necessary first to climb up in order to have the possibility to reach a better or global minimum, or plateaus [16], where partial derivatives have very small values even though being far from a minimum. Of course, in a multidimensional space these problems can combine to produce other ones, such as partial minima (ravines) [1], where for some direction (subspace) there exists a local minimum and for other direction there is a plateau, and the saddle points [3], where partial derivatives have small values since a partial minimum appears in a direction and a maximum appears in another direction. Even if considering a single search direction it can be easily seen that an algorithm using only gradient information might be in trouble when looking for a good minimum. 1.2. BACKGROUND Approaches such as Stochastic Gradient Descent (SGD) [15] or Simulated Annealing (SA) [14, 7] introduce a random factor in the search which permits, to some extent, overcome the problem of local minima, whilst other methods as Gradient Descent with Momentum (GDM) [11] or adaptive parameters [2] allow for speeding up the search in the case of plateaus. When combined with incremental or mini-batch training, these techniques can be effective to some degree even with ravines and saddle points, but they use global hyperparameters that are generally problem- dependent.
  • 2. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.8, No.1, February 2019 38 Other techniques such as Conjugate Gradient Descent (CGD) [12], or others which incorporate second derivative information, as the Newton Method and the Levenberg-Marquardt algorithm [17], can speed up the search through a better analysis of the cost function, but are in general computationally expensive and highly complex, turning more difficult its implementation and use. Finally, more recent algorithms such as Quick Back-Propagation (QBP) [17], Resilient Back- Propagation (RBP) [13] and Adaptive Momentum Estimation (AME) [19, 4] use adaptive parameters independent for each search direction, hence they do not require selection of global hyperparameters and can adapt its strategy as to cope with different kinds of problems. However, even in these situations, they can get stuck in bad solutions (poor minima). In addition, all these techniques require batch training, which in many cases may be not only inconvenient but impossible or highly impractical. The main difficulty in the search of a good minimum (global or not) is that these algorithms try to take good decisions solely on the base of local information and values that, in many cases, are the result of estimations of estimations. In other words: it is difficult to take intelligent decisions with incomplete and inaccurate information. Perhaps a better approach would be just to try that by the end of each epoch the probability of cost decrease be greater than that of increase and, in the case that it increased, it have a reasonably high probability of escaping to another basin of the cost surface. Thus, instead of trying to ensure convergence through analysis of the cost function, the matter is to warrant statistical tendency to a good minimum. 2. THE PROPOSED ALGORITHM Our method borrows in part certain ideas from other techniques. For example, from SGD the introduction of a random element in the descent; from CGD the exploration of the cost function - though nonlinear in this case- and from RBP the use of the sign of the gradient as indicator of the search direction, omitting its magnitude [12,13, 14, 15, 17]. The main idea is that during training the search moves across the cost surface with random steps within a radius of search, so as to statistically converge to the minimum; and this radius decreases as the solution is approached. 2.1. DESCRIPTION In each epoch the dataset is divided into mini-batches. A partial estimation of the cost function E is made for each one of those mini-batches, and the gradient is computed by standard error backpropagation. Then weights are modified in a random magnitude, bounded by a value R, but using the direction (sign) of the △W previously computed. At the end of each epoch, the cost is compared to that of earlier steps and, should it correspond, necessary adjustments are made on R. If we look at each mini-batch as a sample of the dataset, we can consider each epoch as a Montecarlo estimation of the cost function. Similarly, the fact that the search radius gradually decreases suggests a resemblance to simulated annealing [10]. However, to impose a fixed reduction of R as a function of the number of epochs or of the cost is impractical due to the variability between different problems. For this reason a strategy was chosen where adjustments to R are dynamically applied, depending on the relative cost [9]. 
This turns it more robust, but the adjustments must be small, in the order of 1%, since by the very nature of the estimation of E, oscillations might occur.
  • 3. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.8, No.1, February 2019 39 Then the basic idea for modifying R can be expressed as follows: if the cost increased with regard to the last epoch, reduce the radius of search. This is relatively effective in a variety of scenarios, but these results can still be improved by using some local parameters instead of global ones [18]. For example, at the end of each epoch an estimation can be made of the individual contribution of each unit to the total cost, that is, the cost by unit (Eu). Similarly, instead of using a global radius of search R, we can have an individual radius (Ru) for each unit. Under these conditions the rule for adjustment of search radii is modified as follows: if the total cost increased, reduce search radius for units with higher error and increase radius for the others (see Algorithm 1). Then, in this strategy the radius of search can be seen as the level of resolution of the cost function. If this function cannot decrease any more, the radius is reduced in order to look at the cost surface in more detail [2]. Increasing the radius for those units whose cost grew up helps to maintain an equilibrium and possibly to escape from a bad solution. Algorithm 1. The Gradient Omissive Descent algorithm. initialize W, R, E, t while (E > min_error)∧(t < max_epoch): E[t] ← 0 for every batch b: W ← W + U(0, R)|W| × sgn(∂E(W, b)∂W) E[t] ← E[t] + E(W, b) dE ← E[t] − E[t − 1] if dE > 0: for every unit u: if dEu > 0: Ru ← Ru × 0.99 else: Ru ← Ru × 1.01 t ← t + 1 3. EXPERIMENTS 3.1. METHODS In all experiments our technique, Gradient Omissive Descent (GOD), was compared with three other algorithms: SGD, GDM, RBP. These were chosen in order to cover the alternatives of incremental training, mini-batch and batch respectively. In future work we will also include other approaches such as CGD, though the results might not be easy to compare, and AME, for which we do not have yet a properly tested implementation. The stop condition in all simulations was either the completion of a maximum number of epochs or having reached an error under a minimum value. This error was computed as the average of all the errors in the training dataset. Since data were artificially generated, it was possible to keep the same dataset size through all the tests.
  • 4. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.8, No.1, February 2019 40 As a measure of the efficacy of the algorithms, and since there exists no unique stop condition, the quantity epoch⁄max_epoch × error⁄min_error was used, where epoch and error correspond to final values, and max_epoch and min_error are the values for the stop condition. Thus, lower values indicate a higher efficacy and, in particular, quantities less than 1 imply that a solution has been obtained. Experiments were realized on two kinds of problems, typical for evaluation of this class of algorithms: parity and auto-encoding. In both cases, additional constraints were introduced in order to have a better control and thus to adjust the level of difficulty [8]. Final results were computed as the average of several simulations under the same constraints but with different datasets. 3.1.1. PARITY The parity problem, i.e. multivariate XOR, is defined as the result of verifying if the number of 1’s in a list of binary variables is even. To generate data for these experiments we use variables xi ∈ { − 1, 1} instead of binary, defining the function even as: Where u(x) = 1 if x = 1, and equals to zero otherwise. In its simplest form, this problem consists of mapping n inputs into a single output which indicates the parity. What makes it particularly interesting is the fact that changing the value of just one input variable produces a change in the output. In our experiments we used 10 input variables [x0..x9] and 10 output variables [z0..z9], where zi = even(x0..xi). That is, each output corresponds to the parity on a portion of the input variables. This turns the problem more difficult, as the unique hidden layer must find a representation satisfying simultaneously all outputs. 3.1.2. AUTO-ENCODING WITH DEPENDENT VARIABLES One of the critical factors in the efficacy of an auto-encoding is how independent are the data between each other. If the dataset is built up with completely independent variables and of the same variance, it is not possible to make an effective reduction of dimension [6]. For this reason, the datasets for the experiments were constructed with different proportions between independent and dependent variables. Independent variables were drawn from a random uniform distribution between -1 and 1, and the dependent ones were random linear combinations of the former. Final values result from applying the sign function in order to work also in this case with values in { − 1, 1}. However, in real situations, data are not often equally correlated between them. This fact was also taken into account, splitting the dataset into several classes. Each class not only uses different correlations for dependent variables, but also assigns different positions to the independent ones.
3.2. RESULTS

This section presents comparisons between the results obtained with the different algorithms under consideration for the problems described above. For each method, the parameters that produced the best results were chosen. In all simulations the stop condition was 3000 epochs or an average error below 10^−3; these settings were intended to ensure that all the algorithms had the possibility to converge. The weights were independently initialized at random from a uniform distribution over the interval (−N^(−1/2), N^(−1/2)), N being the number of neurons in the previous layer. Initial values for R were computed as 1/(N × M), where N is the number of units in the previous layer and M is the number of units in the same layer. Each mini-batch, both for GDM and for GOD, consisted of 10% of the elements of the dataset, chosen at random without repetition.

3.2.1. PARITY

For this group of experiments the network had a single hidden layer of 40 units. The training set included all the 2^10 possible input combinations. For SGD the learning rate η was 0.01; for GDM, η = 0.001 and the momentum factor β = 0.5. For RBP, η+ = 1.2 and η− = 0.5 in all cases, as proposed in [13].
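A minimal sketch of the initialization and mini-batch sampling just described is given below. It assumes our reading of the flattened expressions in the text, namely weight limits of N^(−1/2) and an initial per-unit radius of 1/(N × M); the helper names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_prev, n_units):
        # Weights uniform in (-N^(-1/2), N^(-1/2)) with N = n_prev (as we read Sec. 3.2);
        # one search radius per unit, initialized to 1/(N*M) with M = n_units (assumed reading).
        limit = n_prev ** -0.5
        W = rng.uniform(-limit, limit, size=(n_prev, n_units))
        R = np.full(n_units, 1.0 / (n_prev * n_units))
        return W, R

    def minibatches(X, Z, fraction=0.1):
        # One epoch split into mini-batches of `fraction` of the data,
        # drawn at random without repetition.
        order = rng.permutation(len(X))
        size = max(1, int(len(X) * fraction))
        for start in range(0, len(X), size):
            sel = order[start:start + size]
            yield X[sel], Z[sel]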
Figure 1. Training results of several algorithms for multiple parity.

Figure 1 shows the time evolution of the sum of squared errors per epoch for the different algorithms. These results correspond to a typical training run, since averaging several simulations would not show the differences in the behaviour of the different methods. Although the inherent stochasticity of the training process introduces large variations in the descent path of each algorithm, the order of convergence is markedly consistent. It can be noted that the cost decreases much more rapidly with the proposed method.

3.2.2. AUTO-ENCODING WITH DEPENDENT VARIABLES

A network with 100 input and 100 output neurons was used, while the size of the hidden layer was varied among 70, 80 and 90 units. The training set had 1000 instances, of which 25% were set apart for validation. Given the number of simulations run under the same conditions, holdout validation was used, since in these circumstances further cross-validation partitions are unnecessary. The parameters were: for SGD, η = 0.001; for GDM, η = 0.001 and β = 0.5.

Figure 2 shows how the efficacy of the different algorithms varies with the proportion of independent variables and the number of units in the hidden layer. The three parts of the figure correspond to different numbers of hidden units, indicated on the left (a smaller value corresponds to a more difficult problem); the percentages on the horizontal axis give the proportion of dependent variables (a smaller value also corresponds to a more difficult problem); and the values on the vertical axes indicate the efficacy of the algorithms, computed as defined above and averaged over several runs. In these simulations 5 classes were used; results for other numbers of classes are not shown here since they exhibit a less pronounced variation, consistent with the problem complexity.

Figure 2. Comparison of the performance of different algorithms in problems of auto-encoding with correlated data.
It is worth noting not only how the efficacy of the proposed technique varies with problems of different complexity, but also how the relative behaviour of all the compared methods varies. Depending on the complexity of the problem, the quality of our solution changes relative to the other algorithms, but a clear trend towards faster convergence can also be seen.

4. CONCLUSIONS

In the present work an optimization technique was introduced with the ability to find minima or quasi-minima of cost functions for non-convex, high-dimensional problems. The proposed algorithm successfully handles some classical difficulties reported in the literature, such as local minima, plateaus, ravines and saddle points. In our view, the method is an advance over the previous ones, from which it also takes inspiration, in the form of an exploration of the cost function through a semi-random descent that omits the magnitude of the gradient.

The proposed algorithm works efficiently on a wide range of problems, with a convergence rate and speed higher than other common methods. We also note that in certain cases, for example when the problem is very simple, less sophisticated algorithms such as SGD can be more effective, although they in turn have the disadvantage of requiring parameter tuning. Similarly, in problems that can become intractable (due, for example, to architecture limitations) our solution is comparable to RBP but, unlike it, can work in mini-batches.

ACKNOWLEDGEMENTS

The idea for this algorithm was fundamentally conceived while L. G. was working with Professor Verónica Dahl in Vancouver, BC. Special thanks to her for the invitation, and to the Emerging Leaders in the Americas Program (ELAP) of the Government of Canada for making that visit possible.

REFERENCES

[1] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun. The loss surfaces of multilayer networks. Artificial Intelligence and Statistics: 192-204, 2015.
[2] Christian Darken, Joseph Chang, John Moody. Learning rate schedules for faster stochastic gradient search. Neural Networks for Signal Processing II, Proceedings of the 1992 IEEE-SP Workshop: 3-12, 1992.
[3] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems: 2933-2941, 2014.
[4] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
[5] Marco Gori, Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1): 76-86, 1992.
[6] Geoffrey E. Hinton, Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504-507, 2006.
[7] Scott Kirkpatrick, C. Daniel Gelatt, Mario P. Vecchi, et al. Optimization by simulated annealing. Science, 220(4598): 671-680, 1983.
[8] Steve Lawrence, C. Lee Giles, Ah Chung Tsoi. What size neural network gives optimal generalization? Convergence properties of backpropagation. 1998.
[9] George D. Magoulas, Michael N. Vrahatis, George S. Androulakis. Improving the convergence of the backpropagation algorithm using learning rate adaptation methods. Neural Computation, 11(7): 1769-1796, 1999.
[10] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6): 1087-1092, 1953.
[11] Miguel Moreira, Emile Fiesler. Neural networks with adaptive learning rate and momentum terms. Idiap, 1995.
[12] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4): 525-533, 1993.
[13] Martin Riedmiller, Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. IEEE International Conference on Neural Networks: 586-591, 1993.
[14] Herbert Robbins, Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics: 400-407, 1951.
[15] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088): 533, 1986.
[16] Richard S. Sutton. Two problems with backpropagation and other steepest-descent learning procedures for networks. Proc. 8th Annual Conf. Cognitive Science Society: 823-831, 1986.
[17] Bogdan M. Wilamowski. Neural network architectures and learning. IEEE International Conference on Industrial Technology, vol. 1, 2003.
[18] Z. Zainuddin, N. Mahat, Y. Abu Hassan. Improving the convergence of the backpropagation algorithm using local adaptive techniques. International Conference on Computational Intelligence: 173-176, 2004.
[19] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

AUTHORS

Gustavo Lado graduated in Computer Science from the University of Buenos Aires. His previous research interests include the simulation of the human visual system with artificial neural networks; part of this work was recently published in the book "La Percepción del Hombre y sus Máquinas". He is currently working on neural networks applied to natural language processing and cognitive semantics as part of his Ph.D.
Enrique Carlos Segura was born in Buenos Aires, Argentina. He received M.Sc. degrees in Mathematics and Computer Science in 1988 and a Ph.D. degree in Mathematics in 1999, all from the University of Buenos Aires. He was a Fellow at CONICET (National Research Council, Argentina) and at the CONEA (Atomic Energy Agency, Argentina). He also held a Thalmann Fellowship to work at the Universitat Politécnica de Catalunya (Spain) and a Research Fellowship at South Bank University (London). From 2003 to 2005 he was Director of the Department of Computer Science of the University of Buenos Aires, where he is at present a Professor and Researcher. His main areas of research are Artificial Neural Networks (theory and applications) and, in general, cognitive models of learning and memory.