High capacity neural network optimization problems: study & exploration of solutions  Francis Piéraut, eng., M.A.Sc [email_address] http://fraka6.blogspot.com/
Plan: Context: learning a language model with a NN. High capacity NN optimization inefficiency (# of errors & CPU time). Is this normal? Various optimization problems. Some solutions & results. Contributions. Future work. Conclusion.
Learning algorithm: Neural Network. Problem: find P(c_i | x_1, x_2, ...) from samples drawn from P(x_1, x_2, ... | c_i). No a priori distribution. Complex (non-linear) relationships. A solution = a Neural Network.
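For reference, the link between the class-conditional samples and the target posterior is just Bayes' rule (standard material, not taken from the slide itself):

```latex
P(c_i \mid x_1, \dots, x_D) =
  \frac{P(x_1, \dots, x_D \mid c_i)\, P(c_i)}
       {\sum_j P(x_1, \dots, x_D \mid c_j)\, P(c_j)}
```

The neural network sidesteps this decomposition and estimates the left-hand side directly from labelled examples.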
Neural Networks and capacity [diagram: inputs x_1 ... x_D, hidden units y_1 ... y_N, outputs z_1 ... z_k estimating P(c_i | x), targets t_1 ... t_k, weights w_ij (input to hidden) and w_kj (hidden to output)]
 
High/huge capacity Neural Network [diagram: network with a large hidden layer y_1, y_2, ...]
Constraints: first-order stochastic gradient; standard architecture; one learning rate; overfitting is neglected. Database: « Letters » (26 classes / 16 inputs / 20,000 examples).
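As a rough illustration of the setup described on this slide (a standard one-hidden-layer network trained by plain first-order stochastic gradient descent with a single global learning rate), here is a minimal sketch; the layer size, learning rate and data handling are placeholders, not the values used in the thesis:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def train_sgd(X, T, n_hidden=80, lr=0.01, epochs=50, seed=0):
    """Plain first-order SGD on a one-hidden-layer MLP (tanh hidden, softmax output).
    X: (n_examples, 16) inputs, T: (n_examples, 26) one-hot targets for the Letters data."""
    rng = np.random.default_rng(seed)
    D, K = X.shape[1], T.shape[1]
    W1 = rng.normal(0, 0.1, (D, n_hidden))    # w_ij: input -> hidden
    W2 = rng.normal(0, 0.1, (n_hidden, K))    # w_kj: hidden -> output
    for _ in range(epochs):
        for x, t in zip(X, T):                # one example at a time (stochastic)
            y = np.tanh(x @ W1)               # hidden activations y_j
            z = softmax(y @ W2)               # output probabilities z_k ~ P(c_k | x)
            dz = z - t                        # output error (cross-entropy gradient)
            dy = (dz @ W2.T) * (1 - y ** 2)   # back-propagated hidden error
            W2 -= lr * np.outer(y, dz)        # one learning rate for all weights
            W1 -= lr * np.outer(x, dy)
    return W1, W2
```

Capacity is varied by changing n_hidden; with 16 inputs and 26 classes this matches the « Letters » setup from the slide.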
Errors: Optimization Inefficiency of High Capacity Neural Networks
CPU time: Optimization Inefficiency of High Capacity Neural Networks
Is this inefficiency normal? Hypothesis: no, the inefficiency is created by the worsening of optimization problems inherent to stochastic backpropagation. Linear solutions vs. non-linear solutions. Solution spaces. Solution = reduce or eliminate the problems related to backpropagation.
Neural Networks and equations [diagram: same network as before, with inputs x_1 ... x_D, hidden units y_1 ... y_N, outputs z_1 ... z_k, targets t_1 ... t_k, weights w_ij and w_kj]
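The slide's equations do not survive the export; for readability, here are the standard forward and stochastic-backprop equations for this architecture, assuming sigmoid units and a squared-error cost (the exact activations and cost on the original slide are an assumption):

```latex
y_j = \sigma\!\Big(\sum_i w_{ij}\, x_i\Big), \qquad
z_k = \sigma\!\Big(\sum_j w_{kj}\, y_j\Big)
```

```latex
\Delta w_{kj} = -\eta\, (z_k - t_k)\, z_k (1 - z_k)\, y_j, \qquad
\Delta w_{ij} = -\eta \Big(\sum_k (z_k - t_k)\, z_k (1 - z_k)\, w_{kj}\Big)\, y_j (1 - y_j)\, x_i
```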
The learning process slows down for non-linear relationships
Solution spaces [diagram: the solution space of an N-neuron Neural Network inside the solution space of an (N+K)-neuron Neural Network]
Similar Solutions [diagram example: from the same initial state, two optimization paths reach similar solutions, one after 5 iterations and one after 3 iterations]
Optimization problems: moving target problem; attenuation and gradient dilution; no specialization mechanism (e.g. boosting); opposite gradients (classification); symmetry problem
Neural Networks Optimization Problems [same network diagram, annotated with the problems: moving target, attenuation and gradient dilution, no specialization mechanism (e.g. boosting), opposite gradients (classification), symmetry]
Explored solutions: Incremental Neural Networks; decoupled architecture; Neural Network with partial parameter optimization; etc.
Incremental Neural Networks: first approach
Incremental Neural Networks: first approach (optimization with fixed weights)
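A minimal sketch of one reading of this first approach: hidden units are added one at a time and the previously trained weights are kept fixed, so only the new unit's incoming and outgoing weights are optimized (the growth schedule and freezing policy are assumptions, not taken from the slide):

```python
import numpy as np

def add_hidden_unit(W1, W2, rng, scale=0.1):
    """Grow the network by one hidden unit; only that unit's weights will be trained."""
    D, H = W1.shape
    K = W2.shape[1]
    W1 = np.hstack([W1, rng.normal(0, scale, (D, 1))])   # new input -> hidden column
    W2 = np.vstack([W2, rng.normal(0, scale, (1, K))])   # new hidden -> output row
    return W1, W2, H                                      # H = index of the new unit

def sgd_step_new_unit_only(x, t, W1, W2, h, lr=0.01):
    """One stochastic step that updates only the weights of hidden unit h."""
    y = np.tanh(x @ W1)
    a = y @ W2
    z = np.exp(a - a.max()); z /= z.sum()     # softmax outputs
    dz = z - t
    dy_h = (dz @ W2[h]) * (1 - y[h] ** 2)     # error reaching the new unit only
    W2[h] -= lr * y[h] * dz                   # new unit -> outputs
    W1[:, h] -= lr * dy_h * x                 # inputs -> new unit
    return W1, W2
```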
Hypothesis: Incremental NN [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (Incremental NN), with OK marks where the problem is expected to be addressed]
Incremental Neural Networks (1): results
Why doesn't it work? (critical points)
 
 
Incremental Neural Network: second approach (add hidden layers) [diagram: a network with inputs x_1, x_2 and outputs z_1, z_2, before and after inserting a hidden layer y_1 ... y_4]
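One generic way to grow depth while roughly preserving what the network already computes is to initialize the inserted layer close to the identity; this is a sketch of that idea only, not necessarily the insertion scheme used in the thesis:

```python
import numpy as np

def insert_layer_near_identity(W_out, new_size, rng, eps=0.01):
    """Insert a new hidden layer of `new_size` units just below the outputs.
    W_out: (H, K) weights currently feeding the outputs.  Assumes new_size >= H
    and an activation that is near-linear around 0 (e.g. tanh), so the network's
    function is approximately preserved at insertion time."""
    H, K = W_out.shape
    assert new_size >= H
    W_mid = np.zeros((H, new_size))
    W_mid[:, :H] = np.eye(H)                      # pass old activations through
    W_mid += rng.normal(0.0, eps, W_mid.shape)    # tiny noise to break symmetry
    W_out_new = np.zeros((new_size, K))
    W_out_new[:H, :] = W_out                      # reuse the old output weights
    W_out_new += rng.normal(0.0, eps, W_out_new.shape)
    return W_mid, W_out_new
```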
Cost function curve shape
Hypothesis: Incremental NN (add layers) [table: the same problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (Incremental NN with added layers), with OK marks where the problem is expected to be addressed]
Incremental Neural Network (2): results
Decoupled architecture
Hypothesis: Decoupled Architecture [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (decoupled architecture), with cells marked OK or Removed according to whether the problem is addressed or eliminated]
Inefficiency of high capacity Neural Networks (CPU time)
Efficiency of High capacity Neural Network: decoupled architecture
Hypothesis: Partial Parameter Optimization [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (partial optimization), with OK marks where the problem is expected to be addressed]
Neural Networks with partial parameter optimization: results [curves comparing optimization of all parameters vs. max-sensitivity optimization]
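A minimal sketch of one plausible reading of "max sensitivity optimization": at each step, only the weights whose current gradient magnitude (sensitivity) is largest get updated; the selected fraction is an arbitrary placeholder:

```python
import numpy as np

def partial_update(W, grad, lr=0.01, frac=0.1):
    """Update only the fraction `frac` of weights with the largest
    gradient magnitude; every other weight is left untouched."""
    k = max(1, int(frac * grad.size))
    flat = np.abs(grad).ravel()
    thresh = np.partition(flat, -k)[-k]      # k-th largest |gradient|
    mask = np.abs(grad) >= thresh            # the most "sensitive" weights
    W -= lr * grad * mask
    return W
```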
Why predict parameters? (observation) [plot: parameter values vs. training epoch]
Hypothesis *: benefit: reduce the number of iterations by predicting parameter values from their history. [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (parameter prediction)]
Prediction: quadratic extrapolation
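A minimal sketch of quadratic extrapolation of parameter trajectories: fit a degree-2 polynomial to the last few per-epoch snapshots of each weight and jump directly to the extrapolated value; the per-epoch snapshots and the one-epoch-ahead horizon are assumptions:

```python
import numpy as np

def predict_parameters(history, horizon=1.0):
    """history: (n_epochs, n_params) array with one weight-vector snapshot per
    epoch (n_epochs >= 3).  Returns the quadratically extrapolated weights
    `horizon` epochs after the last snapshot."""
    n_epochs = history.shape[0]
    t = np.arange(n_epochs, dtype=float)
    # Fit w(t) ~ a*t^2 + b*t + c independently for every parameter.
    a, b, c = np.polyfit(t, history, deg=2)       # each of shape (n_params,)
    t_next = (n_epochs - 1) + horizon
    return a * t_next ** 2 + b * t_next + c       # extrapolated weights
```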
Prediction: learning rate increase
Contributions: experimental evidence of an optimization problem for high capacity NNs. At equal capacity, adding a hidden layer: speeds up learning; gives a better error rate. Presentation of a solution whose speed does not degrade when capacity is increased (decoupled architecture / opposite gradients).
Future work: Can the high capacity optimization inefficiency be generalized (more datasets)? For classification tasks, is the decoupled architecture a better choice from a generalization point of view? Does the critical-point hypothesis apply in the context of incremental neural networks? Adding hidden layers: why doesn't it work for successive layers? Partial parameter optimization: better understanding of the results; which parameter-selection algorithm is best? Is there an efficient parameter-prediction technique?
Conclusion: partially documented, experimentally, the inefficiency of high capacity Neural Networks (CPU time / # of errors). Various problems. Explored solutions: incremental approach; decoupled architecture; partial parameter optimization; parameter prediction; ...
Any Questions??
Example: linear solution
Example: highly non-linear solution
Selection of the connections with the largest influence on the cost
Selection of the connections with the largest influence on the error [diagram: target T and output S pairs, e.g. T = 1, S = 0; T = 0, S = 1; T = 0, S = 0.1; T = 0, S = 0.1]
Observation: idealized behavior of the time ratio
