High capacity neural network optimization problems: study & exploration of solutions  Francis Piéraut, eng., M.A.Sc [email_address] http://fraka6.blogspot.com/
Plan: Context: learning a language model with a NN. High capacity NN optimization inefficiency (# of errors & CPU time). Is this normal? Various optimization problems. Some solutions & results. Contributions. Future work. Conclusion.
Learning algorithm: Neural Network. Problem: find P(c_i | x_1, x_2, ...) from samples drawn from P(x_1, x_2, ... | c_i). No a priori distribution. Complex (non-linear) relationships. A solution = a Neural Network.
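For reference, the link between the class-conditional samples and the target posterior is just Bayes' rule (standard material, not taken from the slide itself):

```latex
P(c_i \mid x_1, \dots, x_D) =
  \frac{P(x_1, \dots, x_D \mid c_i)\, P(c_i)}
       {\sum_j P(x_1, \dots, x_D \mid c_j)\, P(c_j)}
```

The neural network sidesteps this decomposition and estimates the left-hand side directly from labelled examples.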
Neural Networks and capacity [diagram: inputs x_1 ... x_D, hidden units y_1 ... y_N, outputs z_1 ... z_k estimating P(c_i | x), targets t_1 ... t_k, weights w_ij (input to hidden) and w_kj (hidden to output)]
 
High/huge capacity Neural Network [diagram: network with a large hidden layer y_1, y_2, ...]
Constraints: first-order stochastic gradient; standard architecture; one learning rate; overfitting is neglected. Database: « Letters » (26 classes / 16 inputs / 20,000 examples).
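As a rough illustration of the setup described on this slide (a standard one-hidden-layer network trained by plain first-order stochastic gradient descent with a single global learning rate), here is a minimal sketch; the layer size, learning rate and data handling are placeholders, not the values used in the thesis:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def train_sgd(X, T, n_hidden=80, lr=0.01, epochs=50, seed=0):
    """Plain first-order SGD on a one-hidden-layer MLP (tanh hidden, softmax output).
    X: (n_examples, 16) inputs, T: (n_examples, 26) one-hot targets for the Letters data."""
    rng = np.random.default_rng(seed)
    D, K = X.shape[1], T.shape[1]
    W1 = rng.normal(0, 0.1, (D, n_hidden))    # w_ij: input -> hidden
    W2 = rng.normal(0, 0.1, (n_hidden, K))    # w_kj: hidden -> output
    for _ in range(epochs):
        for x, t in zip(X, T):                # one example at a time (stochastic)
            y = np.tanh(x @ W1)               # hidden activations y_j
            z = softmax(y @ W2)               # output probabilities z_k ~ P(c_k | x)
            dz = z - t                        # output error (cross-entropy gradient)
            dy = (dz @ W2.T) * (1 - y ** 2)   # back-propagated hidden error
            W2 -= lr * np.outer(y, dz)        # one learning rate for all weights
            W1 -= lr * np.outer(x, dy)
    return W1, W2
```

Capacity is varied by changing n_hidden; with 16 inputs and 26 classes this matches the « Letters » setup from the slide.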
Errors: Optimization Inefficiency of High Capacity Neural Networks
CPU time: Optimization Inefficiency of High Capacity Neural Networks
Is this inefficiency normal? Hypothesis: no, the inefficiency is created by the worsening of optimization problems inherent to stochastic backpropagation. Linear solutions vs. non-linear solutions. Solution spaces. Solution = reduce or eliminate the problems related to backpropagation.
Neural Networks and equations [diagram: same network as before, with inputs x_1 ... x_D, hidden units y_1 ... y_N, outputs z_1 ... z_k, targets t_1 ... t_k, weights w_ij and w_kj]
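The slide's equations do not survive the export; for readability, here are the standard forward and stochastic-backprop equations for this architecture, assuming sigmoid units and a squared-error cost (the exact activations and cost on the original slide are an assumption):

```latex
y_j = \sigma\!\Big(\sum_i w_{ij}\, x_i\Big), \qquad
z_k = \sigma\!\Big(\sum_j w_{kj}\, y_j\Big)
```

```latex
\Delta w_{kj} = -\eta\, (z_k - t_k)\, z_k (1 - z_k)\, y_j, \qquad
\Delta w_{ij} = -\eta \Big(\sum_k (z_k - t_k)\, z_k (1 - z_k)\, w_{kj}\Big)\, y_j (1 - y_j)\, x_i
```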
The learning process slows down for non-linear relationships
Solution spaces [diagram: the solution space of an N-neuron Neural Network inside the solution space of an (N+K)-neuron Neural Network]
Similar Solutions [diagram example: from the same initial state, two optimization paths reach similar solutions, one after 5 iterations and one after 3 iterations]
Optimization problems: moving target problem; attenuation and gradient dilution; no specialization mechanism (e.g. boosting); opposite gradients (classification); symmetry problem
Neural Networks Optimization Problems [same network diagram, annotated with the problems: moving target, attenuation and gradient dilution, no specialization mechanism (e.g. boosting), opposite gradients (classification), symmetry]
Explored solutions: Incremental Neural Networks; decoupled architecture; Neural Network with partial parameter optimization; etc.
Incremental Neural Networks: first approach
Incremental Neural Networks: first approach (optimization with fixed weights)
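A minimal sketch of one reading of this first approach: hidden units are added one at a time and the previously trained weights are kept fixed, so only the new unit's incoming and outgoing weights are optimized (the growth schedule and freezing policy are assumptions, not taken from the slide):

```python
import numpy as np

def add_hidden_unit(W1, W2, rng, scale=0.1):
    """Grow the network by one hidden unit; only that unit's weights will be trained."""
    D, H = W1.shape
    K = W2.shape[1]
    W1 = np.hstack([W1, rng.normal(0, scale, (D, 1))])   # new input -> hidden column
    W2 = np.vstack([W2, rng.normal(0, scale, (1, K))])   # new hidden -> output row
    return W1, W2, H                                      # H = index of the new unit

def sgd_step_new_unit_only(x, t, W1, W2, h, lr=0.01):
    """One stochastic step that updates only the weights of hidden unit h."""
    y = np.tanh(x @ W1)
    a = y @ W2
    z = np.exp(a - a.max()); z /= z.sum()     # softmax outputs
    dz = z - t
    dy_h = (dz @ W2[h]) * (1 - y[h] ** 2)     # error reaching the new unit only
    W2[h] -= lr * y[h] * dz                   # new unit -> outputs
    W1[:, h] -= lr * dy_h * x                 # inputs -> new unit
    return W1, W2
```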
Hypothesis: Incremental NN [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (Incremental NN), with OK marks where the problem is expected to be addressed]
Incremental Neural Networks (1): results
Why doesn't it work? (critical points)
 
 
Incremental Neural Network: second approach (add hidden layers) [diagram: a network with inputs x_1, x_2 and outputs z_1, z_2, before and after inserting a hidden layer y_1 ... y_4]
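One generic way to grow depth while roughly preserving what the network already computes is to initialize the inserted layer close to the identity; this is a sketch of that idea only, not necessarily the insertion scheme used in the thesis:

```python
import numpy as np

def insert_layer_near_identity(W_out, new_size, rng, eps=0.01):
    """Insert a new hidden layer of `new_size` units just below the outputs.
    W_out: (H, K) weights currently feeding the outputs.  Assumes new_size >= H
    and an activation that is near-linear around 0 (e.g. tanh), so the network's
    function is approximately preserved at insertion time."""
    H, K = W_out.shape
    assert new_size >= H
    W_mid = np.zeros((H, new_size))
    W_mid[:, :H] = np.eye(H)                      # pass old activations through
    W_mid += rng.normal(0.0, eps, W_mid.shape)    # tiny noise to break symmetry
    W_out_new = np.zeros((new_size, K))
    W_out_new[:H, :] = W_out                      # reuse the old output weights
    W_out_new += rng.normal(0.0, eps, W_out_new.shape)
    return W_mid, W_out_new
```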
Cost function curve shape
Hypothesis: Incremental NN (add layers) [table: the same problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (Incremental NN with added layers), with OK marks where the problem is expected to be addressed]
Incremental Neural Network (2): results
Decoupled architecture
Hypothesis: Decoupled Architecture [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (decoupled architecture), with cells marked OK or Removed according to whether the problem is addressed or eliminated]
Inefficiency of high capacity Neural Networks (CPU time)
Efficiency of High capacity Neural Network: decoupled architecture
Hypothesis: Partial Parameter Optimization [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (partial optimization), with OK marks where the problem is expected to be addressed]
Neural Networks with partial parameter optimization: results [curves comparing optimization of all parameters vs. max-sensitivity optimization]
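A minimal sketch of one plausible reading of "max sensitivity optimization": at each step, only the weights whose current gradient magnitude (sensitivity) is largest get updated; the selected fraction is an arbitrary placeholder:

```python
import numpy as np

def partial_update(W, grad, lr=0.01, frac=0.1):
    """Update only the fraction `frac` of weights with the largest
    gradient magnitude; every other weight is left untouched."""
    k = max(1, int(frac * grad.size))
    flat = np.abs(grad).ravel()
    thresh = np.partition(flat, -k)[-k]      # k-th largest |gradient|
    mask = np.abs(grad) >= thresh            # the most "sensitive" weights
    W -= lr * grad * mask
    return W
```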
Why predict parameters? (observation) [plot: parameter values vs. training epoch]
Hypothesis *: benefit: reduce the number of iterations by predicting parameter values from their history. [table: problems (symmetry, gradient dilution, no specialization mechanism, opposite gradients, moving target) against the proposed solution (parameter prediction)]
Prediction: quadratic extrapolation
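A minimal sketch of quadratic extrapolation of parameter trajectories: fit a degree-2 polynomial to the last few per-epoch snapshots of each weight and jump directly to the extrapolated value; the per-epoch snapshots and the one-epoch-ahead horizon are assumptions:

```python
import numpy as np

def predict_parameters(history, horizon=1.0):
    """history: (n_epochs, n_params) array with one weight-vector snapshot per
    epoch (n_epochs >= 3).  Returns the quadratically extrapolated weights
    `horizon` epochs after the last snapshot."""
    n_epochs = history.shape[0]
    t = np.arange(n_epochs, dtype=float)
    # Fit w(t) ~ a*t^2 + b*t + c independently for every parameter.
    a, b, c = np.polyfit(t, history, deg=2)       # each of shape (n_params,)
    t_next = (n_epochs - 1) + horizon
    return a * t_next ** 2 + b * t_next + c       # extrapolated weights
```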
Prediction: learning rate increase
Contributions: experimental evidence of an optimization problem for high capacity NNs. At equal capacity, adding a hidden layer: speeds up learning; gives a better error rate. Presentation of a solution whose speed does not degrade when capacity is increased (decoupled architecture / opposite gradients).
Future work: Can the high capacity optimization inefficiency be generalized (more datasets)? For classification tasks, is the decoupled architecture a better choice from a generalization point of view? Does the critical-point hypothesis apply in the context of incremental neural networks? Adding hidden layers: why doesn't it work for successive layers? Partial parameter optimization: better understanding of the results; which parameter-selection algorithm is best? Is there an efficient parameter-prediction technique?
Conclusion: partially documented, experimentally, the inefficiency of high capacity Neural Networks (CPU time / # of errors). Various problems. Explored solutions: incremental approach; decoupled architecture; partial parameter optimization; parameter prediction; ...
Any Questions??
Example: linear solution
Example: highly non-linear solution
Selection of the connections with the largest influence on the cost
Selection of the connections with the largest influence on the error [diagram: target T and output S pairs, e.g. T = 1, S = 0; T = 0, S = 1; T = 0, S = 0.1; T = 0, S = 0.1]
Observation: idealized behavior of the time ratio
