
An introduction to neural networks

CATI Bios4Biol, March 30th, 2020 (remote conference)


  1. An introduction to neural networks. Nathalie Vialaneix & Hyphen, nathalie.vialaneix@inra.fr, http://www.nathalievialaneix.eu & https://www.hyphen-stat.com/en/. CATI Bios4Biol, March 30th, 2020.
  2. Outline
     1. Statistical / machine / automatic learning: background and notations; underfitting / overfitting; consistency; avoiding overfitting.
     2. (non deep) Neural networks: presentation of multi-layer perceptrons; theoretical properties of perceptrons; learning perceptrons; learning in practice.
     3. Deep neural networks: overview; CNN.
  3. Background
     - Purpose: predict Y from X.
     - What we have: $n$ observations of (X, Y): $(x_1, y_1), \ldots, (x_n, y_n)$.
     - What we want: estimate the unknown Y for new observations of X: $x_{n+1}, \ldots, x_m$.
  4. Basics
     - From $(x_i, y_i)_i$, definition of a machine $\Phi_n$ such that $\hat y_{new} = \Phi_n(x_{new})$.
     - If Y is numeric, $\Phi_n$ is called a regression function (fonction de régression); if Y is a factor, $\Phi_n$ is called a classifier (classifieur).
     - $\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.
     - Desirable properties: accuracy to the observations (predictions made on known data are close to the observed values) and generalization ability (predictions made on new data are also accurate). Conflicting objectives!
  5. Underfitting / Overfitting (sous / sur-apprentissage). Sequence of illustrative figures: the function $x \mapsto y$ to be estimated; observations we might have; observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate estimation); a third estimation (overfitting); summary.
  6. Errors
     - Training error (measures the accuracy to the observations):
       - if y is a factor: misclassification rate $\frac{\#\{i :\ \hat y_i \neq y_i,\ i = 1, \ldots, n\}}{n}$;
       - if y is numeric: mean square error (MSE) $\frac{1}{n} \sum_{i=1}^n (\hat y_i - y_i)^2$, or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE} / \mathrm{Var}((y_i)_i)$.
     - Test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
       1. split the data into training / test sets (usually 80% / 20%);
       2. train $\Phi_n$ from the training dataset;
       3. compute the test error from the remaining data.
     (A minimal R sketch of this procedure follows below.)
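To make the errors and the simple-validation procedure concrete, here is a minimal R sketch on simulated data (the dataset, the 80/20 split and the linear model used as a placeholder machine $\Phi_n$ are illustrative assumptions, not part of the original slides):

```r
set.seed(42)

## simulated regression data: y = sin(1 / (x + 0.1)) + noise
n <- 200
x <- runif(n)
y <- sin(1 / (x + 0.1)) + rnorm(n, sd = 0.1)
d <- data.frame(x = x, y = y)

## simple validation: 80% training / 20% test split
in_train <- sample(seq_len(n), size = 0.8 * n)
train <- d[in_train, ]
test  <- d[-in_train, ]

## any machine Phi_n would do; a linear model is used here only as a placeholder
phi_n <- lm(y ~ x, data = train)

## training and test errors (MSE, RMSE, pseudo-R2)
mse <- function(y_hat, y_obs) mean((y_hat - y_obs)^2)
train_mse <- mse(predict(phi_n, train), train$y)
test_mse  <- mse(predict(phi_n, test),  test$y)

c(train_MSE = train_mse,
  test_MSE  = test_mse,
  test_RMSE = sqrt(test_mse),
  pseudo_R2 = 1 - test_mse / var(test$y))  # var() uses n - 1; close enough here
```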
  7. Example. Sequence of illustrative figures: the observations; the training / test datasets; the training / test errors; summary.
  8. Practical consequences
     - A machine (or an algorithm) is a method that starts from a given set of prediction functions $\mathcal{C}$ and uses the observations $(x_i, y_i)_i$ to pick the "best" prediction function according to what has already been observed. This is often done by estimating some parameters (e.g., coefficients in linear models, weights in neural networks, ...).
     - $\mathcal{C}$ can depend on user choices (e.g., architecture, number of neurons in neural networks) $\Rightarrow$ hyper-parameters. Parameters are learned from the data; hyper-parameters cannot be learned from the data.
     - Hyper-parameters often control the richness of $\mathcal{C}$: they must be carefully tuned to ensure a good accuracy and avoid overfitting.
  9. Model selection / Fair error evaluation. Several strategies can be used to estimate the error of a set $\mathcal{C}$ and to select the best family of models $\mathcal{C}$ (tuning of the hyper-parameters):
     - split the data into a training / test set ($\sim$ 67% / 33%): the model is estimated with the training dataset and a test error is computed with the remaining observations;
     - cross validation: split the data into $L$ folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold; the average of these $L$ errors is the cross validation error, $\mathrm{CV\ error} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{err}(\Phi^{-l}, \mathrm{fold}\ l)$, where $\Phi^{-l}$ is trained without fold $l$;
     - out-of-bag error: create $B$ bootstrap samples (size $n$, taken with replacement in the original sample) and compute a prediction based on out-of-bag samples (see next slide).
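A minimal R sketch of the cross-validation error described above, reusing the simulated data frame `d`, the `mse()` helper and the placeholder linear model from the earlier sketch (all assumptions of these notes, not objects defined in the slides):

```r
## L-fold cross-validation error for a given family of models
L <- 5
folds <- sample(rep(1:L, length.out = nrow(d)))   # random fold assignment

cv_errors <- sapply(1:L, function(l) {
  train_l <- d[folds != l, ]                      # train without fold l
  test_l  <- d[folds == l, ]
  phi_ml  <- lm(y ~ x, data = train_l)            # placeholder machine
  mse(predict(phi_ml, test_l), test_l$y)          # error on fold l
})

cv_error <- mean(cv_errors)                       # (1/L) * sum_l err(Phi^{-l}, fold l)
cv_error
```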
  10. Out-of-bag observations, prediction and error. The OOB (out-of-bag) error is an error based on the observations not included in the "bag": for each bootstrap sample $T^b$, $b = 1, \ldots, B$, a machine $\Phi^b$ is trained on $T^b$; the OOB prediction for an observation $(x_i, y_i)$ is computed from the machines $\Phi^b$ whose bag does not contain $i$, i.e., from the models for which $(x_i, y_i)$ was not used in training.
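A minimal R sketch of an out-of-bag error estimate under the same illustrative assumptions (data frame `d`, `mse()` helper, linear model as placeholder machine; the aggregation by averaging is one common choice):

```r
## out-of-bag (OOB) error with B bootstrap samples
B <- 50
n <- nrow(d)
oob_pred <- matrix(NA_real_, nrow = n, ncol = B)

for (b in 1:B) {
  bag   <- sample(seq_len(n), size = n, replace = TRUE)   # bootstrap sample T^b
  phi_b <- lm(y ~ x, data = d[bag, ])                     # machine trained on the bag
  oob   <- setdiff(seq_len(n), bag)                       # observations not in T^b
  oob_pred[oob, b] <- predict(phi_b, d[oob, ])
}

## OOB prediction for x_i: average over the machines whose bag excludes i
oob_hat   <- rowMeans(oob_pred, na.rm = TRUE)
oob_error <- mse(oob_hat, d$y)
oob_error
```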
  11. Other model selection methods and strategies to avoid overfitting. Main idea: penalize the training error with a quantity that represents the complexity of the model.
     - Model selection with an explicit penalization strategy: minimize ($\sim$) error + complexity. Example: the BIC/AIC criteria in linear models, e.g. $-2L + p \log n$ (with $L$ the log-likelihood and $p$ the number of parameters).
     - Avoiding overfitting with an implicit penalization strategy: constrain the parameters $w$ to take their values within a reduced subset, $\min_w\ \mathrm{error}(w) + \lambda \|w\|$, with $w$ the (learned) parameters of the model and $\|\cdot\|$ typically the $\ell_1$ or $\ell_2$ norm.
  12. What are (artificial) neural networks? Common properties: (artificial) "neural networks" is a general name for supervised and unsupervised methods developed in (vague) analogy to the brain; they are combinations (networks) of simple elements (neurons). Example of graphical representation: a network of neurons connecting INPUTS to OUTPUTS.
  13. Different types of neural networks. A neural network is defined by: 1. the network structure; 2. the neuron type. Standard examples:
     - Multilayer perceptrons (MLP, perceptron multi-couches): dedicated to supervised problems (classification and regression);
     - Radial basis function networks (RBF): same purpose but based on local smoothing;
     - Self-organizing maps (SOM, also sometimes called Kohonen's maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
     - ...
     In this talk, the focus is on MLP.
  14. MLP: Advantages / Drawbacks
     - Advantages: classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method, hence flexible; good theoretical properties.
     - Drawbacks: hard to train (high computational cost, especially when the dimension $d$ is large); overfit easily; "black box" models (hard to interpret).
  15. References. Advised references: [Bishop, 1995, Ripley, 1996] give an overview of the topic from a learning (more than statistical) perspective; [Devroye et al., 1996, Györfi et al., 2002] present statistical properties of perceptrons in dedicated chapters.
  16. Analogy to the brain
     1. a neuron collects signals from neighboring neurons through its dendrites;
     2. when the total signal is above a given threshold, the neuron is activated;
     3. ... and a signal is sent to other neurons through the axon.
     Connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron).
  17. First model of artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]. The neuron computes a weighted sum of its inputs $x^{(1)}, \ldots, x^{(p)}$ followed by a threshold: $f:\ x \in \mathbb{R}^p \mapsto \mathbb{1}_{\left\{\sum_{j=1}^p w_j x^{(j)} + w_0 \geq 0\right\}}$.
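A one-line R sketch of this threshold unit (the weights chosen below are arbitrary, for illustration only):

```r
## McCulloch-Pitts / Rosenblatt threshold unit: f(x) = 1 if sum_j w_j x_j + w_0 >= 0
threshold_neuron <- function(x, w, w0) as.numeric(sum(w * x) + w0 >= 0)

## example with arbitrary weights
threshold_neuron(x = c(0.2, -1.5, 0.7), w = c(1, 0.5, 2), w0 = -0.3)
```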
  18. (artificial) Perceptron: layers
     - MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$, or taking values in $\{1, \ldots, K-1\}$) and several hidden layers;
     - no connections within a layer;
     - connections only between two consecutive layers (feedforward).
     Example (regression, $y \in \mathbb{R}$): inputs $x = (x^{(1)}, \ldots, x^{(p)})$, a first hidden layer (weights $w^{(1)}_{jk}$), a second hidden layer, then the output $y$: a 2-hidden-layer MLP.
  19. A neuron in MLP. A neuron receives inputs $v_1, v_2, v_3, \ldots$, computes the weighted sum $\sum_j w_j v_j + w_0$ (where $w_0$ is the bias, biais), and applies an activation function. Standard activation functions (fonctions de transfert / d'activation):
     - biologically inspired: the Heaviside function $h(z) = 0$ if $z < 0$, $1$ otherwise; main issue: it is not continuous;
     - identity: $h(z) = z$; but the identity activation used with one hidden layer gives a linear model, which is not flexible enough;
     - logistic function: $h(z) = \frac{1}{1 + \exp(-z)}$;
     - rectified linear (ReLU), another popular activation function (useful to model positive real numbers): $h(z) = \max(0, z)$;
     - general sigmoid: a nondecreasing function $h: \mathbb{R} \to \mathbb{R}$ such that $\lim_{z \to +\infty} h(z) = 1$ and $\lim_{z \to -\infty} h(z) = 0$.
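The activation functions listed above, written as plain R functions (a direct transcription of the formulas on the slide; the plotting grid is an arbitrary choice):

```r
## standard activation functions
heaviside    <- function(z) ifelse(z < 0, 0, 1)
identity_act <- function(z) z
logistic     <- function(z) 1 / (1 + exp(-z))
relu         <- function(z) pmax(0, z)

## quick visual comparison on a grid
z <- seq(-4, 4, by = 0.01)
plot(z, logistic(z), type = "l", ylim = c(-1, 4), ylab = "h(z)")
lines(z, heaviside(z), lty = 2)
lines(z, relu(z), lty = 3)
legend("topleft", legend = c("logistic", "Heaviside", "ReLU"), lty = 1:3)
```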
  20. Focus on one-hidden-layer perceptrons (R package nnet)
     - Regression case: $\Phi(x) = \sum_{k=1}^{Q} w^{(2)}_k\, h_k\!\left( x^\top w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0$, with $h_k$ a (logistic) sigmoid and $Q$ the number of hidden neurons.
     - Binary classification case ($y \in \{0, 1\}$): $\phi(x) = h_0\!\left( \sum_{k=1}^{Q} w^{(2)}_k\, h_k\!\left( x^\top w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \right)$, with $h_0$ the logistic sigmoid or the identity; decision with $\Phi(x) = 0$ if $\phi(x) < 1/2$ and $1$ otherwise.
     - Extension to any classification problem in $\{1, \ldots, K-1\}$: straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to $K$) and a maximum probability rule for the decision.
     (See the small nnet example below.)
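A minimal sketch with the R package nnet mentioned on the slide, fitting a one-hidden-layer perceptron with Q = 5 hidden neurons on the simulated regression data `d` from the earlier sketch, and on the iris data for classification (the choice of Q and of the datasets are illustrative assumptions):

```r
library(nnet)

## regression: linout = TRUE gives an identity output unit (h_0 = identity)
fit_reg <- nnet(y ~ x, data = d, size = 5, linout = TRUE, maxit = 500, trace = FALSE)
head(predict(fit_reg, d))

## classification: sigmoid / softmax output and maximum-probability decision rule
fit_cls <- nnet(Species ~ ., data = iris, size = 5, maxit = 500, trace = FALSE)
table(predicted = predict(fit_cls, iris, type = "class"), observed = iris$Species)
```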
  21. Theoretical properties of perceptrons. This section answers two questions:
     1. can we approximate any function $g: [0, 1]^p \to \mathbb{R}$ arbitrarily well with a perceptron?
     2. when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)
  22. Illustration of the universal approximation property. Simple example: a function to approximate, $g: x \in [0, 1] \mapsto \sin\!\left(\frac{1}{x + 0.1}\right)$; we try to approximate this function (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer.
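A possible R sketch reproducing this illustration with nnet (the grid of hidden-layer sizes and the plotting choices are assumptions, not taken from the slides):

```r
library(nnet)

## the function to approximate, sampled on [0, 1]
x <- seq(0, 1, length.out = 300)
g <- sin(1 / (x + 0.1))
dat <- data.frame(x = x, g = g)

## fit one-hidden-layer MLPs with increasing numbers of hidden neurons
plot(x, g, type = "l", lwd = 2, ylab = "g(x)")
sizes <- c(2, 5, 10, 20)
for (k in seq_along(sizes)) {
  fit <- nnet(g ~ x, data = dat, size = sizes[k], linout = TRUE,
              maxit = 1000, trace = FALSE)
  lines(x, predict(fit, dat), col = k + 1)
}
legend("topright", legend = paste(sizes, "neurons"), col = 2:5, lty = 1)
```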
  23. Sketch of main properties of 1-hidden-layer MLP
     1. universal approximation [Pinkus, 1999, Devroye et al., 1996, Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]: 1-hidden-layer MLP can approximate arbitrarily well any (regular enough) function (but, in theory, the number of neurons can grow arbitrarily for that);
     2. consistency [Farago and Lugosi, 1993, Devroye et al., 1996, White, 1990, White, 1991, Barron, 1994, McCaffrey and Gallant, 1994]: 1-hidden-layer MLP are universally consistent with a number of neurons that grows to $+\infty$ slower than $O\!\left(\frac{n}{\log n}\right)$.
     $\Rightarrow$ the number of neurons controls the richness of the MLP and has to increase with $n$.
  24. Empirical error minimization. Given i.i.d. observations $(x_i, y_i)_i$ of (X, Y), how to choose the weights $w$? Standard approach: minimize the empirical $L^2$ risk $L_n(w) = \sum_{i=1}^n \left[ f_w(x_i) - y_i \right]^2$, with $y_i \in \mathbb{R}$ for the regression case and $y_i \in \{0, 1\}$ for the classification case (with the associated decision rule $x \mapsto \mathbb{1}_{\{\Phi_w(x) \geq 1/2\}}$). But $L_n(w)$ is not convex in $w$ $\Rightarrow$ general (non convex) optimization problem.
  25. Optimization with gradient descent. Method: initialize the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ (randomly or with some prior knowledge), then
     - batch approach: for $t = 1, \ldots, T$: $w(t+1) = w(t) - \mu(t)\, \nabla_w L_n(w(t))$;
     - online (or stochastic) approach: write $L_n(w) = \sum_{i=1}^n \underbrace{[f_w(x_i) - y_i]^2}_{=\, E_i(w)}$ and, for $t = 1, \ldots, T$, randomly pick $i \in \{1, \ldots, n\}$ and update $w(t+1) = w(t) - \mu(t)\, \nabla_w E_i(w(t))$.
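A minimal R sketch of the two update rules above. To keep the gradients short, the machine $f_w$ is taken to be a simple linear model rather than a perceptron; the data, step sizes and numbers of iterations are illustrative assumptions:

```r
## batch vs stochastic (online) gradient descent on the empirical L2 risk
## L_n(w) = sum_i [f_w(x_i) - y_i]^2, with a linear machine f_w(x) = w_1 + w_2 * x
set.seed(1)
x <- runif(100)
y <- 2 * x - 1 + rnorm(100, sd = 0.1)

grad_Ln <- function(w) {                 # gradient of the full risk (batch)
  r <- (w[1] + w[2] * x) - y
  2 * c(sum(r), sum(r * x))
}
grad_Ei <- function(w, i) {              # gradient of a single term E_i (online)
  r <- (w[1] + w[2] * x[i]) - y[i]
  2 * c(r, r * x[i])
}

## batch approach: w(t+1) = w(t) - mu(t) * grad L_n(w(t))
w_batch <- c(0, 0)
for (t in 1:500) w_batch <- w_batch - 0.002 * grad_Ln(w_batch)

## online / stochastic approach: pick i at random, use grad E_i only
w_sgd <- c(0, 0)
for (t in 1:5000) {
  i <- sample.int(100, 1)
  w_sgd <- w_sgd - (0.2 / (1 + 0.005 * t)) * grad_Ei(w_sgd, i)  # decreasing step
}

rbind(batch = w_batch, stochastic = w_sgd, target = c(-1, 2))
```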
  26. Discussion about practical choices for this approach
     - the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of $\mu(t)$, but convergence can be slow;
     - the stochastic version is usually inefficient but is useful for large datasets ($n$ large);
     - more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses higher order derivatives (BFGS algorithm), and this is a very active area of research in which many improvements have been made recently;
     - in all cases, the solutions returned are, at best, local minima that strongly depend on the initialization: using more than one initialization state is advised, as well as checking the variability among a few runs.
  27. Gradient backpropagation method [Rumelhart and Mc Clelland, 1986]. The gradient backpropagation (rétropropagation du gradient) principle is used to compute gradients easily in perceptrons (or in other types of neural networks). Stochastic gradient descent then alternates:
     - a forward step, which computes the outputs for the observations $x_i$ given the current value of the weights $w$;
     - a backward step, in which the gradient backpropagation principle is used to obtain the gradient for the current weights $w$.
  28. Backpropagation in practice (one-hidden-layer perceptron, one observation $(x_i, y_i)$)
     - Initialize the weights.
     - Forward step: for all $k$, compute $a^{(1)}_k = x_i^\top w^{(1)}_k + w^{(0)}_k$, then $z^{(1)}_k = h_k(a^{(1)}_k)$, and finally $a^{(2)} = \sum_{k=1}^{Q} w^{(2)}_k z^{(1)}_k + w^{(2)}_0$, so that $\Phi_w(x_i) = h_0(a^{(2)})$.
     - Backward step: compute $\frac{\partial E_i}{\partial w^{(2)}_k} = \delta^{(2)} z^{(1)}_k$ with $\delta^{(2)} = \frac{\partial E_i}{\partial a^{(2)}} = \frac{\partial [h_0(a^{(2)}) - y_i]^2}{\partial a^{(2)}} = 2\, h_0'(a^{(2)})\, [h_0(a^{(2)}) - y_i]$; then $\frac{\partial E_i}{\partial w^{(1)}_{kj}} = \delta^{(1)}_k x^{(j)}_i$ with $\delta^{(1)}_k = \frac{\partial E_i}{\partial a^{(1)}_k} = \frac{\partial E_i}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial a^{(1)}_k} = \delta^{(2)} w^{(2)}_k h_k'(a^{(1)}_k)$; and $\frac{\partial E_i}{\partial w^{(0)}_k} = \delta^{(1)}_k$.
     (See the R sketch below.)
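The forward and backward steps above, transcribed into a small R function for one observation (logistic hidden units and an identity output unit are assumed here; the weights and data in the example call are random, for illustration only):

```r
## forward / backward pass for a one-hidden-layer perceptron on one observation
## hidden activations h_k = logistic, output activation h_0 = identity (regression)
logistic <- function(z) 1 / (1 + exp(-z))

backprop_one <- function(xi, yi, W1, w0, w2, w20) {
  ## W1: p x Q matrix of first-layer weights, w0: Q first-layer biases,
  ## w2: Q second-layer weights, w20: output bias
  a1  <- drop(crossprod(W1, xi)) + w0     # a^(1)_k = x_i' w^(1)_k + w^(0)_k
  z1  <- logistic(a1)                     # z^(1)_k = h_k(a^(1)_k)
  a2  <- sum(w2 * z1) + w20               # a^(2)
  phi <- a2                               # Phi_w(x_i) = h_0(a^(2)), h_0 = identity

  delta2 <- 2 * (phi - yi)                # dE_i / da^(2) (h_0' = 1)
  delta1 <- delta2 * w2 * z1 * (1 - z1)   # dE_i / da^(1)_k, logistic'(a) = z(1 - z)

  list(output   = phi,
       grad_w2  = delta2 * z1,            # dE_i / dw^(2)_k
       grad_w20 = delta2,                 # dE_i / dw^(2)_0
       grad_W1  = outer(xi, delta1),      # dE_i / dw^(1)_kj = delta^(1)_k x^(j)_i
       grad_w0  = delta1)                 # dE_i / dw^(0)_k
}

## tiny example: p = 3 inputs, Q = 4 hidden neurons, random weights
set.seed(2)
p <- 3; Q <- 4
backprop_one(xi = rnorm(p), yi = 0.5,
             W1 = matrix(rnorm(p * Q), p, Q), w0 = rnorm(Q),
             w2 = rnorm(Q), w20 = 0)
```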
  29. Initialization and stopping of the training algorithm
     1. How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$. In the R package nnet, weights are sampled uniformly in $[-0.5, 0.5]$, or in $\left[ -\frac{1}{\max_i X^{(j)}_i}, \frac{1}{\max_i X^{(j)}_i} \right]$ if $X^{(j)}$ is large.
     2. When to stop the algorithm (gradient descent or alike)? Standard choices: a bounded number of iterations $T$; a target value of the error $\hat L_n(w)$; a target value of the evolution $\hat L_n(w(t)) - \hat L_n(w(t+1))$. In the R package nnet, a combination of the three criteria is used and is tunable.
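For reference, a hedged sketch of how these choices surface in the nnet interface (rang, maxit, abstol and reltol are documented nnet arguments; the values below and the reuse of the simulated data `d` are illustrative assumptions):

```r
library(nnet)

## rang:   half-width of the uniform interval used to initialize the weights
## maxit:  bound T on the number of iterations
## abstol: stop when the fitting criterion falls below this value
## reltol: stop when the relative decrease of the criterion is below this value
fit <- nnet(y ~ x, data = d, size = 5, linout = TRUE,
            rang = 0.5, maxit = 200, abstol = 1e-4, reltol = 1e-8,
            trace = FALSE)
fit$value   # final value of the fitting criterion
```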
  30. Strategies to avoid overfitting
     - properly tune $Q$ with a CV or a bootstrap estimation of the generalization ability of the method;
     - early stopping: for $Q$ large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error computed on this dataset starts to increase;
     - weight decay: for $Q$ large enough, penalize the empirical risk with a function of the weights, e.g., $\hat L_n(w) + \lambda\, w^\top w$;
     - noise injection: modify the input data with a random noise during the training.
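A minimal sketch of weight decay and of tuning Q by cross validation with nnet (the grids for size and decay, and the reuse of the data frame `d`, the fold assignment `folds`, the number of folds `L` and the `mse()` helper from the earlier sketches, are assumptions of these notes):

```r
library(nnet)

## tune the number of hidden neurons Q and the weight decay lambda by L-fold CV
grid <- expand.grid(size = c(2, 5, 10), decay = c(0, 0.001, 0.01, 0.1))

cv_err <- apply(grid, 1, function(g) {
  errs <- sapply(1:L, function(l) {
    fit <- nnet(y ~ x, data = d[folds != l, ], size = g["size"],
                decay = g["decay"], linout = TRUE, maxit = 500, trace = FALSE)
    mse(predict(fit, d[folds == l, ]), d$y[folds == l])
  })
  mean(errs)
})

cbind(grid, cv_err)[order(cv_err), ][1, ]   # best (size, decay) pair
```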
  31. What is "deep learning"? Learning with neural networks (as in the 2-hidden-layer MLP shown before), except that... the number of neurons is huge! Image taken from https://towardsdatascience.com/.
  32. Main types of deep NN
     - deep MLP;
     - CNN (Convolutional Neural Network): used in image processing (images taken from [Krizhevsky et al., 2012, Zeiler and Fergus, 2014]);
     - RNN (Recurrent Neural Network): used for data with a temporal sequence (in natural language processing for instance).
  33. Why such a big buzz around deep learning? ImageNet results, http://www.image-net.org/challenges/LSVRC/ (image by courtesy of Stéphane Canu). See also: http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  34. Improvements in learning parameters
     - optimization algorithm: gradient descent is still used, but the choice of the descent step $\mu(t)$ has been improved [Ruder, 2016] and mini-batches are also often used to speed up the learning (e.g., ADAM [Kingma and Ba, 2017]);
     - avoid overfitting with dropout: randomly remove a fraction $p$ of the neurons (and their connections) during each step of the training (a larger number of iterations is needed to converge, but each update is faster) [Srivastava et al., 2014].
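A hedged sketch of how dropout and the ADAM optimizer are typically specified with the keras package in R (this assumes keras is installed with a TensorFlow backend; the architecture, the dropout rate of 0.5 and the 784-dimensional input are illustrative choices, not taken from the slides):

```r
library(keras)

## a deep MLP with dropout after each hidden layer, trained with ADAM
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.5) %>%        # randomly drop 50% of the neurons at each step
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = optimizer_adam(),        # ADAM: adaptive choice of the descent step
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

## model %>% fit(x_train, y_train, epochs = 20, batch_size = 128,
##               validation_split = 0.2)   # x_train / y_train not defined here
```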
  35. Main available implementations

     |             | Theano | TensorFlow | Caffe | PyTorch (and autograd) | Keras |
     |-------------|--------|------------|-------|------------------------|-------|
     | Date        | 2008-2017 | 2015-... | 2013-... | 2016-... | 2017-... |
     | Main dev    | Montreal University (Y. Bengio) | Google Brain | Berkeley University | community driven (Facebook, Google, ...) | François Chollet / RStudio |
     | Language(s) | Python | C++ / Python | C++ / Python / Matlab | Python (with strong GPU accelerations) | high-level API on top of TensorFlow (and CNTK, Theano, ...): Python, R |
  36. General architecture of CNN. Image taken from the article https://doi.org/10.3389/fpls.2017.02235.
  37. Convolutional layer. Images taken from https://towardsdatascience.com.
  38. Pooling layer. Basic principle: select a subsample of the previous layer values; max pooling is generally used. Image taken from Wikimedia Commons, by Aphex34.
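To tie the convolutional and pooling layers together, a hedged keras (R) sketch of a small CNN in the spirit of the architecture slide (the numbers of filters, kernel sizes and the 28x28 grayscale input are illustrative assumptions):

```r
library(keras)

## convolution layers extract local features, max pooling subsamples them,
## and dense layers on the flattened maps produce the final classification
cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

summary(cnn)   # layer-by-layer description of the architecture
```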
  39. References
     - Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133.
     - Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
     - Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA.
     - Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151.
     - Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
     - Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
     - Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.
     - Kingma, D. P. and Ba, J. (2017). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
     - Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Proceedings of Neural Information Processing Systems (NIPS 2012), volume 25.
     - Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133.
     - McCaffrey, D. and Gallant, A. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1):115–133.
     - Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
     - Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
     - Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
     - Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
     - Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
     - Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, MA, USA.
     - Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
     - Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.
     - White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.
     - White, H. (1991). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium of the Interface: Computing Science and Statistics, pages 190–199, Alexandria, VA, USA. American Statistical Association.
     - Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV 2014).
