Csss2010 20100803-kanevski-lecture2


  1. 1. Machine Learning Algorithms: Theory, Applications and Software Tools Lecture 2 Basics of ANN: MLP Prof. Mikhail Kanevski Institute of Geomatics and Analysis of Risk, University of Lausanne Mikhail.Kanevski@unil.ch Prof. M. Kanevski 1
  2. 2. Contents • Introduction to artificial neural networks • Multilayer perceptron • Case studies Prof. M. Kanevski 2
  3. 3. Basics of ANN Artificial neural networks are analytical systems that address problems whose solutions have not been explicitly formulated. In this way they contrast to classical computers and computer programs, which are designed to solve problems whose solutions - although they may be extremely complex - have been made explicit. Prof. M. Kanevski 3
  4. 4. Basics of ANN • We can program or train neural networks to store, recognise, and associatively retrieve patterns; • to filter noise from measurement data; • to control ill-defined problems; in summary: • to estimate sampled functions when we do not know the form of the functions. Prof. M. Kanevski 4
  5. 5. Basics of ANN Unlike statistical estimators, they estimate a function without a mathematical model of how outputs depend on inputs. Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data. Prof. M. Kanevski 5
  6. 6. Basics of ANN The major applications of ANN: • Feature recognition (pattern classification). Speech recognition • Signal processing • Time-series prediction • Function approximation and regression, classification • Data Mining • Intelligent control • Associative memories • Optimisation • And many others Prof. M. Kanevski 6
  7. 7. Basics of ANN. Simple biological neuron Prof. M. Kanevski 7
  8. 8. Basics of ANN. Simple model of the neuron Prof. M. Kanevski 8
  9. 9. Examples of transfer functions: the logistic (sigmoid) function f(x) = 1 / (1 + exp(-x)) and the hyperbolic tangent tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)). Prof. M. Kanevski 9
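As a concrete illustration of these two transfer functions (not part of the original slides), here is a minimal NumPy sketch; the function names are my own:

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) transfer function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: (exp(x) - exp(-x)) / (exp(x) + exp(-x))."""
    return np.tanh(x)

x = np.linspace(-5, 5, 11)
print(logistic(x))   # values in (0, 1)
print(tanh(x))       # values in (-1, 1)
```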
  10. 10. Basics of ANN The main parts of ANN: • Neurones (nodes, cells, units, processing elements) • Network topology (connections between neurones) Prof. M. Kanevski 10
  11. 11. Basics of ANN In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and connections form a network topology. Prof. M. Kanevski 11
  12. 12. Multilayer perceptron Prof. M. Kanevski 12
  13. 13. Basics of ANN. ANN learning/training. Supervised learning is the most common form of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O(Input(i)), is compared with the desired Output(i). After the entire training subset has been presented, the weights are updated. This updating is done in such a way that a measure of the error between the network's and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used. Prof. M. Kanevski 13
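The epoch/epoch-size scheme described above can be sketched as follows. This is an illustrative toy (a simple linear model stands in for the network and the data are synthetic), not the course software:

```python
import numpy as np

# Toy data and a toy linear "network" y = w*x + b, just to make the loop concrete.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
T = 2.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

w, b, eta = 0.0, 0.0, 0.1
epoch_size = 20                       # number of samples presented before an update

for epoch in range(200):
    idx = rng.choice(len(X), size=epoch_size, replace=False)  # training subset
    grad_w, grad_b = 0.0, 0.0
    for i in idx:                     # present the samples one by one
        err = (w * X[i] + b) - T[i]   # network output minus desired output
        grad_w += err * X[i]
        grad_b += err
    # the weights are updated only after the whole subset has been presented
    w -= eta * grad_w / epoch_size
    b -= eta * grad_b / epoch_size

print(w, b)   # approaches the true values 2.0 and 0.5
```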
  14. 14. Basics of ANN. ANN supervised learning. [Diagram: a teacher provides examples; the neural network produces a response; the response is evaluated and the learning algorithm makes modifications to the network] Prof. M. Kanevski 14
  15. 15. Basics of ANN. Feedforward ANN. If there are no feedback and lateral connections we have a feedforward ANN. The most frequently used model is the so-called multi-layer perceptron. The term feedforward means that information flows only in one direction - from the input to the output. Prof. M. Kanevski 15
  16. 16. ANN Multi-layer Perceptron (MLP) • Depends only on the data and its inner structure • Is able to learn from data and generalise • Good at modelling non-linearities • Robust to noise and outliers [ANN = artificial neurons + connection weights] Prof. M. Kanevski 16
  17. 17. Basics of ANN All knowledge of an ANN is based on the synaptic weights between units. Prof. M. Kanevski 17
  18. 18. The Universality Property • A two-layer feed-forward neural network with step activation functions can implement any Boolean function, provided that the number of hidden neurons H is sufficiently large. Prof. M. Kanevski 18
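As a small hedged illustration of this universality property, the sketch below (my own example, not from the slides) implements the Boolean XOR function with a two-layer feed-forward network using step activations and two hidden neurons:

```python
import numpy as np

def step(x):
    """Step (threshold) activation."""
    return (x > 0).astype(int)

def xor_net(x1, x2):
    """Two-layer feed-forward net with step activations implementing XOR."""
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: logical OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: logical AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND  ->  XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(np.array(a), np.array(b)))
```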
  19. 19. MLP modelling. F1(t, w) = w1^out f(w1 t + b1) + b^out, F2(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + b^out, F3(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + w3^out f(w3 t + b3) + b^out. Prof. M. Kanevski 19
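A minimal sketch of F_H(t, w) for a one-input MLP with H hidden neurons and the logistic transfer function f; the weights used here are arbitrary illustrative values, not fitted ones:

```python
import numpy as np

def mlp_1d(t, w_hidden, b_hidden, w_out, b_out):
    """F_H(t, w) = sum_h w_out[h] * f(w_hidden[h] * t + b_hidden[h]) + b_out,
    with f the logistic transfer function; one input, H hidden neurons, one output."""
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.sum(w_out * f(w_hidden * t + b_hidden)) + b_out

# F_3: three hidden neurons (weights chosen arbitrarily for the example)
w_h = np.array([1.0, -2.0, 0.5])
b_h = np.array([0.0, 1.0, -0.5])
w_o = np.array([2.0, 1.0, -1.5])
print(mlp_1d(0.7, w_h, b_h, w_o, b_out=0.1))
```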
  20. 20. Backpropagation training Prof. M. Kanevski 20
  21. 21. The error function depends on the network's weights (W): E_l(W) = (1/n) * sum_{j=0}^{n-1} { T_lj - Z_lj^out(W) }^2 Prof. M. Kanevski 21
  22. 22. MLP training algorithms. Optimisation algorithms used for MLP training: • Stochastic - Annealing - Genetic algorithm • Gradient - Conjugate gradients (slow 1st order gradient algorithm) - Levenberg-Marquardt (fast 2nd order gradient algorithm) - BFGS formula (quasi-Newton) - Steepest Descent - RProp (resilient propagation) - BackProp (back propagation) Prof. M. Kanevski 22
  23. 23. Feedforward ANN: Multilayer perceptron. Backprop algorithm • The possibilities and capabilities of multi-layer perceptrons stem from the non-linearities used within nodes. MLP can learn with a supervised learning rule - the backpropagation algorithm. The Backward Error Propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons. • The backpropagation algorithm is a supervised, iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables. Prof. M. Kanevski 23
  24. 24. Basics of ANN Backpropagation Algorithm The backpropagation algorithm follows these algorithmic steps: • 1. Initialize the weights. It is usually recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters]. • 2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network. • 3. Calculate the actual output of the ANN. Prof. M. Kanevski 24
  25. 25. Basics of ANN Backpropagation Algorithm • 4. Calculate the error measure and update the weights. Use a recursive algorithm starting at the output neurons (nodes) and working back to the first hidden layer - it is this backward propagation of output errors that inspired the name for this training algorithm. Update the weights W by: Prof. M. Kanevski 25
  26. 26. We want to know how to modify the weights in order to decrease the error function: w_ij(t+1) - w_ij(t) ∝ - ∂E(t)/∂w_ij(t) Prof. M. Kanevski 26
  27. 27. Basics of ANN Backpropagation Algorithm w_ij^m(n+1) = w_ij^m(n) + η δ_i^m Z_j^(m-1), where n is the iteration step, η is the rate of learning (0 < η ≤ 1), and Z_j^(m-1) is the output of the j-th neurone in layer (m-1); the error δ_i for the output layer is defined by the following equation. Prof. M. Kanevski 27
  28. 28. Basics of ANN Backpropagation Algorithm δ_i^out = Z_i^out (1 - Z_i^out)(T_i - Z_i^out), δ_i^(h-1) = Z_i^h (1 - Z_i^h) Σ_j w_ij^h δ_j^h Prof. M. Kanevski 28
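Putting the update rule and the two δ equations together, a simplified NumPy sketch of one backpropagation step for a single-hidden-layer MLP with logistic units is given below. Biases are omitted and the indexing convention is mine, so treat it as illustrative rather than the exact algorithm of the slides:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One weight update for a one-hidden-layer MLP with logistic units,
    following w(n+1) = w(n) + eta * delta * z from the slides (biases omitted)."""
    # forward pass
    z_hid = logistic(W1 @ x)            # hidden-layer outputs Z^h
    z_out = logistic(W2 @ z_hid)        # output-layer outputs Z^out
    # deltas: output layer first, then propagated back to the hidden layer
    delta_out = z_out * (1 - z_out) * (t - z_out)
    delta_hid = z_hid * (1 - z_hid) * (W2.T @ delta_out)
    # weight updates: w <- w + eta * delta_i * z_j
    W2 += eta * np.outer(delta_out, z_hid)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.5, size=(3, 2))  # 2 inputs -> 3 hidden neurons
W2 = rng.normal(0.0, 0.5, size=(1, 3))  # 3 hidden -> 1 output neuron
x, t = np.array([0.2, -0.4]), np.array([0.8])
for _ in range(100):
    W1, W2 = backprop_step(x, t, W1, W2)
print(logistic(W2 @ logistic(W1 @ x)))  # output moves towards the target 0.8
```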
  29. 29. Basics of ANN Backpropagation Algorithm. Other error measures (such as maximum absolute error and median squared error) have even greater advantages in many situations. For example, median squared error is useful because, unlike the mean, the median is a robust statistic - its value is insensitive to occasional large errors in the training data. Unfortunately, practical techniques for implementing these more desirable error measures do not yet exist. Thus, most neural networks today are tied to mean squared error measurements. Prof. M. Kanevski 29
  30. 30. Basics of ANN Backpropagation Algorithm More general error functions can be written taking into account the importance (weighting, declustering, economic criteria, etc.) of the samples presented to the network: E_l(W) = sum_{j=0}^{n-1} ω_lj { T_lj - Z_lj^out(W) }^2 Prof. M. Kanevski 30
  31. 31. Gradient descent [Figure: error surface J(w) with the direction of the gradient J'(w) and the minimum marked] Prof. M. Kanevski 31
  32. 32. Gradient descent [Figure: J(w) with the minimum marked] Prof. M. Kanevski 32
  33. 33. In reality the situation with the error function and the corresponding optimization problem is much more complicated: the presence of multiple local minima! Prof. M. Kanevski 33
  34. 34. Gradient descent Local minima Prof. M. Kanevski 34
  35. 35. SA (simulated annealing): Illustration Prof. M. Kanevski 35
  36. 36. How important are local minima? (Duda et al. 2001) In computational practice, we do not want our network to be caught in a local minimum with high training error, because this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net. Prof. M. Kanevski 36
  37. 37. How important are local minima? (Duda et al. 2001) In many problems, convergence to a nonglobal minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge to the global minimum in order to reach acceptable performance. Prof. M. Kanevski 37
  38. 38. In short: the presence of multiple minima does not necessarily present difficulties in training nets, and a few simple heuristics can often overcome such problems (see next slide). Prof. M. Kanevski 38
  39. 39. Practical techniques for improving backpropagation • Activation function (sigmoid, hyperbolic tangent, ...) • Scaling inputs • Training with noise (noise injection) • Initializing weights (simulated annealing) • Regularization (weight decay) • Number of hidden layers • Learning parameters (rates, momentum, ...) • Cost function Prof. M. Kanevski 39
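As an illustration of two of these heuristics (training with noise injection and weight decay), here is a small sketch; to keep it self-contained, a linear model trained by gradient descent stands in for the MLP, and the noise level and decay coefficient are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                       # training inputs
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(0, 0.1, 200)

w = np.zeros(2)
eta, sigma, lam = 0.1, 0.05, 1e-3       # learning rate, injected-noise std, weight decay

for epoch in range(200):
    X_noisy = X + rng.normal(0.0, sigma, size=X.shape)      # noise injection on the inputs
    err = X_noisy @ w - y
    grad = X_noisy.T @ err / len(y) + lam * w               # weight decay = L2 penalty on w
    w -= eta * grad

print(w)   # close to the true coefficients (1.5, -0.7)
```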
  40. 40. Interpretation of network's outputs. Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form E = lim_{N→∞} (1/2N) Σ_{n=1}^{N} Σ_k { y_k(x^n; w) - t_k^n }^2 = (1/2) Σ_k ∫∫ { y_k(x; w) - t_k }^2 p(t_k, x) dt_k dx Prof. M. Kanevski 40
  41. 41. Interpretation of network's outputs. The network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x: y_k(x; w*) = ⟨ t_k | x ⟩ Prof. M. Kanevski 41
  42. 42. DEMO Prof. M. Kanevski 42
  43. 43. MLP and number of layers • The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point. • On the other hand, with two hidden layers, the approximation process becomes more manageable. Prof. M. Kanevski 43
  44. 44. Two hidden layers! (Haykin) 1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions. 2. Global features are extracted in the second layer. Specifically, a neuron in the second hidden layer combines the outputs of neurons in the first hidden layer operating on a particular region of the input space and thereby learns the global features for that region and outputs zero elsewhere. Prof. M. Kanevski 44
  45. 45. Data Preprocessing • Machine learning algorithms are data-driven methods. • The quality and quantity of data is essential for training and generalization. [Diagram: Input data → Pre-processing → MLA → Post-processing → Results] Prof. M. Kanevski 45
  46. 46. Types of pre-processing: 1. Linear and nonlinear transformations, e.g. input scaling/normalisation, Z-score transform, square root transform, N-score transform, etc. 2. Dimensionality reduction 3. Incorporate prior knowledge (invariants, hints, ...) 4. Feature extraction: linear/nonlinear combination of input variables 5. Feature selection: decide which features to use Prof. M. Kanevski 46
  47. 47. Dimensionality reduction • Two approaches are available to perform dimensionality reduction: • Feature extraction: creating a subset of new features by combinations of the existing features • Feature selection: choosing a subset of all the features (the more informative ones) Prof. M. Kanevski 47
  48. 48. Feature selection/extraction Prof. M. Kanevski 48
  49. 49. Feature selection • Reducing the feature space by throwing out some of the features (covariates) – also called variable selection • Motivating idea: try to find a simple, "parsimonious" model (Occam's razor!) Prof. M. Kanevski 49
  50. 50. Univariate selection may fail. Guyon-Elisseeff, JMLR 2004; Springer 2006 Prof. M. Kanevski 50
  51. 51. Dimensionality Reduction. We clearly lose some information, but this can be helpful due to the curse of dimensionality. We need some way of deciding which dimensions to keep: 1. Random choice 2. Principal components analysis (PCA) 3. Independent components analysis (ICA) 4. Self-organised maps (SOM) Prof. M. Kanevski 51
  52. 52. Data transforms • Y = aZ + b • Y = Log(Z) • Y = Ind(Z, Zs) • Normalisation (Z-score): Y = (Z - Zm)/σ • Box-Cox nonlinear transform: Y(λ) = (Z^λ - 1)/λ if λ > 0, Y(λ = 0) = Ln(Z) Prof. M. Kanevski 52
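A short sketch of the Z-score and Box-Cox transforms listed above (the Box-Cox form assumes Z > 0, and the test data are synthetic):

```python
import numpy as np

def zscore(z):
    """Z-score normalisation: Y = (Z - Zm) / sigma."""
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    """Box-Cox transform: (Z**lam - 1)/lam for lam > 0, Ln(Z) for lam = 0 (requires Z > 0)."""
    return np.log(z) if lam == 0 else (z**lam - 1.0) / lam

z = np.random.default_rng(0).lognormal(mean=1.0, sigma=0.5, size=1000)
print(zscore(z)[:3])
print(box_cox(z, 0.5)[:3])
print(box_cox(z, 0.0)[:3])
```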
  53. 53. Model Selection & Model Evaluation Prof. M. Kanevski 53
  54. 54. Guillaume d'Occam (1285 - 1349): "Pluralitas non est ponenda sine necessitate" Occam's razor: "The simpler explanation of the phenomena is more likely to be correct" Prof. M. Kanevski 54
  55. 55. Model Assessment and Model Selection: Two separate goals Prof. M. Kanevski 55
  56. 56. Model Selection: estimating the performance of different models in order to choose the (approximately) best one. Model Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. Prof. M. Kanevski 56
  57. 57. If we are in a data-rich situation, the best solution is to split the data randomly (?): Raw Data → Train: 50%, Validation: 25%, Test: 25% Prof. M. Kanevski 57
  58. 58. Interpretation • The training set is used to fit the models • The validation set is used to estimate prediction error for model selection (tuning hyperparameters) • The test set is used for assessment of the generalization error of the final chosen model. Elements of Statistical Learning - Hastie, Tibshirani & Friedman 2001 Prof. M. Kanevski 58
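A minimal sketch of the random 50/25/25 split described on the previous slides; the function name and the data are illustrative stand-ins:

```python
import numpy as np

def split_data(X, y, seed=0):
    """Random 50/25/25 split into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.50 * len(X)), int(0.25 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 500 250 250
```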
  59. 59. Bias and Variance. Model's complexity. [Figure: fitted curves illustrating (b) overfitting and (c) underfitting] Prof. M. Kanevski 59
  60. 60. One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely but does not generalise well, that is, it cannot model sufficiently well unseen data from the same task. Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error. Prof. M. Kanevski 60
  61. 61. Bias-Variance Dilemma. Assume that Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ_ε^2. Prof. M. Kanevski 61
  62. 62. We can derive an expression for the expected prediction error of a regression at an input point X=x0 using squared-error loss: Prof. M. Kanevski 62
  63. 63. Err(x0) = E[(Y - f̂(x0))^2 | X = x0] = σ_ε^2 + [E f̂(x0) - f(x0)]^2 + E[f̂(x0) - E f̂(x0)]^2 = σ_ε^2 + Bias^2(f̂(x0)) + Var(f̂(x0)) = Irreducible Error + Bias^2 + Variance Prof. M. Kanevski 63
  64. 64. • The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σ_ε^2 = 0. • The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. • The last term is the variance, the expected squared deviation of f̂(x0) around its mean. Prof. M. Kanevski 64
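The decomposition can be checked numerically. The Monte Carlo sketch below (my own example: a cubic polynomial fitted to noisy samples of a sine function) estimates the squared bias and the variance of f̂(x0) by repeating the fit over many training sets:

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)        # true function f(x)
x0, sigma_eps, n, trials = 0.3, 0.2, 30, 2000
degree = 3                                 # complexity of the fitted model

rng = np.random.default_rng(0)
preds = np.empty(trials)
for i in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma_eps, n)          # Y = f(X) + eps
    coeffs = np.polyfit(x, y, degree)               # the estimator f_hat
    preds[i] = np.polyval(coeffs, x0)               # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0))**2                   # [E f_hat(x0) - f(x0)]^2
variance = preds.var()                              # E[f_hat(x0) - E f_hat(x0)]^2
print(bias2, variance, sigma_eps**2)                # adding sigma_eps^2 gives the expected error at x0
```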
  65. 65. Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001 Prof. M. Kanevski 65
  66. 66. Prof. M. Kanevski 66
  67. 67. • A neural network is only as good as the training data! • Poor training data inevitably leads to an unreliable and unpredictable network. • Exploratory Data Analysis and data preprocessing are extremely important!!! Prof. M. Kanevski 67
  68. 68. MLP modelling. Case Studies. Original (10 000 points), Training (900 points) Prof. M. Kanevski 68
  69. 69. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.97, Ro 0.69 Prof. M. Kanevski 69
  70. 70. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.61, Ro 0.80 Prof. M. Kanevski 70
  71. 71. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.67, Ro 0.79 Prof. M. Kanevski 71
  72. 72. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.10, Ro 0.92 Prof. M. Kanevski 72
  73. 73. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 0.83, Ro 0.95 Prof. M. Kanevski 73
  74. 74. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 0.55, Ro 0.98 Prof. M. Kanevski 74
  75. 75. MLP modeling. Training statistics. [Charts: training RMSE and Ro for MLP architectures 5, 10, 5-5, 10-10, 15-15, 20-20] Model 20-20 is the best? Prof. M. Kanevski 75
  76. 76. MLP modeling. Training statistics (MLP: RMSE, Ro): 5: 1.97, 0.69; 10: 1.61, 0.80; 5-5: 1.67, 0.79; 10-10: 1.10, 0.92; 15-15: 0.83, 0.95; 20-20: 0.55, 0.98 Prof. M. Kanevski 76
  77. 77. MLP modeling. Training & Validation statistics. [Charts: training and validation RMSE and Ro for MLP architectures 5, 10, 5-5, 10-10, 15-15, 20-20] Prof. M. Kanevski 77
  78. 78. MLP modeling. Training & Validation statistics. [Same charts repeated] Prof. M. Kanevski 78
  79. 79. MLP modeling. Validation statistics (MLP: RMSE, Ro): 5: 2.01, 0.68; 10: 1.66, 0.80; 5-5: 1.70, 0.79; 10-10: 1.25, 0.89; 15-15: 1.24, 0.89; 20-20: 1.39, 0.88 Prof. M. Kanevski 79
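Model selection then amounts to picking the architecture with the best validation error rather than the best training error. Using only the numbers copied from the two tables above:

```python
# Training and validation RMSE per MLP architecture, copied from the slides
train_rmse = {"5": 1.97, "10": 1.61, "5-5": 1.67, "10-10": 1.10, "15-15": 0.83, "20-20": 0.55}
val_rmse   = {"5": 2.01, "10": 1.66, "5-5": 1.70, "10-10": 1.25, "15-15": 1.24, "20-20": 1.39}

print(min(train_rmse, key=train_rmse.get))   # "20-20": lowest training RMSE ...
print(min(val_rmse, key=val_rmse.get))       # ... but "15-15" has the lowest validation RMSE
```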
  80. 80. ANNEX model: Artificial Neural Networks with External drift for environmental data mapping Prof. M. Kanevski 80
  81. 81. Traditional application of ANN to spatial predictions. Data are available at measurement points: F(xi, yi), for i = 1, ..., N. Problem: predict F(x, y) at the points without measurements, usually on a regular grid. ANN solution: x, y - 2 inputs, F - output; select the ANN architecture; train with the available data; after training, use the network to predict. Prof. M. Kanevski 81
  82. 82. ANNEX is similar to the "Kriging with External Drift" model: if there is additional information (available at training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN. Inputs: x, y, + fext(x, y) Prof. M. Kanevski 82
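A sketch of how the ANNEX input matrix differs from the classical ANN one; all arrays below are hypothetical stand-ins (in the case study that follows, f_ext would be the DEM altitude at the stations):

```python
import numpy as np

# Hypothetical measurement data: station coordinates, external drift and primary variable
x, y = np.random.default_rng(0).uniform(0, 100, (2, 400))                # station coordinates
f_ext = 0.01 * x + np.sin(y / 10)                                        # e.g. DEM altitude
F = 25.0 - 0.5 * f_ext + np.random.default_rng(1).normal(0, 0.3, 400)    # primary variable

X_ann   = np.column_stack([x, y])          # classical ANN: 2 inputs (x, y)
X_annex = np.column_stack([x, y, f_ext])   # ANNEX:         3 inputs (x, y, f_ext(x, y))
print(X_ann.shape, X_annex.shape)          # (400, 2) (400, 3)
```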
  83. 83. Examples of external information: • Cheap information on a secondary variable • Physical model of the phenomena • Remotely sensed images • GIS data • DEM data Prof. M. Kanevski 83
  84. 84. Kriging with external drift. Kriging with external drift is the model where the trend is limited to E{F(x,y)} = m(x,y) = λ0 + λ1 fext(x,y) (1), where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated. In general, kriging with an external drift is a simple and efficient algorithm to incorporate a secondary variable in the estimation of the primary variable. Prof. M. Kanevski 84
  85. 85. ANNEX model. What relationship between the primary and the external information should there be in the case of ANNEX? Prof. M. Kanevski 85
  86. 86. ANNEX model. What does external "related" information bring (and how to measure it: correlation between variables?)? Improved accuracy of prediction? Reduced uncertainty of prediction? An important problem is related to the quality of the additional data: there is a dilemma between introducing new information and/or new noise. Prof. M. Kanevski 86
  87. 87. Case study: Kazakh Priaralie, monitoring network 1 400 000 km2 - 400 monitoring stations 87 Prof. M. Kanevski
  88. 88. Datasets: GIS DEM model; average long-term temperatures of air in June (°C) Prof. M. Kanevski 88
  89. 89. Correlation: Air temperature vs. Altitude Prof. M. Kanevski 89
  90. 90. Train and Test datasets Train Test Prof. M. Kanevski 90
  91. 91. ANN and ANNEX models (Model: Correlation, RMSE, MAE, MRE): 2-7-5-1: 0.917, 2.57, 1.96, -0.02; 3-3-1: 0.989, 0.96, 0.73, -0.01; 3-5-1: 0.99, 0.9, 0.7, -0.007; 3-7-1: 0.991, 0.85, 0.66, -0.004; 3-8-1: 0.991, 0.84, 0.68, -0.001; 3-9-1: 0.991, 0.88, 0.69, -0.01; 3-10-1: 0.99, 0.92, 0.74, -0.01; Kriging with external drift: 0.984, 1.19, 0.91, -0.03 Prof. M. Kanevski 91
  92. 92. Scatter plots: Kriging, Cokriging, Drift Kriging, ANNEX Prof. M. Kanevski 92
  93. 93. Mapping results: Kriging, Cokriging, Drift Kriging, ANNEX Prof. M. Kanevski 93
  94. 94. Modelling the noisy "altitude" effect (100 %): Before / After Prof. M. Kanevski 94
  95. 95. Scatter plots between variables (noisy 100 % altitude): Train / Test Prof. M. Kanevski 95
  96. 96. Mapping noise results: ANNEX, air temperature (°C) Prof. M. Kanevski 96
  97. 97. Noise results (Model: Correlation, RMSE, MAE, MRE): Kriging: 0.874, 3.13, 2.04, -0.06; Kriging - external drift: 0.984, 1.19, 0.91, -0.03; 3-7-1: 0.991, 0.85, 0.66, -0.004; 3-8-1: 0.991, 0.84, 0.68, -0.001; 3-8-1 (100% noise): 0.839, 3.54, 2.37, -0.13; 3-7-1 (10% noise) Test 1: 0.939, 2.32, -1.49, -0.003; Kriging - external drift (10% noise) Test 1: 0.941, 2.23, 1.54, -0.06; 3-7-1 (10% noise) Test 2: 0.899, 2.81, 1.52, -0.08; Kriging - external drift (10% noise) Test 2: 0.903, 2.81, 1.59, -0.103 Prof. M. Kanevski 97
  98. 98. MLP: real case study. Wind fields in Switzerland Prof. M. Kanevski 98
  99. 99. Modeling of wind fields with MLP using a regularization technique (pp 168-172 of the book). Monitoring network: 111 stations in Switzerland (80 training + 31 for validation). Mapping of daily: • Mean speed • Maximum gust • Average direction Prof. M. Kanevski 99
  100. 100. Modeling of wind fields with MLP and a regularization technique. Monitoring network: 111 stations in Switzerland (80 training + 31 for validation). Mapping of daily: • Mean speed • Maximum gust • Average direction. Input information: X, Y geographical coordinates; DEM (resolution 500 m); 23 DEM-based « geo-features »; total 26 features. Model: MLP 26-20-20-3 Prof. M. Kanevski 100
  101. 101. Training of the MLP. Model: MLP 26-20-20-3. Training: • Random initialization • 500 iterations of the RPROP algorithm Prof. M. Kanevski 101
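For orientation only, a 26-20-20-3 architecture of this kind could be sketched with scikit-learn as below. Note the assumptions: scikit-learn's MLPRegressor offers Adam/L-BFGS/SGD solvers rather than the RPROP algorithm used on the slide, and random arrays stand in for the real geo-features and wind targets:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 26))          # 80 training stations x 26 input features (placeholder data)
Y = rng.normal(size=(80, 3))           # 3 outputs: mean speed, max gust, direction (placeholder data)

# Architecture 26-20-20-3; solver is Adam here, not the slide's RPROP
net = MLPRegressor(hidden_layer_sizes=(20, 20), activation="tanh",
                   solver="adam", max_iter=500, random_state=0)
net.fit(X, Y)
print(net.predict(X[:5]).shape)        # (5, 3)
```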
  102. 102. Results: naïve approach Prof. M. Kanevski 102
  103. 103. Results: noise injection regularization Prof. M. Kanevski 103
  104. 104. Results: summary. Noise injection regularization vs. without regularization (overfitting) Prof. M. Kanevski 104
  105. 105. Conclusion • MLP is a nonlinear universal tool for learning from and modeling data. An excellent exploratory tool. • Its application demands deep expert knowledge and experience. Prof. M. Kanevski 105
