
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018

https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.


  1. 1. [course site] Day 2 Lecture 1 Multilayer Perceptron Elisa Sayrol
  2. 2. Acknowledgements Antonio Bonafonte Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University Xavier Giro-i-Nieto xavier.giro@upc.edu
  3. 3. …in our last lecture
  4. 4. Single Neuron Model (Perceptron) The perceptron can address both regression and classification problems, depending on the chosen activation function
  5. 5. Linear Regression (e.g. 1D input, 1D output): f(x) = wx + b
  6. 6. Binary Classification (e.g. 2D input, 1D output): f(x) = g(wTx + b), with the sigmoid as activation g. MultiClass: softmax output
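As a small illustration, the sigmoid and softmax functions these slides refer to, written out in NumPy (a sketch, not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    # Binary classification: squashes w^T x + b into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multiclass: turns a vector of scores into a probability distribution
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                       # 0.5, the decision threshold
print(softmax(np.array([2.0, 1.0, 0.1]))) # sums to 1
```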
  7. 7. Non-linear decision boundaries. Linear models can only produce linear decision boundaries, but real-world data (images, audio, text) often needs a non-linear decision boundary. Learn a suitable representation space from the data by using deep neural networks
  8. 8. Example: XOR. AND and OR can be generated with a single perceptron:
     AND: y1 = g(wTx + b) = g(2·x1 + 2·x2 − 3)
     OR:  y2 = g(wTx + b) = g(2·x1 + 2·x2 − 1)
     Truth tables: AND: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1; OR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→1
  9. 9. Example: XOR. XOR, a non-linearly separable problem, cannot be realized with a single perceptron.
     Truth table: XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
  10. 10. Example: XOR. However... a hidden layer solves it:
     h1 = g(w11·x + b11) = g(−2·x1 + 2·x2 − 1)
     h2 = g(w12·x + b12) = g(2·x1 − 2·x2 − 1)
     y  = g(w2·h + b2)  = g(2·h1 + 2·h2 − 1)
     In the (h1, h2) space, the images of (0,0), (0,1), (1,0), (1,1) become linearly separable.
  11. 11. Example: XOR. However... the hidden units and output compute:
     h1 as a function of (x1,x2): (0,0)→0, (0,1)→1, (1,0)→0, (1,1)→0
     h2 as a function of (x1,x2): (0,0)→0, (0,1)→0, (1,0)→1, (1,1)→0
     y as a function of (h1,h2): (0,0)→0, (0,1)→1, (1,0)→1
     y as a function of (x1,x2): (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
  12. 12. Example: XOR. Finally:
     h1 = g(w11·x + b11) = g(−2·x1 + 2·x2 − 1)
     h2 = g(w12·x + b12) = g(2·x1 − 2·x2 − 1)
     y  = g(w2·h + b2)  = g(2·h1 + 2·h2 − 1)
     Three-layer network: input layer, hidden layer, output layer (2-2-1). Fully connected topology: all neurons in a layer are connected to all neurons in the following layer.
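A minimal NumPy sketch of this 2-2-1 XOR network; a step function is assumed for the activation g in order to reproduce the truth tables exactly:

```python
import numpy as np

def g(z):
    # Step activation: 1 if z > 0, else 0 (assumed for this hand-crafted example)
    return (z > 0).astype(int)

# Weights and biases from the slides: hidden layer (2 units), output layer (1 unit)
W1 = np.array([[-2.0,  2.0],
               [ 2.0, -2.0]])   # rows correspond to h1 and h2
b1 = np.array([-1.0, -1.0])
W2 = np.array([2.0, 2.0])
b2 = -1.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = g(W1 @ np.array(x) + b1)   # hidden representation (h1, h2)
    y = g(W2 @ h + b2)             # output: XOR(x1, x2)
    print(x, "->", int(y))
```

Running it prints 0, 1, 1, 0 for the four inputs, matching the XOR truth table on the previous slides.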
  13. 13. Another Example: Star Region (Univ. Texas) https://www.cs.utexas.edu/~teammco/misc/mlp/
  14. 14. Neural networks. A neural network is simply a composition of simple neurons into several layers. Each neuron computes a linear combination of its inputs, adds a bias, and passes the result through an activation function g(x). The network can contain one or more hidden layers. The outputs of these hidden layers can be thought of as a new representation of the data (new features). The final output is the target variable (y = f(x))
  15. 15. Multilayer perceptrons. When each node in each layer is a linear combination of all outputs from the previous layer, the network is called a multilayer perceptron (MLP). It is a fully connected network computing y = f(x); the number of layers is its depth and the number of units per layer its width. Weights can be organized into matrices, and the forward pass computes, layer by layer:
     h(k) = g(W(k) h(k−1) + b(k))
     g: activation function (e.g. sigmoid); f: output function (e.g. softmax)
  16. 16. Multilayer perceptrons. Forward pass, first unit of layer 1: h11 = g(w1·x + b1), where w1 = (w11, w12, w13, w14) is the first row of the weight matrix W1, x = (x1, x2, x3, x4) is the input (layer 0), and b1 is the first bias. The example network has layers 0 to 3, with outputs y1, y2.
  17. 17. Multilayer perceptrons. Forward pass, second unit of layer 1: h12 = g(w2·x + b2), with w2 the second row of W1. Repeating this for every unit gives the full activation vector of layer 1, and the same operation is applied layer by layer up to the output.
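A minimal NumPy sketch of this forward pass for a small 4-4-4-2 fully connected network like the one drawn on these slides; the random weights, zero biases and the use of a sigmoid at every layer (rather than a softmax output) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Compute h(k) = g(W(k) h(k-1) + b(k)) layer by layer
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [4, 4, 4, 2]   # layer 0 (input) ... layer 3 (output)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = np.array([0.5, -1.0, 0.3, 0.8])   # example input x1..x4
print(forward(x, weights, biases))     # two outputs y1, y2
```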
  18. 18. Universal approximation theorem. The universal approximation theorem states that “the standard multilayer feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of Rn, under mild assumptions on the activation function.” If a 2-layer NN is a universal approximator, then why do we need deep nets? The universal approximation theorem: Says nothing about how easy or difficult it is to fit such approximators. Needs a “finite number of hidden neurons”: finite may be extremely large. In practice, deep nets can usually represent more complex functions with fewer total neurons (and therefore fewer parameters)
  19. 19. …Learning
  20. 20. Linear regression – Loss Function. The loss function is the squared (Euclidean) loss between the predictions f(x) = wx + b and the targets y.
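For reference, the squared loss over N training pairs in its standard form (not written out explicitly on the slide):

```latex
\mathcal{L}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w^\top x_i + b) \right)^2
```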
  21. 21. Logistic regression. The activation function is the sigmoid and the loss function is cross entropy. The decision boundary is the line where g(wTx + b) = 1/2: points with g(wTx + b) > 1/2 are assigned class 1, points with g(wTx + b) < 1/2 class 0.
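The binary cross-entropy loss this refers to, in its standard form (added here for reference, not taken verbatim from the slide):

```latex
\mathcal{L}(w, b) = -\frac{1}{N} \sum_{i=1}^{N}
  \left[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\right],
\qquad \hat{y}_i = \sigma(w^\top x_i + b)
```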
  22. 22. Fitting linear models (e.g. linear regression). We need to optimize the loss L. Gradient descent follows the tangent (slope) of the loss function, moving the parameter from wt to wt+1 with learning rate a (aka step size).
  23. 23. Training. Estimate the parameters θ(t) = (W(t), b(t)) from training examples, given a loss function: θ* = argmin over θ of L(fθ(x), y). • Basic idea: gradient descent, iteratively adapting each parameter. • Dependencies are very complex: reaching the global minimum is challenging, but local minima can be good enough. • Initialization influences the solutions.
  24. 24. Training. Gradient descent: move each parameter θ in small steps in the direction opposite to the derivative of the loss with respect to θ: θ(t) = θ(t−1) − α(t−1) · ∇θ L(y, ŷ). Stochastic gradient descent (SGD): estimate the gradient with one sample or, better, with a minibatch of examples. Several strategies have been proposed to update the weights (Adam, RMSProp, Adamax, etc.), known as optimizers.
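A minimal NumPy sketch of minibatch gradient descent on the squared loss for 1D linear regression; the synthetic data, learning rate, batch size and iteration count are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=200)   # synthetic targets: y = 3x + 0.5 + noise

w, b = 0.0, 0.0
alpha = 0.1        # learning rate (step size)
batch_size = 32

for step in range(500):
    idx = rng.choice(len(x), size=batch_size, replace=False)   # sample a minibatch
    xb, yb = x[idx], y[idx]
    y_hat = w * xb + b
    # Gradients of the mean squared loss with respect to w and b
    grad_w = -2.0 * np.mean((yb - y_hat) * xb)
    grad_b = -2.0 * np.mean(yb - y_hat)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)   # should approach 3.0 and 0.5
```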
  25. 25. Gradient descent examples Linear regression http://nbviewer.jupyter.org/github/kevinmcguinness/ml-examples/blob/master/notebooks/GD_Regression.ipynb https://github.com/kevinmcguinness/ml-examples/blob/master/notebooks/GD_Regression.ipynb Logistic regression http://nbviewer.jupyter.org/github/kevinmcguinness/ml-examples/blob/master/notebooks/GD_Classification.ipynb https://github.com/kevinmcguinness/ml-examples/blob/master/notebooks/GD_Classification.ipynb
  26. 26. MNIST Example. Handwritten digits • 60,000 training examples • 10,000 test examples • 10 classes (digits 0-9) • 28x28 grayscale images (784 pixels) • http://yann.lecun.com/exdb/mnist/ The objective is to learn a function that predicts the digit from the image
  27. 27. MNIST Example. Model • 3-layer neural network (2 hidden layers) • Tanh units (activation function) • 512-512-10 • Softmax on top layer • Cross-entropy loss
  28. 28. MNIST Example. Training • 40 epochs using mini-batch SGD • Size of the mini-batch: 128 • Learning rate: 0.1 (fixed) • Takes 5 minutes to train on GPU. Accuracy results • 98.12% (188 errors in 10,000 test examples); there are ways to improve accuracy… Metrics: Accuracy = (TP + TN) / (TP + TN + FP + FN); there are other metrics…
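A minimal Keras sketch of the 512-512-10 model and training setup described above; the framework, preprocessing and exact API calls are assumptions, since the lecture does not specify an implementation:

```python
from tensorflow import keras

# Load and flatten MNIST: 60,000 train / 10,000 test images of 28x28 = 784 pixels
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# 3-layer MLP: 512-512-10, tanh hidden units, softmax output
model = keras.Sequential([
    keras.layers.Dense(512, activation="tanh", input_shape=(784,)),
    keras.layers.Dense(512, activation="tanh"),
    keras.layers.Dense(10, activation="softmax"),
])

# Cross-entropy loss, plain SGD with fixed learning rate 0.1, minibatches of 128, 40 epochs
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=40,
          validation_data=(x_test, y_test))
```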
  29. 29. Summary • Multilayer Perceptron Networks allow us to build non-linear decision boundaries • Multilayer Perceptron Networks are composed of the input layer, hidden layers and the output layer. All neurons in one layer are connected to all neurons from the previous layer and the layer that follows • Multilayer Perceptron Networks have a large number of parameters that have to be estimated through training with the goal of minimizing a given loss function • With Multilayer Perceptrons we need to find the gradient of the loss function with respect to all the parameters of the model (W(t), b(t))
  30. 30. Assignment D2L2.1 Given the following network to obtain an XNOR operation, indicate which parameters are correct: ● w111=-2, w112=2, w121=2, w122=-2, b1=-1, w211=2, w221=2, b2=-1 ● w111=-2, w112=2, w121=2, w122=-2, b1=-1, w211=2, w221=2, b2=1 ● w111=-2, w112=2, w121=2, w122=-2, b1=-1, w211=-2, w221=-2, b2=1 ● w111=-2, w112=2, w121=2, w122=-2, b1=-1, w211=-2, w221=-2, b2=-1
  31. 31. Assignment D2L1.2 Given the following fully connected network, with an input of 256 elements, 2 hidden layers and an output layer, how many parameters do you need to estimate?
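The slide's figure with the layer sizes is not reproduced here, so the sketch below only illustrates the counting rule for a fully connected network: each layer contributes (inputs + 1) × outputs parameters (weights plus biases). The hidden-layer sizes in the example call are placeholders, not the assignment's values:

```python
def count_parameters(layer_sizes):
    # layer_sizes = [inputs, hidden_1, ..., output]
    # Each fully connected layer has n_in * n_out weights plus n_out biases.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical sizes: 256 inputs, two hidden layers, one output layer
print(count_parameters([256, 128, 64, 10]))
```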
