https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
7. Non-linear decision boundaries
Linear models can only produce linear decision boundaries.
Real-world data often needs a non-linear decision boundary:
images, audio, text.
Learn a suitable representation space from the data by using deep neural networks.
8. Example: XOR
AND and OR can be generated with a single perceptron
[Figure: single-perceptron diagrams. AND: inputs x1, x2 with weights 2 and 2, bias -3, activation g, output y1. OR: inputs x1, x2 with weights 2 and 2, bias -1, activation g, output y2.]

$y_1 = g(\mathbf{w}^\top \mathbf{x} + b) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 3\right)$ (AND)

$y_2 = g(\mathbf{w}^\top \mathbf{x} + b) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$ (OR)
Input vector (x1, x2)    Class (OR)
(0,0)                    0
(0,1)                    1
(1,0)                    1
(1,1)                    1

Input vector (x1, x2)    Class (AND)
(0,0)                    0
(0,1)                    0
(1,0)                    0
(1,1)                    1
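The single-perceptron solutions above can be checked directly. A minimal sketch (not from the slides), assuming a step activation g that outputs 1 when its input is positive and 0 otherwise:

```python
# Verify the AND and OR perceptrons from the slide (weights 2, 2; biases -3 and -1).
import numpy as np

def g(z):
    """Step activation: 1 if the pre-activation is positive, else 0."""
    return (z > 0).astype(int)

def perceptron(x, w, b):
    return g(np.dot(w, x) + b)

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        y_and = perceptron(x, np.array([2, 2]), -3)  # AND unit
        y_or = perceptron(x, np.array([2, 2]), -1)   # OR unit
        print(f"({x1},{x2}) -> AND={y_and}, OR={y_or}")
```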
9. Example: XOR
XOR is a non-linearly separable problem and cannot be generated with a single perceptron.
[Figure: the four XOR input points plotted in the (x1, x2) plane; no single line separates class 0 from class 1.]
Input vector (x1, x2)    Class (XOR)
(0,0)                    0
(0,1)                    1
(1,0)                    1
(1,1)                    0
12. Example: XOR. Finally
[Figure: the 2-2-1 network that solves XOR. Inputs x1, x2 feed hidden units h1 (weights -2, 2; bias -1) and h2 (weights 2, -2; bias -1); h1 and h2 feed the output unit y (weights 2, 2; bias -1); each unit applies the activation g. Layers: input layer, hidden layer, output layer.]

$h_1 = g(\mathbf{w}_{11}^\top \mathbf{x} + b_{11}) = g\left(\begin{bmatrix} -2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$

$h_2 = g(\mathbf{w}_{12}^\top \mathbf{x} + b_{12}) = g\left(\begin{bmatrix} 2 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$

$y = g(\mathbf{w}_{2}^\top \mathbf{h} + b_{2}) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} - 1\right)$
Three-layer network:
- Input layer
- Hidden layer
- Output layer
2-2-1 fully connected topology (all neurons in a layer are connected to all neurons in the following layer)
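As with the AND/OR case, the weights above can be verified directly. A minimal sketch (not from the slides), again assuming a step activation g:

```python
# Verify that the 2-2-1 network from the slide implements XOR.
import numpy as np

def g(z):
    return (z > 0).astype(int)  # step activation

W1 = np.array([[-2, 2],    # hidden unit h1
               [ 2, -2]])  # hidden unit h2
b1 = np.array([-1, -1])
W2 = np.array([2, 2])      # output unit y
b2 = -1

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        h = g(W1 @ x + b1)   # hidden layer
        y = g(W2 @ h + b2)   # output layer
        print(f"({x1},{x2}) -> XOR={y}")
```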
13. Another Example: Star Region (Univ. Texas)
https://www.cs.utexas.edu/~teammco/misc/mlp/
14. Neural networks
A neural network is simply a composition of simple neurons arranged in several layers.
Each neuron computes a linear combination of its inputs, adds a bias, and passes the result through an activation function g(x).
The network can contain one or more hidden layers. The outputs of these hidden layers can be thought of as a new representation of the data (new features).
The final output is the target variable (y = f(x)).
15. Multilayer perceptrons
When each node in each layer is a linear combination of all the outputs of the previous layer, the network is called a multilayer perceptron (MLP).
Weights can be organized into matrices. The forward pass computes:

$\mathbf{h}^{(i)} = g\left(\mathbf{W}^{(i)} \mathbf{h}^{(i-1)} + \mathbf{b}^{(i)}\right)$

g: activation function, e.g. sigmoid. f: output function, e.g. softmax.

[Figure: fully connected network of a given depth (number of layers) and width (neurons per layer) computing y = f(x).]
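To make the forward pass concrete, here is a minimal NumPy sketch (not from the slides; the layer sizes and random initialization are illustrative, and a sigmoid is used everywhere for simplicity):

```python
# Forward pass h^(i) = g(W^(i) h^(i-1) + b^(i)) for a small fully connected network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 2]   # input width 4, two hidden layers, output width 2

# One weight matrix and one bias vector per layer
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws, bs):
        h = sigmoid(W @ h + b)   # h^(i) = g(W^(i) h^(i-1) + b^(i))
    return h

print(forward(rng.normal(size=4)))
```

In practice the last layer would apply the output function f (e.g. a softmax for classification) instead of the hidden-layer activation.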
18. Universal approximation theorem
The universal approximation theorem states that "the standard multilayer feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of Rn, under mild assumptions on the activation function."
If a 2-layer NN is a universal approximator, then why do we need deep nets?
The universal approximation theorem:
- says nothing about how easy or difficult it is to fit such an approximator
- needs a "finite number of hidden neurons": finite may be extremely large
In practice, deep nets can usually represent more complex functions with fewer total neurons (and therefore fewer parameters).
22. Fitting linear models
E.g. linear regression: we need to optimize a loss function L, typically via gradient descent.
[Figure: loss function L versus a weight w; the tangent line at w_t gives the slope used to step from w_t to w_{t+1}.]
α: learning rate (aka step size)
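A minimal sketch of gradient descent on a linear-regression loss (not from the slides; the synthetic data and hyperparameters are illustrative):

```python
# Gradient descent on the mean squared error of a 1-D linear regression y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)   # ground truth: w = 3, b = 1

w, b = 0.0, 0.0
alpha = 0.1                                      # learning rate (step size)
for _ in range(200):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)        # dL/dw for L = mean((y_hat - y)^2)
    grad_b = np.mean(2 * (y_hat - y))            # dL/db
    w -= alpha * grad_w                          # step against the gradient
    b -= alpha * grad_b

print(w, b)   # approaches 3 and 1
```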
23. Training
Estimate the parameters θ = (W(t), b(t)) from training examples, given a loss function ℒ:

$\theta^{*} = \arg\min_{\theta} \mathcal{L}\left(f_{\theta}(x), y\right)$

• Iteratively adapt each parameter. Basic idea: gradient descent.
• Dependencies are very complex. Reaching the global minimum is challenging; local minima can be good enough.
• Initialization influences the solution.
24. Training
Gradient descent: move each parameter θ in small steps in the direction opposite to the derivative of the loss with respect to θ:

$\theta^{(t)} = \theta^{(t-1)} - \alpha^{(t-1)} \, \nabla_{\theta} \mathcal{L}\left(y, f_{\theta}(x)\right)$

Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a mini-batch of examples.
Several strategies have been proposed to update the weights: Adam, RMSProp, Adamax, etc. These are known as optimizers.
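A minimal mini-batch SGD sketch (not from the slides), reusing the linear-regression loss from the previous example so that only the gradient estimation changes:

```python
# Mini-batch SGD: theta <- theta - alpha * gradient estimated on a small batch.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)          # theta = (w, b)
alpha = 0.1                  # learning rate
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(x))               # shuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]     # indices of one mini-batch
        xb, yb = x[idx], y[idx]
        err = theta[0] * xb + theta[1] - yb
        grad = np.array([np.mean(2 * err * xb),   # gradient estimated on the batch
                         np.mean(2 * err)])
        theta -= alpha * grad                     # SGD update

print(theta)   # approaches (3, 1)
```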
26. MNIST Example
Handwritten digits
• 60,000 training examples
• 10,000 test examples
• 10 classes (digits 0-9)
• 28×28 grayscale images (784 pixels)
• http://yann.lecun.com/exdb/mnist/
The objective is to learn a function that predicts the digit from the image
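A data-loading sketch for this example (not from the slides; it assumes the Keras built-in MNIST loader rather than the raw files from the URL above):

```python
# Load MNIST and flatten each 28x28 image into a 784-dimensional vector in [0, 1].
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255.0
x_test = x_test.reshape(10000, 784).astype("float32") / 255.0
print(x_train.shape, y_train.shape)   # (60000, 784) (60000,)
```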
27. MNIST Example
Model
• 3-layer neural network (2 hidden layers)
• Tanh units (activation function)
• 512-512-10
• Softmax on top layer
• Cross-entropy loss
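A sketch of this model in Keras (not from the slides; the Sequential/Dense formulation is an assumption about how the 512-512-10 topology would be written):

```python
# 512-512-10 fully connected network with tanh hidden units and a softmax output.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(512, activation="tanh", input_shape=(784,)),  # hidden layer 1
    Dense(512, activation="tanh"),                       # hidden layer 2
    Dense(10, activation="softmax"),                     # output layer: 10 classes
])
```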
28. MNIST Example
Training
• 40 epochs using mini-batch SGD
• Size of the mini-batch: 128
• Learning rate: 0.1 (fixed)
• Takes 5 minutes to train on a GPU
Accuracy Results
• 98.12% (188 errors in 10,000 test examples)
there are ways to improve accuracy…
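Continuing the sketches above (same assumptions: `model`, `x_train`, `y_train`, `x_test`, `y_test` as defined earlier; integer labels, hence the sparse cross-entropy), the training configuration from this slide would look like:

```python
# Mini-batch SGD, fixed learning rate 0.1, batch size 128, 40 epochs.
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=128, epochs=40)
loss, acc = model.evaluate(x_test, y_test)   # test-set accuracy
```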
Metrics
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
there are other metrics….
29. Summary
• Multilayer perceptron networks allow us to build non-linear decision boundaries.
• Multilayer perceptron networks are composed of an input layer, hidden layers and an output layer. All neurons in one layer are connected to all neurons in the previous layer and in the layer that follows.
• Multilayer perceptron networks have a large number of parameters that have to be estimated through training, with the goal of minimizing a given loss function.
• With multilayer perceptrons we need to find the gradient of the loss function with respect to all the parameters of the model (W(t), b(t)).
30. Assignment D2L2.1
Given the following network to obtain an XNOR operation, indicate which parameters are correct:
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = 2, w221 = 2, b2 = -1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = 2, w221 = 2, b2 = 1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = -2, w221 = -2, b2 = 1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = -2, w221 = -2, b2 = -1
31. Assignment D2L1.2
Given the following fully connected network, with an input of 256 elements, 2 hidden layers and an output layer, how many parameters do you need to estimate?