https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
7. Non-linear decision boundaries
Linear models can only produce linear decision boundaries.
Real-world data often needs a non-linear decision boundary:
images, audio, text.
Learn a suitable representation space from the data by using deep neural networks.
8. Example: XOR
AND and OR can be generated with a single perceptron
[Figure: single-perceptron diagrams. AND: inputs x1, x2 with weights 2 and 2, bias -3, activation g, output y1. OR: inputs x1, x2 with weights 2 and 2, bias -1, activation g, output y2.]

$y_1 = g(\mathbf{w}^\top \mathbf{x} + b) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 3\right)$ (AND)

$y_2 = g(\mathbf{w}^\top \mathbf{x} + b) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$ (OR)
Input vector (x1, x2)    Class (OR)
(0,0)                    0
(0,1)                    1
(1,0)                    1
(1,1)                    1

Input vector (x1, x2)    Class (AND)
(0,0)                    0
(0,1)                    0
(1,0)                    0
(1,1)                    1
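The single-perceptron solutions above can be checked directly. A minimal sketch (not from the slides), assuming a step activation g that outputs 1 when its input is positive and 0 otherwise:

```python
# Verify the AND and OR perceptrons from the slide (weights 2, 2; biases -3 and -1).
import numpy as np

def g(z):
    """Step activation: 1 if the pre-activation is positive, else 0."""
    return (z > 0).astype(int)

def perceptron(x, w, b):
    return g(np.dot(w, x) + b)

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        y_and = perceptron(x, np.array([2, 2]), -3)  # AND unit
        y_or = perceptron(x, np.array([2, 2]), -1)   # OR unit
        print(f"({x1},{x2}) -> AND={y_and}, OR={y_or}")
```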
9. Example: XOR
XOR is a non-linearly separable problem and cannot be generated with a single perceptron.
[Figure: the four XOR input points plotted in the (x1, x2) plane; no single line separates class 0 from class 1.]
Input vector (x1, x2)    Class (XOR)
(0,0)                    0
(0,1)                    1
(1,0)                    1
(1,1)                    0
12. Example: XOR. Finally
[Figure: the 2-2-1 network that solves XOR. Inputs x1, x2 feed hidden units h1 (weights -2, 2; bias -1) and h2 (weights 2, -2; bias -1); h1 and h2 feed the output unit y (weights 2, 2; bias -1); each unit applies the activation g. Layers: input layer, hidden layer, output layer.]

$h_1 = g(\mathbf{w}_{11}^\top \mathbf{x} + b_{11}) = g\left(\begin{bmatrix} -2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$

$h_2 = g(\mathbf{w}_{12}^\top \mathbf{x} + b_{12}) = g\left(\begin{bmatrix} 2 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right)$

$y = g(\mathbf{w}_{2}^\top \mathbf{h} + b_{2}) = g\left(\begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} - 1\right)$
Three-layer network:
- Input layer
- Hidden layer
- Output layer
2-2-1 fully connected topology (all neurons in a layer are connected to all neurons in the following layer)
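As with the AND/OR case, the weights above can be verified directly. A minimal sketch (not from the slides), again assuming a step activation g:

```python
# Verify that the 2-2-1 network from the slide implements XOR.
import numpy as np

def g(z):
    return (z > 0).astype(int)  # step activation

W1 = np.array([[-2, 2],    # hidden unit h1
               [ 2, -2]])  # hidden unit h2
b1 = np.array([-1, -1])
W2 = np.array([2, 2])      # output unit y
b2 = -1

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2])
        h = g(W1 @ x + b1)   # hidden layer
        y = g(W2 @ h + b2)   # output layer
        print(f"({x1},{x2}) -> XOR={y}")
```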
13. Another Example: Star Region (Univ. Texas)
https://www.cs.utexas.edu/~teammco/misc/mlp/
14. Neural networks
A neural network is simply a composition of simple neurons arranged in several layers.
Each neuron computes a linear combination of its inputs, adds a bias, and passes the result through an activation function g(x).
The network can contain one or more hidden layers. The outputs of these hidden layers can be thought of as a new representation of the data (new features).
The final output is the target variable (y = f(x)).
15. Multilayer perceptrons
When each node in each layer is a linear combination of all the outputs of the previous layer, the network is called a multilayer perceptron (MLP).
Weights can be organized into matrices. The forward pass computes:

$\mathbf{h}^{(i)} = g\left(\mathbf{W}^{(i)} \mathbf{h}^{(i-1)} + \mathbf{b}^{(i)}\right)$

g: activation function, e.g. sigmoid. f: output function, e.g. softmax.

[Figure: fully connected network of a given depth (number of layers) and width (neurons per layer) computing y = f(x).]
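To make the forward pass concrete, here is a minimal NumPy sketch (not from the slides; the layer sizes and random initialization are illustrative, and a sigmoid is used everywhere for simplicity):

```python
# Forward pass h^(i) = g(W^(i) h^(i-1) + b^(i)) for a small fully connected network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 2]   # input width 4, two hidden layers, output width 2

# One weight matrix and one bias vector per layer
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws, bs):
        h = sigmoid(W @ h + b)   # h^(i) = g(W^(i) h^(i-1) + b^(i))
    return h

print(forward(rng.normal(size=4)))
```

In practice the last layer would apply the output function f (e.g. a softmax for classification) instead of the hidden-layer activation.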
18. Universal approximation theorem
The universal approximation theorem states that "the standard multilayer feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of Rn, under mild assumptions on the activation function."
If a 2-layer NN is a universal approximator, then why do we need deep nets?
The universal approximation theorem:
- says nothing about how easy or difficult it is to fit such an approximator
- needs a "finite number of hidden neurons": finite may be extremely large
In practice, deep nets can usually represent more complex functions with fewer total neurons (and therefore fewer parameters).
22. Fitting linear models
E.g. linear regression: we need to optimize a loss function L, typically via gradient descent.
[Figure: loss function L versus a weight w; the tangent line at w_t gives the slope used to step from w_t to w_{t+1}.]
α: learning rate (aka step size)
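A minimal sketch of gradient descent on a linear-regression loss (not from the slides; the synthetic data and hyperparameters are illustrative):

```python
# Gradient descent on the mean squared error of a 1-D linear regression y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)   # ground truth: w = 3, b = 1

w, b = 0.0, 0.0
alpha = 0.1                                      # learning rate (step size)
for _ in range(200):
    y_hat = w * x + b
    grad_w = np.mean(2 * (y_hat - y) * x)        # dL/dw for L = mean((y_hat - y)^2)
    grad_b = np.mean(2 * (y_hat - y))            # dL/db
    w -= alpha * grad_w                          # step against the gradient
    b -= alpha * grad_b

print(w, b)   # approaches 3 and 1
```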
23. Training
Estimate the parameters θ = (W(t), b(t)) from training examples, given a loss function ℒ:

$\theta^{*} = \arg\min_{\theta} \mathcal{L}\left(f_{\theta}(x), y\right)$

• Iteratively adapt each parameter. Basic idea: gradient descent.
• Dependencies are very complex. Reaching the global minimum is challenging; local minima can be good enough.
• Initialization influences the solution.
24. Training
Gradient descent: move each parameter θ in small steps in the direction opposite to the derivative of the loss with respect to θ:

$\theta^{(t)} = \theta^{(t-1)} - \alpha^{(t-1)} \, \nabla_{\theta} \mathcal{L}\left(y, f_{\theta}(x)\right)$

Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a mini-batch of examples.
Several strategies have been proposed to update the weights: Adam, RMSProp, Adamax, etc. These are known as optimizers.
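A minimal mini-batch SGD sketch (not from the slides), reusing the linear-regression loss from the previous example so that only the gradient estimation changes:

```python
# Mini-batch SGD: theta <- theta - alpha * gradient estimated on a small batch.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)          # theta = (w, b)
alpha = 0.1                  # learning rate
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(x))               # shuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]     # indices of one mini-batch
        xb, yb = x[idx], y[idx]
        err = theta[0] * xb + theta[1] - yb
        grad = np.array([np.mean(2 * err * xb),   # gradient estimated on the batch
                         np.mean(2 * err)])
        theta -= alpha * grad                     # SGD update

print(theta)   # approaches (3, 1)
```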
26. MNIST Example
Handwritten digits
• 60,000 training examples
• 10,000 test examples
• 10 classes (digits 0-9)
• 28×28 grayscale images (784 pixels)
• http://yann.lecun.com/exdb/mnist/
The objective is to learn a function that predicts the digit from the image
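A data-loading sketch for this example (not from the slides; it assumes the Keras built-in MNIST loader rather than the raw files from the URL above):

```python
# Load MNIST and flatten each 28x28 image into a 784-dimensional vector in [0, 1].
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784).astype("float32") / 255.0
x_test = x_test.reshape(10000, 784).astype("float32") / 255.0
print(x_train.shape, y_train.shape)   # (60000, 784) (60000,)
```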
27. MNIST Example
Model
• 3-layer neural network (2 hidden layers)
• Tanh units (activation function)
• 512-512-10
• Softmax on top layer
• Cross-entropy loss
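A sketch of this model in Keras (not from the slides; the Sequential/Dense formulation is an assumption about how the 512-512-10 topology would be written):

```python
# 512-512-10 fully connected network with tanh hidden units and a softmax output.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(512, activation="tanh", input_shape=(784,)),  # hidden layer 1
    Dense(512, activation="tanh"),                       # hidden layer 2
    Dense(10, activation="softmax"),                     # output layer: 10 classes
])
```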
28. MNIST Example
Training
• 40 epochs using mini-batch SGD
• Size of the mini-batch: 128
• Learning rate: 0.1 (fixed)
• Takes 5 minutes to train on a GPU
Accuracy Results
• 98.12% (188 errors in 10,000 test examples)
there are ways to improve accuracy…
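Continuing the sketches above (same assumptions: `model`, `x_train`, `y_train`, `x_test`, `y_test` as defined earlier; integer labels, hence the sparse cross-entropy), the training configuration from this slide would look like:

```python
# Mini-batch SGD, fixed learning rate 0.1, batch size 128, 40 epochs.
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=128, epochs=40)
loss, acc = model.evaluate(x_test, y_test)   # test-set accuracy
```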
Metrics
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
there are other metrics….
29. Summary
• Multilayer perceptron networks allow us to build non-linear decision boundaries.
• Multilayer perceptron networks are composed of an input layer, hidden layers and an output layer. All neurons in one layer are connected to all neurons in the previous layer and in the layer that follows.
• Multilayer perceptron networks have a large number of parameters that have to be estimated through training, with the goal of minimizing a given loss function.
• With multilayer perceptrons we need to find the gradient of the loss function with respect to all the parameters of the model (W(t), b(t)).
30. Assignment D2L2.1
Given the following network to obtain an XNOR operation, indicate which parameters are correct:
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = 2, w221 = 2, b2 = -1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = 2, w221 = 2, b2 = 1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = -2, w221 = -2, b2 = 1
● w111 = -2, w112 = 2, w121 = 2, w122 = -2, b1 = -1, w211 = -2, w221 = -2, b2 = -1
31. Assignment D2L1.2
Given the following fully connected network, with an input of 256 elements, 2 hidden layers and an output layer, how many parameters do you need to estimate?