Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand-new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.
Non-linear decision boundaries
Linear models can only produce linear decision boundaries.
Real-world data often needs a non-linear decision boundary.
Deep Neural Networks learn a suitable representation space from the data.
AND and OR can be generated with a single perceptron
y₁ = g(wᵀx + b) = σ(2·x₁ + 2·x₂ − 3)   (AND)
y₂ = g(wᵀx + b) = σ(2·x₁ + 2·x₂ − 1)   (OR)
XOR is a non-linearly separable problem and cannot be implemented with a single perceptron.
Example: XOR
h₁ = g(w₁₁ᵀx + b₁₁) = σ(−2·x₁ + 2·x₂ − 1)
h₂ = g(w₁₂ᵀx + b₁₂) = σ(2·x₁ − 2·x₂ − 1)
y = g(w₂ᵀh + b₂) = σ(2·h₁ + 2·h₂ − 1)
Three-layer network:
2-2-1 fully connected topology
(all neurons in one layer are connected to all neurons in the next layer)
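A minimal NumPy sketch of the 2-2-1 XOR network described above. A hard-threshold (step) activation is used here instead of a sigmoid so the outputs are exactly 0/1; the weights (±2) and biases (−1) follow the values used in this example.

```python
import numpy as np

def step(z):
    # Hard-threshold activation: 1 if z > 0, else 0
    return (z > 0).astype(int)

# Hidden layer: h1 fires for (x2 AND NOT x1), h2 for (x1 AND NOT x2)
W1 = np.array([[-2.0,  2.0],
               [ 2.0, -2.0]])
b1 = np.array([-1.0, -1.0])

# Output layer: OR of the two hidden units
W2 = np.array([2.0, 2.0])
b2 = -1.0

def xor_net(x):
    h = step(W1 @ x + b1)
    return step(W2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))  # prints 0, 1, 1, 0
```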
Another Example: Star Region (Univ. Texas)
A neural network is simply a composition of
simple neurons into several layers
Each neuron computes a linear combination of its inputs, adds a bias, and passes the result through an activation function.
The network can contain one or more hidden
layers. The outputs of these hidden layers can
be thought of as a new representation of the
data (new features).
The final output is the target variable (y = f(x))
When each node in each layer is a linear combination of all outputs from the previous layer, the network is called a multilayer perceptron.
Weights can be organized into matrices.
Forward pass computes y = f(x)
g: activation function, e.g. sigmoid; f: output function, e.g. softmax
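The matrix form of the forward pass can be sketched as follows; the layer sizes and random weights are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # g: activation function for the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # f: output function; subtract the max for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative sizes: 4 inputs, 3 hidden units, 2 outputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

def forward(x):
    h = sigmoid(W1 @ x + b1)   # hidden representation (new features)
    y = softmax(W2 @ h + b2)   # class probabilities
    return y

y = forward(rng.normal(size=4))
print(y, y.sum())  # probabilities summing to 1
```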
Universal approximation theorem
Universal approximation theorem states that “the standard multilayer feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of Rⁿ, under mild assumptions on the activation function.”
If a 2-layer NN is a universal approximator, then why do we need deep nets?
The universal approximation theorem:
Says nothing about how easy or difficult it is to fit such approximators
Needs a “finite number of hidden neurons”: finite may be extremely large
In practice, deep nets can usually represent more complex functions with fewer neurons (and
therefore fewer parameters)
Linear regression: the loss function is the squared (Euclidean) loss.
Logistic regression: the activation function is the sigmoid and the loss function is the cross-entropy.
g(wᵀx + b) = ½ → x lies on the decision boundary
g(wᵀx + b) > ½ → predict class 1
g(wᵀx + b) < ½ → predict class 0
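The two losses can be sketched as follows, assuming the usual definitions (mean squared error for regression, binary cross-entropy for a sigmoid output); the example values are illustrative.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # Linear regression: squared (Euclidean) loss
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary classification with a sigmoid output: cross-entropy loss.
    # Clip probabilities to avoid log(0).
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])   # targets
p = np.array([0.9, 0.2, 0.8])   # predicted probabilities
print(squared_loss(y, p))
print(cross_entropy(y, p))
```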
Fitting linear models
E.g. linear regression
Need to optimize L
α: learning rate (aka step size)
Estimate parameters θ = (W, b) from training examples, given a loss function ℒ:
θ* = argminθ ℒ(fθ(x), y)
• Iteratively adapt each parameter
Basic idea: gradient descent.
• Dependencies are very complex.
Global minimum: challenging. Local minima: can be good enough.
• Initialization influences the solution.
Gradient Descent: move the parameter θ in small steps in the direction opposite to the sign of the
derivative of the loss with respect to θ.
θ(t) = θ(t−1) − α(t−1) · ∇θ ℒ(y, ŷ)
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a
minibatch of examples.
Several strategies have been proposed to update the weights (Adam, RMSProp, Adamax, etc.);
these are known as optimizers.
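The update rule with minibatches can be sketched for the linear-regression case; the synthetic data, learning rate, and batch size below are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = 3x + 2 plus noise (illustrative)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0
alpha = 0.1        # learning rate (step size)
batch_size = 32

for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb            # prediction error on the minibatch
        grad_w = 2 * np.mean(err * xb)     # d/dw of the squared loss
        grad_b = 2 * np.mean(err)          # d/db of the squared loss
        w -= alpha * grad_w                # step opposite the gradient
        b -= alpha * grad_b

print(w, b)  # should approach 3 and 2
```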
MNIST dataset:
• 60,000 training examples
• 10,000 test examples
• 10 classes (digits 0-9)
• 28×28 grayscale images (784 pixels)
The objective is to learn a function that predicts the digit from the image
• 3-layer neural network (2 hidden layers)
• Tanh units (activation function)
• Softmax on top layer
• Cross-entropy loss
• 40 epochs using mini-batch SGD
• Size of the mini-batch: 128
• Learning rate: 0.1 (fixed)
• Takes 5 minutes to train on a GPU
• 98.12% accuracy (188 errors in 10,000 test examples)
there are ways to improve accuracy…
Accuracy = (TP + TN) / (TP + TN + FP + FN)
there are other metrics….
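The accuracy formula as code; the TP/TN/FP/FN split below is hypothetical, chosen only to reproduce 188 errors in 10,000 test examples as in the run above.

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical split: 9,812 correct and 188 wrong out of 10,000
print(accuracy(tp=5000, tn=4812, fp=100, fn=88))  # 0.9812
```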
• Multilayer Perceptron Networks allow us to build non-linear decision boundaries
• Multilayer Perceptron Networks are composed of the input layer, hidden layers and the
output layer. All neurons in one layer are connected to all neurons from the previous
layer and the layer that follows
• Multilayer Perceptron Networks have a large number of parameters that have to be
estimated through training, with the goal of minimizing a given loss function
• With Multiple Layer Perceptrons we need to find the gradient of the loss function with
respect to all the parameters of the model (W(t), b(t))
Given the following network implementing an XNOR operation, indicate which set of parameters is correct:
● w111=-2, w112=2, w121=2,w122=-2,b1=-1,w211=2,w221=2,b2=-1
● w111=-2, w112=2, w121=2,w122=-2,b1=-1,w211=2,w221=2,b2=1
● w111=-2, w112=2, w121=2,w122=-2,b1=-1,w211=-2,w221=-2,b2=1
● w111=-2, w112=2, w121=2,w122=-2,b1=-1,w211=-2,w221=-2,b2=-1
Given the following Fully Connected Network, with an input of 256 elements, 2 hidden layers and an
output layer, how many parameters do you need to estimate?
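The count can be computed layer by layer: each fully connected layer contributes (n_in × n_out) weights plus n_out biases. The hidden-layer widths below are hypothetical, since only the input size (256) is given here.

```python
def fc_param_count(layer_sizes):
    # Each fully connected layer has (n_in * n_out) weights + n_out biases
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical widths: 256 inputs, hidden layers of 128 and 64, 10 outputs
print(fc_param_count([256, 128, 64, 10]))  # 41802
```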