ECE 285
Machine Learning for Image Processing
Chapter II – Preliminaries to deep learning
Charles Deledalle
March 11, 2019
1
Deep learning – What is deep learning?
What is deep learning?
• Part of the machine learning field of learning representations of data.
Exceptionally effective at learning patterns.
• Utilizes learning algorithms that derive meaning out of data by using a
hierarchy of multiple layers that mimic the neural networks of our brain.
• If you provide the system with tons of information, it begins to understand it
and respond in useful ways.
• Rebirth of artificial neural networks.
(Source: Lucas Masuch)
2
Machine learning – Brief history
Brief history
• First wave
• 1943. McCulloch and Pitts proposed the first neural model,
• 1958. Rosenblatt introduced the Perceptron,
• 1969. Minsky and Papert’s book demonstrated the limitation of single
layer perceptrons, and almost the whole field went into hibernation.
• Second wave
• 1986. Backpropagation learning algorithm was rediscovered,
• 1989. Yann LeCun (re)introduced Convolutional Neural Networks,
• 1998. Breakthrough of CNNs in recognizing hand-written digits.
• Third wave
• 2006. Deep (neural network) learning gains popularity,
• 2012. Made significant breakthroughs in many applications,
• 2015. AlphaGo became the first program to beat a professional Go player.
(Source: Jun Wang)
3
Deep learning – Interest
Google NGRAM & Google Trends
(Source: Lucas Masuch) 4
Machine learning – Timeline
Timeline of (deep) learning
• 1958 Perceptron
• 1969 Perceptron criticized (Perceptrons book); "awkward silence" (AI winter)
• 1974 Backpropagation
• ~1980 Multilayer network
• 1995 SVM reigns (Support Vector Machines: support vectors, maximal margin hyperplane)
• 1998 Convolutional Neural Networks for Handwritten Recognition
• 2006 Restricted Boltzmann Machine
• 2012 Google Brain Project on 16k Cores
• 2012 AlexNet wins ImageNet
[Figure: timeline; also shows an artificial-neuron diagram (inputs X1, . . . , X4, weights W11, . . . , W14, weighted sum, activation function f(x)) and a maximal margin hyperplane]
(Source: Lucas Masuch & Vincent Lepetit) 5
Perceptron
Machine learning – ANN - Backpropagation
Perceptron
[Figure: timeline excerpt (1958 Perceptron, 1969 Perceptron criticized, Perceptrons book) and artificial-neuron diagram: inputs X1, . . . , X4, weights W11, . . . , W14, weighted sum, activation function f(x)]
(Source: Lucas Masuch & Vincent Lepetit) 6
Machine learning – Perceptron
Perceptron (Frank Rosenblatt, 1958)
First binary classifier based on supervised learning (discrimination).
Foundation of modern artificial neural networks.
At that time: technological, scientific and philosophical challenges.
7
Machine learning – Perceptron – Representation
Representation of the Perceptron
Parameters of the perceptron
• wk: synaptic weights
• b: bias
←− real parameters to be estimated.
Training = adjusting the weights and biases
8
Machine learning – Perceptron – Inspiration
The origin of the Perceptron
Takes inspiration from the visual system known for its ability to learn patterns.
• When a neuron receives a stimulus
with high enough voltage, it emits
an action potential (aka, nerve
impulse or spike). It is said to fire.
• The perceptron mimics this
activation effect: it fires only when
∑_i w_i x_i + b > 0

y = f(x; w) = sign(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + b) = +1 for the first class, −1 for the second class
9
Machine learning – Perceptron – Principle
1 Data are represented as vectors:
2 Collect training data with positive and negative examples:
(Source: Vincent Lepetit) 10
Machine learning – Perceptron – Principle
Dot product:
⟨w, x⟩ = ∑_{i=1}^{d} w_i x_i = wᵀ x

3 Training: find w and b so that:
• ⟨w, x⟩ + b is positive for positive samples x,
• ⟨w, x⟩ + b is negative for negative samples x.
The equation ⟨w, x⟩ + b = 0 defines a hyperplane.
The hyperplane acts as a linear separator.
w is a normal vector to the hyperplane.
(Source: Vincent Lepetit) 10
Machine learning – Perceptron – Principle
4 Testing: the perceptron can now classify new examples.
• A new example x is classified positive if ⟨w, x⟩ + b is positive,
• and negative if ⟨w, x⟩ + b is negative.

(signed) distance of x to the hyperplane: r = (⟨w, x⟩ + b) / ||w||
(Source: Vincent Lepetit) 10
Machine learning – Perceptron – Representation
Alternative representation
Use the zero-index to encode the bias as a synaptic weight.
Simplifies algorithms as all parameters can now be treated in the same way.
11
Machine learning – Perceptron – Training
Perceptron algorithm
Goal: find the vector of weights w from a labeled training dataset T
How: minimize classification errors
min_w E(w) = − ∑_{(x,d)∈T s.t. y≠d} d × ⟨w, x⟩

• penalizes only misclassified samples (y ≠ d),
• zero if all samples are correctly classified,
• proportional to the absolute distance to the hyperplane (d × ⟨w, x⟩ < 0 for misclassified samples)
12
Machine learning – Perceptron – Training
Perceptron algorithm
Algorithm:
• Initialize w randomly
• Repeat until convergence
• For all (x, d) ∈ T
• Compute: y = sign(⟨w, x⟩)
• If y ≠ d:
Update: w ← w + dx
• Converges to some solution if the training data are linearly separable,
• Corresponds to stochastic gradient descent for E(w) (see later),
• But may pick any of many solutions of varying quality.
⇒ Poor generalization error.
13
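A minimal numpy sketch of this algorithm, on a small hypothetical dataset (the bias is folded into the weights through a constant first feature, as in the alternative representation above):

import numpy as np

# Toy linearly separable dataset (hypothetical values); the first column of ones encodes the bias.
X = np.array([[1, 2.0, 1.0], [1, 1.0, -1.0], [1, -1.5, 0.5], [1, -2.0, -1.0]])
d = np.array([+1, +1, -1, -1])           # desired classes
w = np.random.randn(3)                   # random initialization

for epoch in range(100):
    errors = 0
    for xi, di in zip(X, d):
        y = np.sign(w.dot(xi))
        if y != di:                      # update only on misclassified samples
            w = w + di * xi
            errors += 1
    if errors == 0:                      # all samples correctly classified: stop
        break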
Machine learning – Perceptron – Training
Variant: ADALINE (Adaptive Linear Neuron) algorithm
(Widrow & Hoff, 1960).
Loss: minimize instead the least-squares error of the prediction before the sign is applied

min_w E(w) = ∑_{(x,d)∈T} (⟨w, x⟩ − d)²
Algorithm:
• Initialize w randomly
• Repeat until convergence
• For all (x, d) ∈ T
• Compute: y = ⟨w, x⟩
• Update: w ← w + γ(d − y)x, γ > 0
Also corresponds to stochastic gradient descent for E(w) (see later).
14
Machine learning – Perceptron – Training
Perceptron vs ADALINE algorithm
Activation function = sign
15
Machine learning – Perceptron – Perceptrons book
Perceptrons book (Minsky and Papert, 1969)
A perceptron can only classify data points that are linearly separable:
Seen by many as a justification to stop research on perceptrons.
(Source: Vincent Lepetit)
16
Artificial neural network
Machine learning – Artificial neural network
Artificial neural network
(Source: Lucas Masuch & Vincent Lepetit) 17
Machine learning – Artificial neural network
Artificial neural network
• Supervised learning method initially inspired by
the behavior of the human brain.
• Consists of the inter-connection of several
small units (just like in the human brain).
• Essentially numerical but can handle
classification and discrete inputs with
appropriate coding.
• Introduced in the late 50s, very popular in the
90s, reappeared in the 2010s with deep
learning.
• Also referred to as Multi-Layer Perceptron (MLP).
• Historically used after feature extraction.
18
Machine learning – Artificial neural network
Artificial neuron (McCulloch & Pitts, 1943)
• An artificial neuron contains several incoming weighted connections, an
outgoing connection and has a nonlinear activation function g.
• Neurons are trained to filter and detect specific features or patterns (e.g.
edge, nose) by receiving weighted input, transforming it with the
activation function and passing it to the outgoing connections.
• Unlike the perceptron, can be used for regression (with proper choice of g).
19
Machine learning – ANN
Artificial neural network / Multilayer perceptron / NeuralNet
• Inter-connection of several artificial neurons.
• Each level in the graph is called a layer:
• Input layer,
• Hidden layer(s),
• Output layer.
• Each neuron (also called node or unit) in
the hidden layers acts as a classifier.
• Feedforward NN (no cycle)
• first and simplest type of NN,
• information moves in one direction.
• Recurrent NN (with cycle)
• use for time sequences,
• such as speech-recognition.
Quiz: how many layers does
this network have?
20
Machine learning – ANN
Artificial neural network / Multilayer perceptron / NeuralNet
h_1 = g_1( w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + b^1_1 )
h_2 = g_1( w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + b^1_2 )
h_3 = g_1( w^1_{31} x_1 + w^1_{32} x_2 + w^1_{33} x_3 + b^1_3 )
h_4 = g_1( w^1_{41} x_1 + w^1_{42} x_2 + w^1_{43} x_3 + b^1_4 )

y_1 = g_2( w^2_{11} h_1 + w^2_{12} h_2 + w^2_{13} h_3 + w^2_{14} h_4 + b^2_1 )
y_2 = g_2( w^2_{21} h_1 + w^2_{22} h_2 + w^2_{23} h_3 + w^2_{24} h_4 + b^2_2 )

w^k_{ij}: synaptic weight between previous node j and next node i at layer k.
g_k: any activation function applied to each coefficient of its input vector.
In matrix-vector form: h = g_1(W_1 x + b_1) and y = g_2(W_2 h + b_2).
The matrices W_k and biases b_k are learned from labeled training data.
21
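A minimal numpy sketch of this forward pass for the 3-4-2 network above, with placeholder random parameters (the choices g1 = ReLU and g2 = identity are assumptions for the example):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))   # layer 1: 3 inputs -> 4 hidden
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))   # layer 2: 4 hidden -> 2 outputs

def g1(a): return np.maximum(a, 0)   # hidden activation (ReLU)
def g2(a): return a                  # linear output

x = rng.standard_normal((3, 1))      # one input vector (x1, x2, x3) in column
h = g1(W1.dot(x) + b1)               # h = g1(W1 x + b1)
y = g2(W2.dot(h) + b2)               # y = g2(W2 h + b2)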
Machine learning – ANN
Artificial neural network / Multilayer perceptron
• It can have 1 hidden layer only (shallow network),
• it can have more than 1 hidden layer (deep network),
• each layer may have a different size, and
• hidden and output layers often have different activation functions.
22
Machine learning – ANN
Artificial neural network / Multilayer perceptron
• As for the perceptron, the biases can be integrated into the weights:
W_k h_{k−1} + b_k = [ b_k  W_k ] [ 1 ; h_{k−1} ] = W̃_k h̃_{k−1},
with W̃_k = [ b_k  W_k ] and h̃_{k−1} = [ 1 ; h_{k−1} ].

• A neural network with L layers is a function of x parameterized by W̃:
y = f(x; W̃)  where  W̃ = ( W̃_1, W̃_2, . . . , W̃_L )
• It can be defined recursively as
y = f(x; W̃) = h_L,  h_k = g_k( W̃_k h̃_{k−1} )  and  h_0 = x
• For simplicity, W̃ will be denoted W (when no confusion is possible).
23
Machine learning – ANN – Activation functions
Activation functions
Linear units: g(a) = a
y = W_L h_{L−1} + b_L
h_{L−1} = W_{L−1} h_{L−2} + b_{L−1}
⇒ y = W_L W_{L−1} h_{L−2} + W_L b_{L−1} + b_L
⇒ y = W_L . . . W_1 x + ∑_{k=1}^{L−1} W_L . . . W_{k+1} b_k + b_L

We can always find an equivalent network without hidden units,
because a composition of affine functions is affine.
Sometimes used for linear dimensionality reduction (similarly to PCA).
In general, non-linearity is needed to learn complex (non-linear)
representations of data; otherwise the NN would be just a linear function,
and we would be back to the problem of non-linearly separable datasets.
24
Machine learning – ANN – Activation functions
Activation functions
Threshold units: for instance the sign function
g(a) = −1 if a < 0, +1 otherwise,

or Heaviside (aka, step) activation functions

g(a) = 0 if a < 0, 1 otherwise.
Discontinuities in the hidden layers
make the optimization really difficult.
We prefer functions that are continuous and differentiable.
25
Machine learning – ANN – Activation functions
Activation functions
Sigmoidal units: for instance the hyperbolic tangent function
g(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) ∈ [−1, 1]

or the logistic sigmoid function

g(a) = logistic(a) = 1 / (1 + e^{−a}) ∈ [0, 1]

• In fact equivalent by linear transformations:
tanh(a/2) = 2 logistic(a) − 1
• Differentiable approximations of the step and sign functions, respectively.
• Act as threshold units for large values of |a| and as linear for small values.
26
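This identity can be checked directly:

tanh(a/2) = (e^{a/2} − e^{−a/2}) / (e^{a/2} + e^{−a/2})
          = (e^a − 1) / (e^a + 1)            (multiplying numerator and denominator by e^{a/2})
          = 2e^a / (e^a + 1) − 1
          = 2 / (1 + e^{−a}) − 1
          = 2 logistic(a) − 1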
Machine learning – ANN
Sigmoidal units: logistic activation functions are used in binary classification
(class C1 vs C2) as they can be interpreted as posterior probabilities:
y = P(C1|x) and 1 − y = P(C2|x)
The architecture of the network defines the shape of the separator...
[Figure: decision regions of networks of increasing complexity; the separation is {x : P(C1|x) = P(C2|x)}]
Complexity/capacity of
the network
⇒
Trade-off between
generalization and
overfitting.
27
Machine learning – ANN – Activation functions
Activation functions
“Modern” units:
g(a) = max(a, 0)   (ReLU)
or
g(a) = log(1 + e^a)   (Softplus)

Most neural networks nowadays use ReLU
(Rectified Linear Unit) for hidden layers, since it
trains much faster, is more expressive
than the logistic function and prevents the
vanishing gradient problem (see later).
(Source: Lucas Masuch)
28
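A minimal numpy sketch of these two units (using np.log1p for the softplus is an implementation detail, not something from the slide):

import numpy as np

def relu(a):
    return np.maximum(a, 0)        # max(a, 0), element-wise

def softplus(a):
    return np.log1p(np.exp(a))     # log(1 + e^a), a smooth approximation of ReLU

a = np.linspace(-5, 5, 11)
print(relu(a))
print(softplus(a))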
Machine learning – ANN
Neural networks solve non-linearly separable problems
(Source: Vincent Lepetit) 29
Machine learning – UAT
Universal Approximation Theorem
(Hornik et al, 1989; Cybenko, 1989)
Any continuous function can be approximated by a feedforward shallow
network (i.e., with 1-hidden layer only) with a sufficient number of neurons in
the hidden layer.
• The theorem does not say how large the network needs to be.
• No guarantee that the training algorithm will be able to train the network.
30
Tasks, architectures and loss functions
Tasks, architectures and loss functions
Approximation – Least square regression
• Goal: Predict a real multivariate function.
• How: estimate the coefficients W of y = f(x; W )
from labeled training examples where labels are real vectors:
• Typical architecture:
• Hidden layer:
ReLU(a) = max(a, 0)
• Linear output:
g(a) = a
31
Tasks, architectures and loss functions
Approximation – Least square regression
• Loss: As for the polynomial curve fitting, it is standard to consider the
sum of square errors (assumption of Gaussian distributed errors)
E(W) = ∑_{i=1}^{N} ||y_i − d_i||²_2 = ∑_{i=1}^{N} ||f(x_i; W) − d_i||²_2

and look for W* such that ∇E(W*) = 0. Recall: SSE ≡ SSD ≡ MSE

• Solution: Provided the network has enough flexibility and the size of the
training set grows to infinity,

y = f(x; W) = E[d|x] = ∫ d p(d|x) dd   (posterior mean)
32
Tasks, architectures and loss functions
Binary classification – Logistic regression
• Goal: Classify object x into class C1 or C2.
• How: estimate the coefficients W of a real function y = f(x; W ) ∈ [0, 1]
from training examples with labels 1 (for class C1) and 0 (otherwise):
T = {(x_i, d_i)}_{i=1..N}
• Typical architecture:
• Hidden layer:
ReLU(a) = max(a, 0)
• Output layer:
logistic(a) = 1 / (1 + e^{−a})
33
Tasks, architectures and loss functions
Binary classification – Logistic regression
• Loss: it is standard to consider the cross-entropy for two-classes
(assumption of Bernoulli distributed data)
E(W) = − ∑_{i=1}^{N} [ d_i log y_i + (1 − d_i) log(1 − y_i) ]   with y_i = f(x_i; W)

and look for W* such that ∇E(W*) = 0.

• Solution: Provided the network has enough flexibility and the size of the
training set grows to infinity,

y = f(x; W) = P(C1|x)   (posterior probability)
34
Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
(aka, multinomial classification)
• Goal: Classify an object x into one among K classes C1, . . . , CK .
• How: estimate the coefficients W of a multivariate function
y = f(x; W) ∈ [0, 1]^K

from training examples T = {(x_i, d_i)} where d_i is a 1-of-K (one-hot) code
• Class 1: d_i = (1, 0, . . . , 0)ᵀ if x_i ∈ C1
• Class 2: d_i = (0, 1, . . . , 0)ᵀ if x_i ∈ C2
• . . .
• Class K: d_i = (0, 0, . . . , 1)ᵀ if x_i ∈ CK
• Remark: do not use the class index k directly as a scalar label.
35
Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
• Typical architecture:
• Hidden layer:
ReLU(a) = max(a, 0)
• Output layer:
softmax(a)_k = exp(a_k) / ∑_{l=1}^{K} exp(a_l)
Softmax guarantees the outputs yk to be positive and sum to 1.
Generalization of the logistic sigmoid activation function.
Smooth version of winner-takes-all activation model (maxout).
(largest gets +1 others get 0).
36
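A minimal numpy sketch of the softmax output; subtracting the maximum before exponentiating is a standard numerical-stability trick (an addition, it does not change the result):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=0, keepdims=True))   # one column of activations per sample
    return e / e.sum(axis=0, keepdims=True)

a = np.array([[2.0], [1.0], [0.1]])   # K = 3 output activations for one sample
y = softmax(a)
print(y, y.sum())                     # positive entries summing to 1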
Tasks, architectures and loss functions
Multiclass classification – Multivariate logistic regression
• Loss: it is standard to consider the cross-entropy for K classes
(assumption of multinomial distributed data)
E(W) = − ∑_{i=1}^{N} ∑_{k=1}^{K} d_{i,k} log y_{i,k}   with y_i = f(x_i; W)

and look for W* such that ∇E(W*) = 0.

• Solution: Provided the network has enough flexibility and the size of the
training set grows to infinity,

y_k = f_k(x; W) = P(Ck|x)   (posterior probability)
37
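A minimal numpy sketch of this loss for one-hot targets and predicted probabilities (the K x N array shapes and the names are assumptions for the example):

import numpy as np

def cross_entropy(y, d):
    # E(W) = - sum_i sum_k d_{i,k} log y_{i,k}
    return -np.sum(d * np.log(y))

d = np.array([[1, 0], [0, 1], [0, 0]])               # one-hot targets (K = 3 classes, N = 2 samples)
y = np.array([[0.7, 0.2], [0.2, 0.7], [0.1, 0.1]])   # softmax outputs
print(cross_entropy(y, d))                            # = -(log 0.7 + log 0.7)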
Machine learning – ANN
Multi-label classification:
(aka, multi-output classification)
• Goal: Classify object x into zero or several classes C1, . . . , CK .
The classes are not (necessarily) mutually exclusive.
• How: Combination of binary-classification networks.
• Typical architecture:
• Remark: For categorical inputs, use also 1-of-K codes.
38
Backpropagation
Machine learning – ANN - Backpropagation
Learning with backpropagation
(Source: Lucas Masuch & Vincent Lepetit) 39
Machine learning – ANN – Learning
Training process
• To train a neural network over a large set of labeled data, you must
continuously compute the difference between the network’s predicted
output and the actual output.
• This difference is measured by the loss E, and the process for training a
net is known as backpropagation, or backprop.
• During backprop, weights and biases are tweaked slightly until the lowest
possible loss is achieved.
−→ Optimization: look for ∇E = 0
• The gradient is an important aspect of this process, as a measure of how
much the loss changes with respect to a change in a weight or bias value.
(Source: Caner Hazırbaş)
40
Machine learning – ANN – Learning
Training process
(Source: Lucas Masuch) 41
Machine learning – ANN – Optimization
Objective: min_W E(W) ⇒ ∇E(W) = ( ∂E(W)/∂W_1, . . . , ∂E(W)/∂W_L )ᵀ = 0

Loss functions: recall that classical loss functions are
• Square error (for regression: d_k ∈ R, y_k ∈ R)

E(W) = ½ ∑_{(x,d)∈T} ||y − d||²_2 = ½ ∑_{(x,d)∈T} ∑_k (y_k − d_k)²

• Cross-entropy (for multi-class classification: d_k ∈ {0, 1}, y_k ∈ [0, 1])

E(W) = − ∑_{(x,d)∈T} ∑_k d_k log y_k

Solution: no closed-form solutions ⇒ use gradient descent. What is it?
42
Machine learning – ANN – Optimization – Gradient descent
An iterative algorithm trying to find a minimum of a real function.
Gradient descent
• Let F be a real function, differentiable and lower bounded, with an L-Lipschitz
gradient (see next slide). Then, whatever the initialization x^0,
if 0 < γ < 2/L, the sequence

x^{t+1} = x^t − γ ∇F(x^t),

converges to a stationary point x* (i.e., a point canceling the gradient)

∇F(x*) = 0.

• The parameter γ is called the step size (or learning rate in the ML field).
• Too small a step size γ leads to slow convergence.
43
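A minimal sketch of this iteration on the 1d quadratic example used on the next slides, F(x) = ½(x − y)², for which L = 1 and any 0 < γ < 2 converges:

# Gradient descent on F(x) = 0.5 * (x - y)**2, whose gradient is F'(x) = x - y
y = 3.0
gamma = 0.5            # step size, within (0, 2/L) = (0, 2)
x = 10.0               # arbitrary initialization x^0
for t in range(50):
    grad = x - y       # gradient of F at x^t
    x = x - gamma * grad
print(x)               # close to the stationary point x* = y = 3.0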
Machine learning – ANN – Optimization – Gradient descent
Lipschitz gradient
• A differentiable function F has an L-Lipschitz gradient if

||∇F(x_1) − ∇F(x_2)||_2 ≤ L ||x_1 − x_2||_2, for all x_1, x_2.

• The mapping x → ∇F(x) is necessarily continuous.
• If F is twice differentiable with bounded Hessian, then

L = sup_x ||∇²F(x)||_2   (∇²F(x): Hessian matrix of F)

where, for a matrix A, its 2-norm ||A||_2 is its maximal singular value.
44
Machine learning – ANN – Optimization – Gradient descent
Gradient descent and Lipschitz gradient
Example (1d quadratic loss)
• Consider: F(x) = ½ (x − y)²
• We have: F′(x) = x − y
• And: F″(x) = 1
• Then: L = sup_x |F″(x)| = 1
• Thus: The range for γ is 0 < γ < 2/L = 2
• And: x^{t+1} = x^t − γ(x^t − y) converges, whatever x^0, towards x*
such that F′(x*) = 0 ⇒ x* = y
Question: what is the best value for γ?
45
Machine learning – ANN – Optimization – Gradient descent
Gradient descent and Lipschitz gradient
Example (1d quartic loss)
• Consider: F(x) = ¼ (x − y)⁴
• We have: F′(x) = (x − y)³
• And: F″(x) = 3(x − y)²
• Then: L = sup_x |F″(x)| = ∞
• Thus: There is no γ satisfying 0 < γ < 2/L
• And: x^{t+1} = x^t − γ(x^t − y)³, for γ > 0, may diverge,
oscillate forever or converge to the solution y.
Convergence can be obtained for some specific choices
of γ and initialization x^0.
46
Machine learning – ANN – Optimization – Gradient descent
These two curves cross at x* such that F′(x*) = 0
47
Machine learning – ANN – Optimization – Gradient descent
Here γ is small: slow convergence
47
Machine learning – ANN – Optimization – Gradient descent
γ a bit larger: faster convergence
47
Machine learning – ANN – Optimization – Gradient descent
γ ≈ 1/L even larger: around fastest convergence
47
Machine learning – ANN – Optimization – Gradient descent
γ a bit too large: convergence slows down
47
Machine learning – ANN – Optimization – Gradient descent
γ too large: convergence too slow again
47
Machine learning – ANN – Optimization – Gradient descent
γ > 2/L: divergence
48
Machine learning – ANN – Optimization – Gradient descent
Gradient descent for convex function
• If moreover F is convex, i.e.,

F(λx_1 + (1 − λ)x_2) ≤ λF(x_1) + (1 − λ)F(x_2), ∀ x_1, x_2, λ ∈ (0, 1),

then gradient descent converges towards a global minimum

x* ∈ argmin_x F(x).

• For 0 < γ < 2/L, the sequence |F(x^k) − F(x*)| decays in O(1/k).
• NB: All stationary points are global minima (not necessarily unique).
49
Machine learning – ANN – Optimization – Gradient descent
[Figure: gradient descent trajectories towards minima (*) in one and two dimensions]
50
Machine learning – ANN – Optimization – Gradient descent
Influence of the step parameter (learning rate)
51
Machine learning – ANN – Optimization
Let's start with a single artificial neuron
Example ((Batch) Perceptron algorithm)
• Model: y = sign(⟨w, x⟩)
• Loss: E(w) = − ∑_{(x,d)∈T s.t. y≠d} d × ⟨w, x⟩
• Gradient: ∇E(w) = − ∑_{(x,d)∈T s.t. y≠d} d × x
• Gradient descent: w^{t+1} ← w^t + γ ∑_{(x,d)∈T s.t. y^t≠d} d × x
Convex or non-convex?
52
Machine learning – ANN – Optimization
Let's start with a single artificial neuron
Example ((Batch) ADALINE)
• Model: y = ⟨w, x⟩
• Loss: E(w) = ½ ∑_{(x,d)∈T} (⟨w, x⟩ − d)²   (each term = d² + wᵀxxᵀw − 2dwᵀx)
• Gradient: ∇E(w) = ∑_{(x,d)∈T} x (xᵀw − d)   (with y = xᵀw)
• Gradient descent: w^{t+1} ← w^t + γ ∑_{(x,d)∈T} (d − y^t) x
Convex or non-convex?
If enough training samples: lim_{t→∞} w^t = ( ∑_{(x,d)∈T} x xᵀ )^{−1} ∑_{(x,d)∈T} d x
(the matrix ∑_{(x,d)∈T} x xᵀ is the Hessian of E)
53
Machine learning – ANN – Optimization
Back to our optimization problem
In our case W → E(W ) is non-convex ⇒ No guarantee of convergence.
Convergence will depend on
• the initialization,
• the step size γ.
Because of this:
• Normalizing each data point x to the range [−1, +1] is important to
control the Hessian → the stability and the speed of the algorithm,
• The activation functions and the loss should be chosen to have a second
derivative smaller than 1 (when combined),
• The initialization should be random with well chosen variance
(we will go back to this later),
• If all of these are satisfied, we can generally choose γ ∈ [.001, 1].
54
Machine learning – ANN – Optimization
Back to our optimization problem
In our case W → E(W ) is non-convex ⇒ No guarantee of convergence.
Even if so, the limit solution depends on:
• the initialization,
• the step size γ.
Nevertheless, really good minima or saddle points are reached in practice by
W^{t+1} ← W^t − γ ∇E(W^t),   γ > 0

Gradient descent can be expressed coordinate by coordinate as:

w^{t+1}_{i,j} ← w^t_{i,j} − γ ∂E(W^t)/∂w_{i,j}

for all weights w_{i,j} linking a node j to a node i in the next layer.

⇒ The algorithm to compute ∂E(W)/∂w_{i,j} for ANNs is called backpropagation.
55
Machine learning – ANN – Optimization
Backpropagation: computation of ∂E(W)/∂w_{i,j}

Feedforward least square regression context
• Model: Feed-forward neural network (for simplicity, without bias).
• Loss function: E(W) = ½ ∑_{(x,d)∈T} ∑_k (y_k − d_k)²

We have: E(W) = ∑_{(x,d)∈T} ∑_k ½ (y_k − d_k)², and we denote each term e_k.

Apply linearity: ∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} ∑_k ∂e_k/∂w_{i,j}
56
Machine learning – ANN – Optimization
1. Case where w_{i,j} is a synaptic weight of the output layer
• j: neuron in the previous hidden layer
• h_j: response of hidden neuron j
• w_{i,j}: synaptic weight between j and i
• y_i: response of output neuron i

y_i = g(a_i)   with   a_i = ∑_j w_{i,j} h_j

Apply the chain rule:

∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} ∑_k ∂e_k/∂w_{i,j} = ∑_{(x,d)∈T} ∑_k (∂e_k/∂y_i)(∂y_i/∂a_i)(∂a_i/∂w_{i,j})
57
Machine learning – ANN – Optimization
1. Case where wij is a synaptic weight for the output layer
e_k = ½ (y_k − d_k)²  ⇒  ∂e_k/∂y_i = y_i − d_i if k = i, 0 otherwise
y_i = g(a_i)  ⇒  ∂y_i/∂a_i = g′(a_i)
a_i = ∑_j w_{i,j} h_j  ⇒  ∂a_i/∂w_{i,j} = h_j

⇒ ∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} ∑_k (∂e_k/∂y_i)(∂y_i/∂a_i)(∂a_i/∂w_{i,j})
                 = ∑_{(x,d)∈T} (y_i − d_i) g′(a_i) h_j
                 = ∑_{(x,d)∈T} δ_i h_j   where δ_i = ∑_k ∂e_k/∂a_i
58
Machine learning – ANN – Optimization
2. Case where w_{i,j} is a synaptic weight of a hidden layer
• j: neuron in the previous hidden layer
• h_j: response of hidden neuron j
• w_{i,j}: synaptic weight between j and i
• h_i: response of hidden neuron i

h_i = g(a_i)   with   a_i = ∑_j w_{i,j} h_j

Apply the chain rule:

∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} ∑_k ∂e_k/∂w_{i,j}
              = ∑_{(x,d)∈T} ∑_k (∂e_k/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
              = ∑_{(x,d)∈T} ∑_k ∑_l (∂e_k/∂a_l)(∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
              = ∑_{(x,d)∈T} ∑_l ( ∑_k ∂e_k/∂a_l ) (∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j}),   with δ_l = ∑_k ∂e_k/∂a_l
59
Machine learning – ANN – Optimization
2. Case where wij is a synaptic weight for a hidden layer
a_l = ∑_i w_{l,i} h_i  ⇒  ∂a_l/∂h_i = w_{l,i}
h_i = g(a_i)  ⇒  ∂h_i/∂a_i = g′(a_i)
a_i = ∑_j w_{i,j} h_j  ⇒  ∂a_i/∂w_{i,j} = h_j

⇒ ∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} ∑_l δ_l (∂a_l/∂h_i)(∂h_i/∂a_i)(∂a_i/∂w_{i,j})
                 = ∑_{(x,d)∈T} ( ∑_l w_{l,i} δ_l ) g′(a_i) h_j
                 = ∑_{(x,d)∈T} δ_i h_j
60
Machine learning – ANN – Optimization
Backpropagation algorithm
(Werbos, 1974 & Rumelhart, Hinton and Williams, 1986)
∂E(W)/∂w_{i,j} = ∑_{(x,d)∈T} δ_i h_j   where h_j = x_j if j is an input node,

and where δ_i = g′(a_i) × { y_i − d_i if i is an output node ; ∑_l w_{l,i} δ_l otherwise }
For all input x and desired output d
• Forward step:
→ compute the response (hj, ai and yi) of all neurons,
→ start from the first hidden layer and proceed towards the output one.
• Backward step:
→ Backpropagate the error (δ_i) from the output layer to the first layer.
Update w_{i,j} ← w_{i,j} − γ δ_i h_j, and repeat everything until convergence.
61
Machine learning – ANN – Optimization
Backpropagation algorithm with matrix-vector form
Easier to use matrix-vector notations for each layer (k denotes the layer):

∇_{W_k} E(W) = δ_k h_{k−1}ᵀ   where h_0 = x

where δ_k = g′(a_k) ⊙ { y − d if k is the output layer ; W_{k+1}ᵀ δ_{k+1} otherwise }   (⊙: element-wise product)

• x: matrix with all training input vectors in columns,
• d: matrix with the corresponding desired target vectors in columns,
• y: matrix with all predictions in columns,
• a_k = W_k h_{k−1}: matrix with all weighted sums in columns,
• h_k = g(a_k): matrix with all hidden outputs in columns,
• W_k: matrix of weights at layer k.
62
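A minimal sketch of one such gradient computation for a 2-layer network with square loss, in the notation above (W1, W2, an activation g with derivative gp, and the column-stacked x and d are assumed to be already defined):

# Forward phase (one column per training sample)
a1 = W1.dot(x)
h1 = g(a1)
a2 = W2.dot(h1)
y = g(a2)
# Backward phase: delta_k = g'(a_k) * (backpropagated error)
delta2 = gp(a2) * (y - d)             # output layer: y - d
delta1 = gp(a1) * W2.T.dot(delta2)    # hidden layer: W2^T delta2
# Gradients: grad_{W_k} E = delta_k h_{k-1}^T, with h_0 = x
grad_W2 = delta2.dot(h1.T)
grad_W1 = delta1.dot(x.T)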
Machine learning – ANN – Optimization
Backpropagation algorithm
[Figure: the forward phase, the error evaluation and the backward phase illustrated step by step on a small network]
63
Machine learning – ANN – Optimization
Case where g is the logistic sigmoid function
• Recall that the logistic sigmoid function is given by:

g(a) = 1 / (1 + e^{−a}) = 1 / u(a)   where u(a) = 1 + e^{−a}

• We have:

u′(a) = −e^{−a}

• Then, we get

g′(a) = −u′(a) / u(a)² = e^{−a} / (1 + e^{−a})²
      = (1 + e^{−a}) / (1 + e^{−a})² − 1 / (1 + e^{−a})²
      = 1 / (1 + e^{−a}) − 1 / (1 + e^{−a})²
      = 1 / (1 + e^{−a}) × ( 1 − 1 / (1 + e^{−a}) )
      = g(a) (1 − g(a))
64
Machine learning – ANN – Optimization
What if g is non-differentiable?
For instance: g(a) = ReLU(a) = max(a, 0)
• ReLU is continuous and almost everywhere differentiable:
∂max(a, 0)/∂a = 1 if a > 0, 0 if a < 0, undefined otherwise

• Gradient descent can handle this case with a simple modification:

∂max(a, 0)/∂a = 1 if a > 0, 0 if a ≤ 0
• When g is convex, this is called a sub-gradient, and gradient descent is
called sub-gradient descent.
65
Machine learning – ANN – Optimization
Let’s try to learn the x-or function
import numpy as np
# Training set
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
d = np.array([0, +1, +1, 0]).T
# Initialization for a 2-layer feedforward network
b1 = np.random.rand(2, 1)
W1 = np.random.rand(2, 2)
b2 = np.random.rand(1, 1)
W2 = np.random.rand(1, 2)
# Activation functions and their derivatives
def g1(a): return a * (a > 0)  # ReLU
def g1p(a): return 1 * (a > 0)
def g2(a): return a  # Linear
def g2p(a): return 1
66
Machine learning – ANN – Optimization
Let’s try to learn the x-or function
import numpy as np
# Training set
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
d = np.array([0, +1, +1, 0]).T
# Initialization for a 2-layer feedforward network
b1 = np.random.rand(2, 1)
W1 = np.random.rand(2, 2)
b2 = np.random.rand(1, 1)
W2 = np.random.rand(1, 2)
# Activation functions and their derivatives
def g1(a): return a * (a > 0)  # ReLU
def g1p(a): return 1 * (a > 0)
def g2(a): return 1 / (1 + np.exp(-a))  # Logistic
def g2p(a): return g2(a) * (1 - g2(a))
66
Machine learning – ANN – Optimization
Let’s try to learn the x-or function
gamma = .01  # step parameter (learning rate)
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x)
    h1 = g1(a1)
    a2 = W2.dot(h1)
    y = g2(a2)
    # Error evaluation
    e = y - d
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
67
Machine learning – ANN – Optimization
Let’s try to learn the x-or function
Remark 1: dealing with the
bias is similar
gamma = .01  # step parameter
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x) + b1
    h1 = g1(a1)
    a2 = W2.dot(h1) + b2
    y = g2(a2)
    # Error evaluation
    e = y - d
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
    b2 = b2 - gamma * delta2.sum(axis=1, keepdims=True)
    b1 = b1 - gamma * delta1.sum(axis=1, keepdims=True)
67
Machine learning – ANN – Optimization
Let’s try to learn the x-or function
Remark 1: dealing with the
bias is similar
Remark 2: using cross-entropy
is simple too
gamma = .01  # step parameter
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x) + b1
    h1 = g1(a1)
    a2 = W2.dot(h1) + b2
    y = g2(a2)
    # Error evaluation
    e = -d / y + (1 - d) / (1 - y)
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
    b2 = b2 - gamma * delta2.sum(axis=1, keepdims=True)
    b1 = b1 - gamma * delta1.sum(axis=1, keepdims=True)
67
Machine learning – ANN – Optimization
Let's see how it works
[Figure: evolution of the learned function on the x-or problem over the (x1, x2) ∈ [0, 1]² square, shown at iterations 1, 11, 21, . . . , 101, 201, . . . , 1001, 2001, . . . , 9001]
68
Machine learning – ANN – Optimization
Comparisons – different initializations & activations
Success (Global minima)
[Figure: runs with linear and logistic outputs]
69
Machine learning – ANN – Optimization
Comparisons – different initializations & activations
Failures (Local minima or saddle points)
[Figure: runs with linear and logistic outputs]
70
Machine learning – ANN – Optimization
The 1990s view of ANN and back-propagation
Advantages
• Universal approximators,
• May be very accurate.
Drawbacks
• Black box models (very difficult to interpret).
• The learning time does not scale well
• it was very slow in networks with multiple hidden layers.
• It got stuck at local optima or saddle points
• these can be nevertheless often surprisingly good.
• Not all global minima have the same quality (generalization error)
• strong variations in prediction errors on different testing datasets.
71
Support Vector Machine
Machine learning – Support Vector Machine
Support Vector Machine
(Source: Lucas Masuch & Vincent Lepetit) 72
Machine learning – Support Vector Machine
Support Vector Machine
(Vapnik, 1995)
• Very successful methods in the mid-90’s.
• Developed for binary classification.
• Clever type of perceptron.
• Based on two major ideas
1 large margin,
2 kernel trick.
Many NNs researchers switched to SVMs in the 1990s
because they (used to) work better.
73
Machine learning – Support Vector Machine
Shattering points with oriented hyperplanes
• Goal: Build hyperplanes that separate points in two classes,
• Question: Which is the best separating line?
Remember, hyperplane: ⟨w, x⟩ + b = 0
74
Machine learning – Support Vector Machine
Classification margin
• Signed distance from an example to the separator: r = (⟨x, w⟩ + b) / ||w||
• Examples closest to the hyperplane are support vectors.
• Margin ρ is the distance between the separator and the support vector(s).
• Is this one good? Consider two testing samples: where are they assigned?
• This separator may have large generalization error.
75
Machine learning – Support Vector Machine
Largest margin = Support Vector Machine
• Maximizing the margin reduces generalization error (see, PAC learning).
• Then, there are necessarily support vectors on both sides of the hyperplane.
• Only support vectors are important ⇒ other examples can be discarded.
[Figure: poor separation vs optimal (maximal margin) separation]
76
Machine learning – Support Vector Machine
Training
• As for neural networks, the parameters w and b are obtained by optimizing
a loss function (the margin).
• What is the formula for the margin?
• Training dataset: feature vectors x_i, targeted class d_i ∈ {−1, +1}
• Assume the training samples to be linearly separable,
i.e., there exist w and b such that

⟨x_i, w⟩ + b ≥ +1 if d_i = +1
⟨x_i, w⟩ + b ≤ −1 if d_i = −1

i.e., d_i (⟨x_i, w⟩ + b) ≥ 1.   Then: |⟨x_i, w⟩ + b| ≥ 1
77
Machine learning – Support Vector Machine
Training
• For support vectors, the inequality can be forced to be an equality:

|⟨x_i, w⟩ + b| = 1

• Then, the distance between any support vector and the hyperplane is

|r| = |⟨x_i, w⟩ + b| / ||w|| = 1 / ||w||

• So, the margin is (by definition)

ρ = |r| = 1 / ||w||

• Maximizing the margin is then equivalent to minimizing ||w||², provided
the hyperplane (w, b) separates the data.
78
Machine learning – Support Vector Machine
Training
• Therefore, the problem can be recast as

min_{w,b} ½ ||w||²   subject to d_i (⟨x_i, w⟩ + b) ≥ 1, for all i.

⇒ Quadratic (convex) optimization problem subject to linear constraints,
⇒ No local minima! Only a single global one.

Equivalent formulation:

min_{w,b} E(w, b)   where E(w, b) = ½ ||w||² if d_i (⟨x_i, w⟩ + b) ≥ 1 for all i, +∞ otherwise
79
Machine learning – Support Vector Machine
Training
• We can use the technique of Lagrange multipliers:
1 Introduce a Lagrange multiplier α_i ≥ 0 for each constraint.
2 Let α = (α_1, α_2, . . .)ᵀ be the vector of Lagrange multipliers.
3 Define the (primal) function as

L_P(w, b, α) = ½ ||w||² − ∑_i α_i ( d_i (⟨x_i, w⟩ + b) − 1 )

4 Solve the saddle-point problem

E(w, b) = max_α L_P(w, b, α)   ⇒   min_{w,b} E(w, b) = min_{w,b} max_α L_P(w, b, α)
80
Machine learning – Support Vector Machine
Training
• When the original objective function is convex (and only then), we can
interchange the minimization and maximization:

min_{w,b} max_α L_P(w, b, α) = max_α min_{w,b} L_P(w, b, α)

• The following is called the dual problem:

L_D(α) = min_{w,b} L_P(w, b, α) = min_{w,b} [ ½ ⟨w, w⟩ − ∑_i α_i ( d_i (⟨x_i, w⟩ + b) − 1 ) ]

• For a fixed α, the minimum is achieved by simultaneously canceling the
gradients with respect to w and b:

∇_w L_P(w, b, α) = 0 ⇒ w − ∑_i α_i d_i x_i = 0 ⇒ w = ∑_i α_i d_i x_i

and ∇_b L_P(w, b, α) = 0 ⇒ ∑_i α_i d_i = 0
81
Machine learning – Support Vector Machine
Training
The dual is obtained by plugging these equations into L_P(w, b, α):

L_D(α) = ½ ⟨w, w⟩ − ∑_i α_i ( d_i (⟨x_i, w⟩ + b) − 1 )
       = ½ ⟨ ∑_i α_i d_i x_i , ∑_j α_j d_j x_j ⟩ − ∑_i α_i d_i ( ⟨x_i, ∑_j α_j d_j x_j⟩ + b ) + ∑_i α_i
       = ½ ∑_i ∑_j α_i α_j d_i d_j ⟨x_i, x_j⟩ − ∑_i ∑_j α_i α_j d_i d_j ⟨x_i, x_j⟩ − b ∑_i α_i d_i (which is 0) + ∑_i α_i
       = ∑_i α_i − ½ ∑_i ∑_j α_i α_j d_i d_j ⟨x_i, x_j⟩
82
Machine learning – Support Vector Machine
SVM training algorithm
1 Maximize the dual (e.g., using coordinate projected gradient descent):

max_α L_D(α) = ∑_i α_i − ½ ∑_i ∑_j α_i α_j d_i d_j ⟨x_i, x_j⟩
subject to α_i ≥ 0 for all i and ∑_i α_i d_i = 0.

2 Each non-zero α_i indicates that the corresponding x_i is a support vector.
3 Deduce the parameters from these support vectors (s.v.):

w = ∑_{i (s.v.)} α_i d_i x_i   and   b = d_i − ∑_{j (s.v.)} α_j d_j ⟨x_i, x_j⟩   for any s.v. i

• The dual is in practice more efficient to solve than the original problem,
and it allows identifying the support vectors.
• It only depends on the dot products ⟨x_i, x_j⟩ between all training points.
83
Machine learning – Support Vector Machine
SVM classification algorithm
• Given the parameters of the hyperplane w and b,
the SVM classifies a new sample x as

y = ⟨w, x⟩ + b ≷ 0

(same as for the perceptron).
• But (very important), this can be reformulated as

y = ∑_{i (s.v.)} α_i d_i ⟨x, x_i⟩ + b

which only depends on the dot products ⟨x, x_i⟩ between x and the
support vectors – we will return to this later.
84
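A minimal numpy sketch of this reformulated decision rule with a linear kernel (plain dot products); the array names are placeholders for quantities obtained from training:

import numpy as np

def svm_classify(x, sv_x, sv_d, sv_alpha, b):
    # y = sum over support vectors of alpha_i d_i <x, x_i>, plus b; then take the sign
    y = np.sum(sv_alpha * sv_d * sv_x.dot(x)) + b
    return np.sign(y)

# sv_x: (n_sv, dim) support vectors, sv_d: (n_sv,) labels, sv_alpha: (n_sv,) multipliers, b: scalar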
Machine learning – Support Vector Machine
Non-separable case
What if the training set is not linearly separable?
The optimization problem does not admit solutions.
Solution: allow the system to make errors ε_i.
85
Machine learning – Support Vector Machine
Non-separable case – Soft-margin SVM
• Just relax the constraints by permitting errors to some extent:

min_{w,b,ε} ½ ||w||² + C ∑_i ε_i
subject to d_i (⟨x_i, w⟩ + b) ≥ 1 − ε_i and ε_i ≥ 0, for all i.

• Quantities ε_i are called slack variables.
• The parameter C > 0 is chosen by the user and controls overfitting.
• The dual problem becomes

max_α L_D(α) = ∑_i α_i − ½ ∑_i ∑_j α_i α_j d_i d_j ⟨x_i, x_j⟩
subject to 0 ≤ α_i ≤ C for all i and ∑_i α_i d_i = 0,

and does not depend on the slack variables ε_i.
86
Machine learning – Support Vector Machine
Linear SVM – Overview
• SVM finds the separating hyperplane maximizing the margin.
• Soft-margin SVM: trade-off between large margin and errors.
• These optimal hyperplanes are only defined in terms of support vectors.
• Lagrangian formulation allows identifying the support vectors with
non-zero Lagrange multipliers αi
.
• The training and classification algorithms both depend only on dot
products between data points.
So far, the SVM is a linear separator (the best in some sense).
But, it cannot solve non-linear problems (such as xor),
as done by multi-layer neural networks.
87
Machine learning – Support Vector Machine
Non-linear SVM – Overview
What happens if the separator is non-linear?
88
Machine learning – Support Vector Machine
Non-linear SVM – Map to higher dimensions
Here: (x_1, x_2) → (x_1², x_2², √2 x_1 x_2)
The input space (original feature space) can always be mapped to some
higher-dimensional feature space where the training data set is linearly
separable, via some (non-linear) transformation x → ϕ(x).
89
Machine learning – Support Vector Machine
Non-linear SVM – Map to higher dimensions
Replace all occurrences of x by ϕ(x): for the training step

L_D(α) = ∑_i α_i − ½ ∑_i ∑_j α_i α_j d_i d_j ⟨ϕ(x_i), ϕ(x_j)⟩

and for the classification step

y = ∑_{i (s.v.)} α_i d_i ⟨ϕ(x), ϕ(x_i)⟩ + b
• May require a feature space of really high (even infinite) dimension.
• In this case, manipulating ϕ(x) might be tough, impossible, or lead to
intense computation: complexity depends on the feature space dimension.
90
Machine learning – Support Vector Machine
Non-linear SVM – Kernel trick
• SVMs do not care about feature vectors ϕ(x).
• They rely only on dot products between them. Define
K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩

• K is called a kernel function: a dot product in a higher dimensional space.
• Even if ϕ(x) is tough to manipulate, K(x, x′) can be rather simple.
• Mercer's theorem: if K is continuous and symmetric positive semi-definite
(i.e., the matrix K = (K(x_i, x_j))_{i,j} is symmetric positive semi-definite for
all finite sequences x_1, . . . , x_n), then there exists a mapping ϕ such that

K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩

• We do not even need to know ϕ: pick a continuous symmetric positive
semi-definite kernel K and use it instead of dot products.
91
Machine learning – Support Vector Machine
Non-linear SVM – Kernel trick
• Replace all occurrences of ⟨x, x′⟩ by K(x, x′): for the training step

L_D(α) = ∑_i α_i − ½ ∑_i ∑_j α_i α_j d_i d_j K(x_i, x_j)

and for the classification step

y = ∑_{i (s.v.)} α_i d_i K(x, x_i) + b
• Python implementation available in scikit-learn.
Complexity depends on the input space dimension.
92
Machine learning – Support Vector Machine
Non-linear SVM – Standard kernels
• Linear: K(x, x′) = ⟨x, x′⟩
• Polynomial: K(x, x′) = (γ ⟨x, x′⟩ + β)^p
• Gaussian: K(x, x′) = exp(−γ ||x − x′||²) −→ Radial basis function (RBF) network.
• Sigmoid: K(x, x′) = tanh(γ ⟨x, x′⟩ + β)

In practice: K is usually chosen by trial and error.
A linear SVM is an optimal perceptron, but what is a non-linear SVM? 93
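A minimal scikit-learn sketch (the library mentioned earlier) with the Gaussian (RBF) kernel on the x-or points; the values of C and gamma are arbitrary choices here:

import numpy as np
from sklearn import svm

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1, +1, +1, -1])               # x-or labels

clf = svm.SVC(kernel='rbf', C=10.0, gamma=2.0)
clf.fit(X, d)
print(clf.predict(X))                         # should separate the x-or pattern (non-linear kernel)
print(clf.support_vectors_)                   # support vectors found during training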
Machine learning – Support Vector Machine
Non-linear SVMs are shallow ANNs
Consider: K(x, x′) = tanh(γ ⟨x, x′⟩ + β)

We get:  y = ∑_{i (s.v.)} α_i d_i K(x, x_i) + b   ← SVM
         y = ∑_{i (s.v.)} α_i d_i tanh(γ ⟨x, x_i⟩ + β) + b
         y = ⟨w_2, g(W_1 x + b_1)⟩ + b_2   ← Shallow ANN

where W_1 = γ ( x_1  x_2  . . . )ᵀ (the support vectors stacked in rows), b_1 = β 1,
w_2 = ( α_1 d_1  α_2 d_2  . . . ) (Lagrange multipliers and desired classes of the
support vectors), b_2 = b, and g = tanh.

The number of support vectors is the number of hidden units.
94
Machine learning – Support Vector Machine
Difference between SVMs and ANNs
• SVM:
• based on stronger mathematical theory,
• avoids over-fitting,
• not trapped in local minima,
• works well with fewer training samples.
• But limited to binary classification
→ extensions to regression in (Vapnik et al., 1997)
and PCA in (Schölkopf et al., 1999).
• Tuning SVMs is tough:
→ selecting a specific kernel (feature space) and
the regularization parameter C is done by trial and error.
95
Machine learning – Support Vector Machine
Similarity between SVMs and ANNs
(in binary classification)
• Similar formalisms:
y = ⟨w, ϕ(x)⟩ + b ≷ 0 ?
SVM
• ϕ(x): feature vector,
• ϕ: non-linear function implicitly
defined by the kernel K,
• UAT: ϕ could be approximated by
a shallow NN.
ANN
• ϕ(x): output of the last hidden layer,
• ϕ: non-linear recursive function
learned by backpropagation,
• the output of hidden layers can be
interpreted as feature vectors.
96
Machine learning – Support Vector Machine
Similarity between SVMs and ANNs
(in binary classification)
• Both are linear separators in a suitable feature space,
• Mapping to the feature space is obtained by a non-linear transform,
• For SVM, the non-linear transform is chosen by the user.
Instead, ANNs learn it with backpropagation.
• ANNs can also do it recursively layer by layer:
→ this is called a feature hierarchy,
→ this is the foundation of deep learning.
97
Machine learning – Deep learning
What’s next?
(Source: Lucas Masuch & Vincent Lepetit) 98
Questions?
Next class: Introduction to deep learning
Slides and images courtesy of
• P. Gallinari
• V. Lepetit
• A. O. i Puig
• L. Masuch
• A. W. Moore
• C. Hazırbaş
• M. Welling
98
GayathriRHICETCSESTA
 
nncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdfnncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdf
GayathriRHICETCSESTA
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
ssuserab4f3e
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
Prakash K
 

Similar to MLIP - Chapter 2 - Preliminaries to deep learning (20)

JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
 
Perceptron
PerceptronPerceptron
Perceptron
 
10-Perceptron.pdf
10-Perceptron.pdf10-Perceptron.pdf
10-Perceptron.pdf
 
SOFT COMPUTERING TECHNICS -Unit 1
SOFT COMPUTERING TECHNICS -Unit 1SOFT COMPUTERING TECHNICS -Unit 1
SOFT COMPUTERING TECHNICS -Unit 1
 
Deep learning algorithms
Deep learning algorithmsDeep learning algorithms
Deep learning algorithms
 
Neural
NeuralNeural
Neural
 
Machine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronMachine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - Perceptron
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
 
Artificial Neural Network
Artificial Neural Network Artificial Neural Network
Artificial Neural Network
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
lecture07.ppt
lecture07.pptlecture07.ppt
lecture07.ppt
 
Neural Networks and Deep Learning: An Intro
Neural Networks and Deep Learning: An IntroNeural Networks and Deep Learning: An Intro
Neural Networks and Deep Learning: An Intro
 
Artificial neural networks (2)
Artificial neural networks (2)Artificial neural networks (2)
Artificial neural networks (2)
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
Soft Computering Technics - Unit2
Soft Computering Technics - Unit2Soft Computering Technics - Unit2
Soft Computering Technics - Unit2
 
nncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdfnncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdf
 
nncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdfnncollovcapaldo2013-131220052427-phpapp01.pdf
nncollovcapaldo2013-131220052427-phpapp01.pdf
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 

More from Charles Deledalle

MLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNsMLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNs
Charles Deledalle
 
IVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methodsIVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methods
Charles Deledalle
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
Charles Deledalle
 
IVR - Chapter 6 - Sparsity, shrinkage and wavelets
IVR - Chapter 6 - Sparsity, shrinkage and waveletsIVR - Chapter 6 - Sparsity, shrinkage and wavelets
IVR - Chapter 6 - Sparsity, shrinkage and wavelets
Charles Deledalle
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learning
Charles Deledalle
 
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
Charles Deledalle
 
IVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filtersIVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filters
Charles Deledalle
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
Charles Deledalle
 

More from Charles Deledalle (8)

MLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNsMLIP - Chapter 4 - Image classification and CNNs
MLIP - Chapter 4 - Image classification and CNNs
 
IVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methodsIVR - Chapter 4 - Variational methods
IVR - Chapter 4 - Variational methods
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
 
IVR - Chapter 6 - Sparsity, shrinkage and wavelets
IVR - Chapter 6 - Sparsity, shrinkage and waveletsIVR - Chapter 6 - Sparsity, shrinkage and wavelets
IVR - Chapter 6 - Sparsity, shrinkage and wavelets
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learning
 
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb) IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
IVR - Chapter 2 - Basics of filtering I: Spatial filters (25Mb)
 
IVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filtersIVR - Chapter 3 - Basics of filtering II: Spectral filters
IVR - Chapter 3 - Basics of filtering II: Spectral filters
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 

Recently uploaded

Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
Kamal Acharya
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 

Recently uploaded (20)

Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 

MLIP - Chapter 2 - Preliminaries to deep learning

  • 1. ECE 285 Machine Learning for Image Processing Chapter II – Preliminaries to deep learning Charles Deledalle March 11, 2019 1
  • 2. Deep learning – What is deep learning? What is deep learning? • Part of the machine learning field of learning representations of data. Exceptionally effective at learning patterns. • Utilizes learning algorithms that derive meaning out of data by using a hierarchy of multiple layers that mimic the neural networks of our brain. • If you provide the system tons of information, it begins to understand it and respond in useful ways. • Rebirth of artificial neural networks. (Source: Lucas Masuch) 2
  • 3. Machine learning – Brief history Brief history • First wave • 1943. McCulloch and Pitts proposed the first neural model, • 1958. Rosenblath introduced the Perceptron, • 1969. Minsky and Papert’s book demonstrated the limitation of single layer perceptrons, and almost the whole field went into hibernation. • Second wave • 1986. Backpropagation learning algorithm was rediscovered, • 1989. Yann LeCun (re)introduced Convolutional Neural Networks, • 1998. Breakthrough of CNNs in recognizing hand-written digits. • Third wave • 2006. Deep (neural network) learning gains popularity, • 2012. Made significant breakthroughs in many applications, • 2015. AlphaGo first program to beat a professional Go player. (Source: Jun Wang) 3
  • 4. Deep learning – Interest Google NGRAM & Google Trends (Source: Lucas Masuch) 4
  • 5. Machine learning – Timeline Timeline of (deep) learning 1974 Backpropagation 1995 SVM reigns Convolution Neural Networks for Handwritten Recognition 1998 2006 Restricted Boltzmann Machine 1958 Perceptron 1969 Perceptron criticized Google Brain Project on 16k Cores 2012 2012 AlexNet wins ImageNet Perceptrons book ~1980 Multilayer network Support Vector Machines feature 1 feature2 Support Vectors Maximal Margin HyperplaneX1 X2 X3 X4 W11 W12 W13 W14 f(x) Input Weights Sum Activation Function arti cial Neuron awkward silence (AI winter) (Source: Lucas Masuch & Vincent Lepetit) 5
  • 7. Machine learning – ANN - Backpropagation Perceptron 1958 Perceptron 1969 Perceptron criticized Perceptrons book X1 X2 X3 X4 W11 W12 W13 W14 f(x) Input Weights Sum Activation Function arti cial Neuron (Source: Lucas Masuch & Vincent Lepetit) 6
  • 8. Machine learning – Perceptron Perceptron (Frank Rosenblatt, 1958) First binary classifier based on supervised learning (discrimination). Foundation of modern artificial neural networks. At that time: technological, scientific and philosophical challenges. 7
  • 9. Machine learning – Perceptron – Representation Representation of the Perceptron Parameters of the perceptron • wk: synaptic weights • b: bias ←− real parameters to be estimated. Training = adjusting the weights and biases 8
  • 10. Machine learning – Perceptron – Inspiration The origin of the Perceptron Takes inspiration from the visual system known for its ability to learn patterns. • When a neuron receives a stimulus with high enough voltage, it emits an action potential (aka, nerve impulse or spike). It is said to fire. • The perceptron mimics this activation effect: it fires only when i wixi + b > 0 y = sign(w0x0 + w1x1 + w2x2 + w3x3 + b) f(x;w) = +1 for the first class −1 for the second class 9
  • 11. Machine learning – Perceptron – Principle 1 Data are represented as vectors: 2 Collect training data with positive and negative examples: (Source: Vincent Lepetit) 10
  • 12. Machine learning – Perceptron – Principle Dot product: w, x = d i=1 wixi = wT x 3 Training: find w and b so that: • w, x + b is positive for positive samples x, • w, x + b is negative for negative samples x. (Source: Vincent Lepetit) 10
  • 13. Machine learning – Perceptron – Principle Dot product: w, x = d i=1 wixi = wT x 3 Training: find w and b so that: • w, x + b is positive for positive samples x, • w, x + b is negative for negative samples x. The equation w, x + b = 0 defines a hyperplane. The hyperplane acts as a linear separator. w is a normal vector to the hyperplane. (Source: Vincent Lepetit) 10
  • 14. Machine learning – Perceptron – Principle 4 Testing: the perceptron can now classify new examples. (Source: Vincent Lepetit) 10
  • 15. Machine learning – Perceptron – Principle 4 Testing: the perceptron can now classify new examples. • A new example x is classified positive if w, x + b is positive, (Source: Vincent Lepetit) 10
  • 16. Machine learning – Perceptron – Principle 4 Testing: the perceptron can now classify new examples. • A new example x is classified positive if w, x + b is positive, • and negative if w, x + b is negative. (Source: Vincent Lepetit) 10
  • 17. Machine learning – Perceptron – Principle 4 Testing: the perceptron can now classify new examples. • A new example x is classified positive if $\langle w, x\rangle + b$ is positive, • and negative if $\langle w, x\rangle + b$ is negative. (Signed) distance of x to the hyperplane: $r = \frac{\langle w, x\rangle + b}{\|w\|}$ (Source: Vincent Lepetit) 10
  • 18. Machine learning – Perceptron – Representation Alternative representation Use the zero-index to encode the bias as a synaptic weight. Simplifies algorithms as all parameters can now be treated in the same way. 11
  • 19. Machine learning – Perceptron – Training Perceptron algorithm Goal: find the vector of weights w from a labeled training dataset T. How: minimize classification errors $\min_w E(w) = -\sum_{(x,d)\in T \text{ s.t. } y \neq d} d\,\langle w, x\rangle$ • penalizes only misclassified samples ($y \neq d$), • zero if all samples are correctly classified, • proportional to the absolute distance to the hyperplane ($d\,\langle w, x\rangle < 0$ for misclassified samples) 12
  • 20. Machine learning – Perceptron – Training Perceptron algorithm Algorithm: • Initialize w randomly • Repeat until convergence • For all (x, d) ∈ T • Compute: $y = \mathrm{sign}\,\langle w, x\rangle$ • If $y \neq d$, update: $w \leftarrow w + d\,x$ • Converges to some solution if the training data are linearly separable, • Corresponds to stochastic gradient descent for E(w) (see later), • But may pick any of many solutions of varying quality ⇒ poor generalization error. (A minimal NumPy sketch follows below.) 13
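To make the update rule above concrete, here is a minimal NumPy sketch of the perceptron loop; it is not from the slides, and the toy data, the iteration cap and the bias-as-extra-weight trick are assumptions for illustration.

import numpy as np

# Toy linearly separable data: rows of X are samples, labels d in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([+1, +1, -1, -1])

# Append a constant 1 so the bias b becomes one more weight (cf. slide 18)
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

rng = np.random.default_rng(0)
w = rng.standard_normal(Xb.shape[1])      # random initialization

for epoch in range(100):                  # "repeat until convergence" (capped here)
    errors = 0
    for x, t in zip(Xb, d):
        y = np.sign(w @ x)
        if y != t:                        # update only misclassified samples
            w = w + t * x
            errors += 1
    if errors == 0:                       # all samples correctly classified
        break

print(w, np.sign(Xb @ w))                 # learned weights and predictions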
  • 21. Machine learning – Perceptron – Training Variant: ADALINE (Adaptive Linear Neuron) algorithm (Widrow & Hoff, 1960). Loss: minimize instead the least-square error of the prediction before the sign is applied: $\min_w E(w) = \sum_{(x,d)\in T} (\langle w, x\rangle - d)^2$ Algorithm: • Initialize w randomly • Repeat until convergence • For all (x, d) ∈ T • Compute: $y = \langle w, x\rangle$ • Update: $w \leftarrow w + \gamma (d - y)\,x$, with γ > 0 Also corresponds to stochastic gradient descent for E(w) (see later). (A NumPy sketch follows below.) 14
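For comparison with the perceptron, here is a minimal NumPy sketch of the ADALINE update; the toy data, the learning rate and the iteration count are assumptions for illustration.

import numpy as np

# Same toy data and bias-as-extra-weight trick as in the perceptron sketch
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

rng = np.random.default_rng(0)
w = rng.standard_normal(Xb.shape[1])
gamma = 0.05                              # learning rate (assumed value)

for epoch in range(200):
    for x, t in zip(Xb, d):
        y = w @ x                         # linear prediction, no sign
        w = w + gamma * (t - y) * x       # Widrow-Hoff update

print(w, np.sign(Xb @ w))                 # apply sign only at test time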
  • 22. Machine learning – Perceptron – Training Perceptron vs ADALINE algorithm Activation function = sign 15
  • 23. Machine learning – Perceptron – Perceptrons book Perceptrons book (Minsky and Papert, 1969) A perceptron can only classify data points that are linearly separable: [figure: a configuration of ±1-labeled points that is not linearly separable] Seen by many as a justification to stop research on perceptrons. (Source: Vincent Lepetit) 16
  • 25. Machine learning – Artificial neural network Artificial neural network (Source: Lucas Masuch & Vincent Lepetit) 17
  • 26. Machine learning – Artificial neural network Artificial neural network • Supervised learning method initially inspired by the behavior of the human brain. • Consists of the inter-connection of several small units (just like in the human brain). • Essentially numerical but can handle classification and discrete inputs with appropriate coding. • Introduced in the late 50s, very popular in the 90s, reappeared in the 2010s with deep learning. • Also referred to as Multi-Layer Perceptron (MLP). • Historically used after feature extraction. 18
  • 27. Machine learning – Artificial neural network Artificial neuron (McCulloch & Pitts, 1943) • An artificial neuron contains several incoming weighted connections, an outgoing connection and has a nonlinear activation function g. • Neurons are trained to filter and detect specific features or patterns (e.g. edge, nose) by receiving weighted input, transforming it with the activation function and passing it to the outgoing connections. • Unlike the perceptron, can be used for regression (with proper choice of g). 19
  • 28. Machine learning – ANN Artificial neural network / Multilayer perceptron / NeuralNet • Inter-connection of several artificial neurons. • Each level in the graph is called a layer: • Input layer, • Hidden layer(s), • Output layer. • Each neuron (also called node or unit) in the hidden layers acts as a classifier. • Feedforward NN (no cycle) • first and simplest type of NN, • information moves in one direction. • Recurrent NN (with cycle) • use for time sequences, • such as speech-recognition. Quiz: how many layers does this network have? 20
  • 29. Machine learning – ANN Artificial neural network / Multilayer perceptron / NeuralNet Hidden units: $h_i = g_1\big(\sum_j w^1_{ij} x_j + b^1_i\big)$ for $i = 1, \dots, 4$; outputs: $y_i = g_2\big(\sum_j w^2_{ij} h_j + b^2_i\big)$ for $i = 1, 2$. $w^k_{ij}$: synaptic weight between previous node j and next node i at layer k. $g_k$: activation functions applied to each coefficient of their input vector. 21
  • 30. Machine learning – ANN Artificial neural network / Multilayer perceptron / NeuralNet In matrix-vector form: $h = g_1(W_1 x + b_1)$ and $y = g_2(W_2 h + b_2)$. $w^k_{ij}$: synaptic weight between previous node j and next node i at layer k. $g_k$: activation functions applied to each coefficient of their input vector. The matrices $W_k$ and biases $b_k$ are learned from labeled training data. (A NumPy sketch of this forward pass follows below.) 21
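To make the two-layer forward pass above concrete, here is a minimal NumPy sketch; the 3-4-2 layer sizes follow the slide's figure, while the tanh hidden activation, the linear output and the random weights are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the slide's figure: 3 inputs, 4 hidden units, 2 outputs
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

g1 = np.tanh                  # hidden activation (assumed)
g2 = lambda a: a              # linear output (assumed)

x = rng.standard_normal((3, 1))   # one input vector as a column

h = g1(W1 @ x + b1)           # h = g1(W1 x + b1)
y = g2(W2 @ h + b2)           # y = g2(W2 h + b2)
print(y.ravel())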
  • 31. Machine learning – ANN Artificial neural network / Multilayer perceptron It can have 1 hidden layer only (shallow network), It can have more than 1 hidden layer (deep network), each layer may have a different size, and hidden and output layers often have different activation functions. 22
  • 32. Machine learning – ANN Artificial neural network / Multilayer perceptron • As for the perceptron, the biases can be integrated into the weights: $W_k h_{k-1} + b_k = \tilde W_k \tilde h_{k-1}$ with $\tilde W_k = (b_k \;\, W_k)$ and $\tilde h_{k-1} = (1 \;\, h_{k-1}^T)^T$ • A neural network with L layers is a function of x parameterized by $\tilde W$: $y = f(x; \tilde W)$ where $\tilde W = (\tilde W_1, \tilde W_2, \dots, \tilde W_L)$ • It can be defined recursively as $y = f(x; \tilde W) = h_L$, with $h_k = g_k(\tilde W_k \tilde h_{k-1})$ and $h_0 = x$ • For simplicity, $\tilde W$ will be denoted W (when there is no possible confusion). 23
  • 33. Machine learning – ANN – Activation functions Activation functions Linear units: $g(a) = a$. Then $y = W_L h_{L-1} + b_L$ and $h_{L-1} = W_{L-1} h_{L-2} + b_{L-1}$, so $y = W_L W_{L-1} h_{L-2} + W_L b_{L-1} + b_L$ and, unrolling all layers, $y = W_L \cdots W_1 x + \sum_{k=1}^{L-1} W_L \cdots W_{k+1} b_k + b_L$. We can always find an equivalent network without hidden units, because a composition of affine functions is affine. Sometimes used for linear dimensionality reduction (similarly to PCA). In general, non-linearity is needed to learn complex (non-linear) representations of data, otherwise the NN would be just a linear function; otherwise, we are back to the problem of non-linearly separable datasets. 24
  • 34. Machine learning – ANN – Activation functions Activation functions Threshold units: for instance the sign function $g(a) = -1$ if $a < 0$, $+1$ otherwise, or the Heaviside (aka, step) activation function $g(a) = 0$ if $a < 0$, $1$ otherwise. Discontinuities in the hidden layers make the optimization really difficult. We prefer functions that are continuous and differentiable. 25
  • 35. Machine learning – ANN – Activation functions Activation functions Sigmoidal units: for instance the hyperbolic tangent function $g(a) = \tanh a = \frac{e^a - e^{-a}}{e^a + e^{-a}} \in [-1, 1]$ or the logistic sigmoid function $g(a) = \frac{1}{1 + e^{-a}} \in [0, 1]$ • In fact equivalent up to linear transformations: $\tanh(a/2) = 2\,\mathrm{logistic}(a) - 1$ • Differentiable approximations of the sign and step functions, respectively. • Act as threshold units for large values of |a| and as linear units for small values. (A quick numerical check of the identity follows below.) 26
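A quick NumPy check of the tanh/logistic relation quoted above; purely illustrative, the grid of evaluation points is arbitrary.

import numpy as np

a = np.linspace(-5, 5, 11)
logistic = 1 / (1 + np.exp(-a))

# tanh(a/2) and 2*logistic(a) - 1 agree to machine precision
print(np.allclose(np.tanh(a / 2), 2 * logistic - 1))   # True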
  • 36. Machine learning – ANN Sigmoidal units: logistic activation functions are used in binary classification (class C1 vs C2) as they can be interpreted as posterior probabilities: $y = P(C_1|x)$ and $1 - y = P(C_2|x)$. The architecture of the network defines the shape of the separator, i.e. of the set $\{x : P(C_1|x) = P(C_2|x)\}$. [Figure: decision maps for networks of increasing complexity.] Complexity/capacity of the network ⇒ trade-off between generalization and overfitting. 27
  • 37. Machine learning – ANN – Activation functions Activation functions "Modern" units: $g(a) = \max(a, 0)$ (ReLU) or $g(a) = \log(1 + e^a)$ (Softplus). Most neural networks nowadays use ReLU (rectified linear unit) for the hidden layers, since it trains much faster, is more expressive than the logistic function, and prevents the vanishing gradient problem (see later). (Source: Lucas Masuch) 28
  • 38. Machine learning – ANN Neural networks solve non-linearly separable problems (Source: Vincent Lepetit) 29
  • 39. Machine learning – UAT Universal Approximation Theorem (Hornik et al, 1989; Cybenko, 1989) Any continuous function can be approximated by a feedforward shallow network (i.e., with 1-hidden layer only) with a sufficient number of neurons in the hidden layer. • The theorem does not say how large the network needs to be. • No guarantee that the training algorithm will be able to train the network. 30
  • 40. Tasks, architectures and loss functions
  • 41. Tasks, architectures and loss functions Approximation – Least square regression • Goal: Predict a real multivariate function. • How: estimate the coefficients W of y = f(x; W ) from labeled training examples where labels are real vectors: • Typical architecture: • Hidden layer: ReLU(a) = max(a, 0) • Linear output: g(a) = a 31
  • 42. Tasks, architectures and loss functions Approximation – Least square regression • Loss: As for polynomial curve fitting, it is standard to consider the sum of square errors (assumption of Gaussian distributed errors) $E(W) = \sum_{i=1}^N \|y^i - d^i\|_2^2 = \sum_{i=1}^N \|f(x^i; W) - d^i\|_2^2$ and look for $W^\star$ such that $\nabla E(W^\star) = 0$. Recall: SSE ≡ SSD ≡ MSE • Solution: Provided the network has enough flexibility and the size of the training set grows to infinity, $y = f(x; W) = \mathbb{E}[d|x] = \int d\, p(d|x)\,\mathrm{d}d$ (posterior mean) 32
  • 43. Tasks, architectures and loss functions Binary classification – Logistic regression • Goal: Classify object x into class C1 or C2. • How: estimate the coefficients W of a real function $y = f(x; W) \in [0, 1]$ from training examples with labels 1 (for class C1) and 0 (otherwise): $T = \{(x^i, d^i)\}_{i=1..N}$ • Typical architecture: • Hidden layer: $\mathrm{ReLU}(a) = \max(a, 0)$ • Output layer: $\mathrm{logistic}(a) = \frac{1}{1 + e^{-a}}$ 33
  • 44. Tasks, architectures and loss functions Binary classification – Logistic regression • Loss: it is standard to consider the cross-entropy for two classes (assumption of Bernoulli distributed data) $E(W) = -\sum_{i=1}^N \big[ d^i \log y^i + (1 - d^i) \log(1 - y^i) \big]$ with $y^i = f(x^i; W)$, and look for $W^\star$ such that $\nabla E(W^\star) = 0$. • Solution: Provided the network has enough flexibility and the size of the training set grows to infinity, $y = f(x; W) = P(C_1|x)$ (posterior probability) 34
  • 45. Tasks, architectures and loss functions Multiclass classification – Multivariate logistic regression (aka, multinomial classification) • Goal: Classify an object x into one among K classes C1, . . . , CK . • How: estimate the coefficients W of a multivariate function y = f(x; W ) ∈ [0, 1]K from training examples T = {(xi , di )} where di is a 1-of-K (one-hot) code • Class 1: di = (1, 0, . . . , 0)T if xi ∈ C1 • Class 2: di = (0, 1, . . . , 0)T if xi ∈ C2 • . . . • Class K: di = (0, 0, . . . , 1)T if xi ∈ CK • Remark: do not use the class index k directly as a scalar label. 35
  • 46. Tasks, architectures and loss functions Multiclass classification – Multivariate logistic regression • Typical architecture: • Hidden layer: $\mathrm{ReLU}(a) = \max(a, 0)$ • Output layer: $\mathrm{softmax}(a)_k = \frac{\exp(a_k)}{\sum_{l=1}^K \exp(a_l)}$ Softmax guarantees the outputs $y_k$ to be positive and to sum to 1. Generalization of the logistic sigmoid activation function. Smooth version of the winner-takes-all activation model (maxout), in which the largest entry gets +1 and the others get 0. (A NumPy sketch follows below.) 36
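A minimal NumPy version of the softmax above; the max-subtraction trick is a standard numerical stabilization, not something stated on the slides.

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # subtracting the max does not change the result
    return e / e.sum()

a = np.array([2.0, 1.0, -1.0])
y = softmax(a)
print(y, y.sum())               # positive entries summing to 1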
  • 47. Tasks, architectures and loss functions Multiclass classification – Multivariate logistic regression • Loss: it is standard to consider the cross-entropy for K classes (assumption of multinomially distributed data) $E(W) = -\sum_{i=1}^N \sum_{k=1}^K d^i_k \log y^i_k$ with $y^i = f(x^i; W)$, and look for $W^\star$ such that $\nabla E(W^\star) = 0$. • Solution: Provided the network has enough flexibility and the size of the training set grows to infinity, $y_k = f_k(x; W) = P(C_k|x)$ (posterior probability) 37
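For illustration, the K-class cross-entropy of a small batch of softmax outputs against one-hot targets could be computed as below; the array shapes and the epsilon guard are assumptions, not from the slides.

import numpy as np

# y: (N, K) predicted class probabilities (rows sum to 1); d: (N, K) one-hot targets
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
d = np.array([[1, 0, 0],
              [0, 0, 1]])

eps = 1e-12                          # avoid log(0)
E = -np.sum(d * np.log(y + eps))     # cross-entropy summed over the batch
print(E)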
  • 48. Machine learning – ANN Multi-label classification: (aka, multi-output classification) • Goal: Classify object x into zero or several classes C1, . . . , CK . The classes need to be non-mutually exclusive. • How: Combination of binary-classification networks. • Typical architecture: • Remark: For categorical inputs, use also 1-of-K codes. 38
  • 50. Machine learning – ANN - Backpropagation Learning with backpropagation (Source: Lucas Masuch & Vincent Lepetit) 39
  • 51. Machine learning – ANN – Learning Training process • To train a neural network over a large set of labeled data, you must continuously compute the difference between the network's predicted output and the actual output. • This difference is measured by the loss E, and the process for training a net is known as backpropagation, or backprop. • During backprop, weights and biases are tweaked slightly until the lowest possible loss is achieved. −→ Optimization: look for $\nabla E = 0$ • The gradient is an important aspect of this process, as a measure of how much the loss changes with respect to a change in a weight or bias value. (Source: Caner Hazırbaş) 40
  • 52. Machine learning – ANN – Learning Training process (Source: Lucas Masuch) 41
  • 53. Machine learning – ANN – Optimization Objective: $\min_W E(W)$ ⇒ $\nabla E(W) = \Big(\frac{\partial E(W)}{\partial W_1}, \dots, \frac{\partial E(W)}{\partial W_L}\Big)^T = 0$ Loss functions: recall that classical loss functions are • Square error (for regression: $d_k \in \mathbb{R}$, $y_k \in \mathbb{R}$): $E(W) = \frac{1}{2}\sum_{(x,d)\in T} \|y - d\|_2^2 = \frac{1}{2}\sum_{(x,d)\in T}\sum_k (y_k - d_k)^2$ • Cross-entropy (for multi-class classification: $d_k \in \{0,1\}$, $y_k \in [0,1]$): $E(W) = -\sum_{(x,d)\in T}\sum_k d_k \log y_k$ Solution: no closed-form solution ⇒ use gradient descent. What is it? 42
  • 54. Machine learning – ANN – Optimization – Gradient descent An iterative algorithm trying to find a minimum of a real function. Gradient descent • Let F be a real function, differentiable and lower bounded with an L-Lipschitz gradient (see next slide). Then, whatever the initialization $x^0$, if $0 < \gamma < 2/L$, the sequence $x^{t+1} = x^t - \gamma \nabla F(x^t)$ converges to a stationary point $x^\star$ (i.e., it cancels the gradient): $\nabla F(x^\star) = 0$. • The parameter γ is called the step size (or learning rate in the ML field). • A too small step size γ leads to slow convergence. 43
  • 55. Machine learning – ANN – Optimization – Gradient descent Lipschitz gradient • A differentiable function F has an L-Lipschitz gradient if $\|\nabla F(x_1) - \nabla F(x_2)\|_2 \leq L \|x_1 - x_2\|_2$ for all $x_1, x_2$. • The mapping $x \mapsto \nabla F(x)$ is necessarily continuous. • If F is twice differentiable with bounded Hessian, then $L = \sup_x \|\nabla^2 F(x)\|_2$, the norm of the Hessian matrix of F, where for a matrix A the 2-norm $\|A\|_2$ is its maximal singular value. 44
  • 56. Machine learning – ANN – Optimization – Gradient descent Gradient descent and Lipschitz gradient Example (1d quadratic loss) • Consider: $F(x) = \frac{1}{2}(x - y)^2$ • We have: $F'(x) = x - y$ • And: $F''(x) = 1$ • Then: $L = \sup_x |F''(x)| = 1$ • Thus: the admissible range for γ is $0 < \gamma < 2/L = 2$ • And: $x^{t+1} = x^t - \gamma(x^t - y)$ converges, whatever $x^0$, towards $x^\star$ such that $F'(x^\star) = 0$, i.e. $x^\star = y$. Question: what is the best value for γ? (See the numerical sketch below.) 45
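A small numerical sketch of gradient descent on this 1-d quadratic for a few step sizes; the values of y, x0 and γ are assumptions. Since F''(x) = 1, γ = 1 reaches the minimizer in a single step, while γ > 2 diverges.

import numpy as np

y = 3.0                       # minimizer of F(x) = (x - y)^2 / 2
x0 = -5.0

def gradient_descent(gamma, steps=20):
    x = x0
    for _ in range(steps):
        x = x - gamma * (x - y)           # F'(x) = x - y
    return x

for gamma in [0.1, 1.0, 1.9, 2.5]:
    print(f"gamma={gamma}: x_20 = {gradient_descent(gamma):.4g}")
# gamma in (0, 2) approaches y = 3 (fastest at gamma = 1); gamma = 2.5 > 2/L diverges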
  • 57. Machine learning – ANN – Optimization – Gradient descent Gradient descent and Lipschitz gradient Example (1d quartic loss) • Consider: $F(x) = \frac{1}{4}(x - y)^4$ • We have: $F'(x) = (x - y)^3$ • And: $F''(x) = 3(x - y)^2$ • Then: $L = \sup_x |F''(x)| = \infty$ • Thus: there is no γ satisfying $0 < \gamma < 2/L$ • And: $x^{t+1} = x^t - \gamma(x^t - y)^3$, for γ > 0, may diverge, oscillate forever, or converge to the solution y. Convergence can be obtained for some specific choices of γ and initialization $x^0$. 46
  • 58. Machine learning – ANN – Optimization – Gradient descent These two curves cross at $x^\star$ such that $\nabla F(x^\star) = 0$ 47
  • 59. Machine learning – ANN – Optimization – Gradient descent Here γ is small: slow convergence 47
  • 60. Machine learning – ANN – Optimization – Gradient descent γ a bit larger: faster convergence 47
  • 61. Machine learning – ANN – Optimization – Gradient descent γ ≈ 1/L even larger: around fastest convergence 47
  • 62. Machine learning – ANN – Optimization – Gradient descent γ a bit too large: convergence slows down 47
  • 63. Machine learning – ANN – Optimization – Gradient descent γ too large: convergence too slow again 47
  • 64. Machine learning – ANN – Optimization – Gradient descent γ > 2/L: divergence 48
  • 65. Machine learning – ANN – Optimization – Gradient descent Gradient descent for convex functions • If moreover F is convex, i.e. $F(\lambda x_1 + (1-\lambda) x_2) \leq \lambda F(x_1) + (1-\lambda) F(x_2)$ for all $x_1, x_2$ and $\lambda \in (0, 1)$, then gradient descent converges towards a global minimum $x^\star \in \arg\min_x F(x)$. • For $0 < \gamma < 2/L$, the sequence $|F(x^k) - F(x^\star)|$ decays in O(1/k). • NB: all stationary points are global minima (not necessarily unique). 49
  • 66. Machine learning – ANN – Optimization – Gradient descent [Figure: gradient descent trajectories illustrated in one dimension and in two dimensions.] 50
  • 67. Machine learning – ANN – Optimization – Gradient descent Influence of the step parameter (learning rate) 51
  • 68. Machine learning – ANN – Optimization Let's start with a single artificial neuron Example ((Batch) Perceptron algorithm) • Model: $y = \mathrm{sign}\,\langle w, x\rangle$ • Loss: $E(w) = -\sum_{(x,d)\in T \text{ s.t. } y \neq d} d\,\langle w, x\rangle$ • Gradient: $\nabla E(w) = -\sum_{(x,d)\in T \text{ s.t. } y \neq d} d\,x$ • Gradient descent: $w^{t+1} \leftarrow w^t + \gamma \sum_{(x,d)\in T \text{ s.t. } y^t \neq d} d\,x$ Convex or non-convex? 52
  • 69. Machine learning – ANN – Optimization Let's start with a single artificial neuron Example ((Batch) ADALINE) • Model: $y = \langle w, x\rangle$ • Loss: $E(w) = \frac{1}{2}\sum_{(x,d)\in T} (\langle w, x\rangle - d)^2 = \frac{1}{2}\sum_{(x,d)\in T} \big( d^2 + w^T x x^T w - 2 d\, w^T x \big)$ • Gradient: $\nabla E(w) = \sum_{(x,d)\in T} x\,(x^T w - d) = \sum_{(x,d)\in T} x\,(y - d)$ • Gradient descent: $w^{t+1} \leftarrow w^t + \gamma \sum_{(x,d)\in T} (d - y^t)\,x$ Convex or non-convex? 53
  • 70. Machine learning – ANN – Optimization (same example, continued) If there are enough training samples: $\lim_{t\to\infty} w^t = \big( \sum_{(x,d)\in T} x x^T \big)^{-1} \sum_{(x,d)\in T} d\,x$, where $\sum_{(x,d)\in T} x x^T$ is the Hessian of E. (A NumPy check follows below.) 53
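A small NumPy check that batch gradient descent on the ADALINE loss approaches the closed-form limit above; the random toy data, the step size and the iteration count are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))              # 50 samples x (rows), 3 features
d = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # arbitrary +/-1 labels

# Closed-form limit: (sum_x x x^T)^{-1} sum_x d x
w_star = np.linalg.solve(X.T @ X, X.T @ d)

# Batch gradient descent on E(w) = 1/2 sum (<w,x> - d)^2
w = np.zeros(3)
gamma = 0.01                                  # small enough for this Hessian
for t in range(2000):
    y = X @ w
    w = w + gamma * X.T @ (d - y)

print(np.allclose(w, w_star, atol=1e-6))      # close to the closed-form solution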
  • 71. Machine learning – ANN – Optimization Back to our optimization problem In our case $W \mapsto E(W)$ is non-convex ⇒ no guarantee of convergence. Convergence will depend on • the initialization, • the step size γ. Because of this: • Normalizing each data point x to the range [−1, +1] is important to control the Hessian → the stability and the speed of the algorithm, • The activation functions and the loss should be chosen to have a second derivative smaller than 1 (when combined), • The initialization should be random with a well-chosen variance (we will come back to this later), • If all of these are satisfied, we can generally choose γ ∈ [.001, 1]. 54
  • 72. Machine learning – ANN – Optimization Back to our optimization problem In our case $W \mapsto E(W)$ is non-convex ⇒ no guarantee of convergence. Even so, the limit solution depends on the initialization and the step size γ. Nevertheless, really good minima or saddle points are reached in practice by $W^{t+1} \leftarrow W^t - \gamma \nabla E(W^t)$, γ > 0. Gradient descent can be expressed coordinate by coordinate as $w^{t+1}_{i,j} \leftarrow w^t_{i,j} - \gamma\,\frac{\partial E(W^t)}{\partial w_{i,j}}$ for all weights $w_{i,j}$ linking a node j to a node i in the next layer. ⇒ The algorithm that computes $\frac{\partial E(W)}{\partial w_{i,j}}$ for ANNs is called backpropagation. 55
  • 73. Machine learning – ANN – Optimization Backpropagation: computation of $\frac{\partial E(W)}{\partial w_{i,j}}$ Feedforward least-square regression context • Model: feed-forward neural network (for simplicity, without bias). • Loss function: $E(W) = \frac{1}{2}\sum_{(x,d)\in T}\sum_k (y_k - d_k)^2$. We write $E(W) = \sum_{(x,d)\in T}\sum_k e_k$ with $e_k = \frac{1}{2}(y_k - d_k)^2$. Apply linearity: $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_k \frac{\partial e_k}{\partial w_{i,j}}$ 56
  • 74. Machine learning – ANN – Optimization 1. Case where $w_{i,j}$ is a synaptic weight of the output layer • j: neuron in the previous hidden layer • $h_j$: response of hidden neuron j • $w_{i,j}$: synaptic weight between j and i • $y_i$: response of output neuron i, with $y_i = g(a_i)$ and $a_i = \sum_j w_{i,j} h_j$. Apply the chain rule: $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_k \frac{\partial e_k}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_k \frac{\partial e_k}{\partial y_i}\,\frac{\partial y_i}{\partial a_i}\,\frac{\partial a_i}{\partial w_{i,j}}$ 57
  • 75. Machine learning – ANN – Optimization 1. Case where $w_{i,j}$ is a synaptic weight of the output layer: $e_k = \frac{1}{2}(y_k - d_k)^2 \Rightarrow \frac{\partial e_k}{\partial y_i} = y_i - d_i$ if $k = i$ and $0$ otherwise; $y_i = g(a_i) \Rightarrow \frac{\partial y_i}{\partial a_i} = g'(a_i)$; $a_i = \sum_j w_{i,j} h_j \Rightarrow \frac{\partial a_i}{\partial w_{i,j}} = h_j$. Hence $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T} (y_i - d_i)\,g'(a_i)\,h_j = \sum_{(x,d)\in T} \delta_i h_j$, where $\delta_i = \sum_k \frac{\partial e_k}{\partial a_i} = (y_i - d_i)\,g'(a_i)$. 58
  • 76. Machine learning – ANN – Optimization 2. Case where $w_{i,j}$ is a synaptic weight of a hidden layer • j: neuron in the previous layer • $h_j$: response of neuron j • $w_{i,j}$: synaptic weight between j and i • $h_i$: response of hidden neuron i, with $h_i = g(a_i)$ and $a_i = \sum_j w_{i,j} h_j$. Apply the chain rule, summing over the neurons l of the next layer: $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_k \frac{\partial e_k}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_k \frac{\partial e_k}{\partial h_i}\,\frac{\partial h_i}{\partial a_i}\,\frac{\partial a_i}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_l \Big(\sum_k \frac{\partial e_k}{\partial a_l}\Big)\frac{\partial a_l}{\partial h_i}\,\frac{\partial h_i}{\partial a_i}\,\frac{\partial a_i}{\partial w_{i,j}} = \sum_{(x,d)\in T}\sum_l \delta_l\,\frac{\partial a_l}{\partial h_i}\,\frac{\partial h_i}{\partial a_i}\,\frac{\partial a_i}{\partial w_{i,j}}$ 59
  • 77. Machine learning – ANN – Optimization 2. Case where $w_{i,j}$ is a synaptic weight of a hidden layer: $a_l = \sum_i w_{l,i} h_i \Rightarrow \frac{\partial a_l}{\partial h_i} = w_{l,i}$; $h_i = g(a_i) \Rightarrow \frac{\partial h_i}{\partial a_i} = g'(a_i)$; $a_i = \sum_j w_{i,j} h_j \Rightarrow \frac{\partial a_i}{\partial w_{i,j}} = h_j$. Hence $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T} \Big(\sum_l w_{l,i}\delta_l\Big) g'(a_i)\, h_j = \sum_{(x,d)\in T} \delta_i h_j$, with $\delta_i = g'(a_i) \sum_l w_{l,i}\delta_l$. 60
  • 78. Machine learning – ANN – Optimization Backpropagation algorithm (Werbos, 1974 & Rumelhart, Hinton and Williams, 1986) $\frac{\partial E(W)}{\partial w_{i,j}} = \sum_{(x,d)\in T} \delta_i h_j$, where $h_j = x_j$ if j is an input node, and $\delta_i = g'(a_i) \times (y_i - d_i)$ if i is an output node, $\delta_i = g'(a_i) \times \sum_l w_{l,i}\delta_l$ otherwise. For each input x and desired output d: • Forward step: compute the responses ($h_j$, $a_i$ and $y_i$) of all neurons, starting from the first hidden layer and moving towards the output one. • Backward step: retropropagate the errors ($\delta_i$) from the output layer back to the first layer. Update $w_{i,j} \leftarrow w_{i,j} - \gamma\,\delta_i h_j$, and repeat everything until convergence. 61
  • 79. Machine learning – ANN – Optimization Backpropagation algorithm in matrix-vector form It is easier to use matrix-vector notations for each layer k: $\nabla_{W_k} E(W) = \delta_k h_{k-1}^T$ with $h_0 = x$, and $\delta_k = g'(a_k) \odot (y - d)$ if k is the output layer, $\delta_k = g'(a_k) \odot (W_{k+1}^T \delta_{k+1})$ otherwise ($\odot$ denotes the element-wise product). • x: matrix with all training input vectors in columns, • d: matrix with the corresponding desired target vectors in columns, • y: matrix with all predictions in columns, • $a_k = W_k h_{k-1}$: matrix with all weighted sums in columns, • $h_k = g(a_k)$: matrix with all hidden outputs in columns, • $W_k$: matrix of weights at layer k. 62
  • 80–97. Machine learning – ANN – Optimization Backpropagation algorithm [Animated slides: forward phase (slides 80–87), error evaluation (slide 88), backward phase (slides 89–97).] 63
  • 98. Machine learning – ANN – Optimization Case where g is the logistic sigmoid function • Recall that the logistic sigmoid function is given by $g(a) = \frac{1}{1 + e^{-a}} = \frac{1}{u(a)}$ with $u(a) = 1 + e^{-a}$. • We have $u'(a) = -e^{-a}$. • Then we get $g'(a) = -\frac{u'(a)}{u(a)^2} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1 + e^{-a}}{(1 + e^{-a})^2} - \frac{1}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}}\Big(1 - \frac{1}{1 + e^{-a}}\Big) = g(a)\,(1 - g(a))$. (A quick finite-difference check follows below.) 64
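A quick finite-difference check of the identity $g'(a) = g(a)(1 - g(a))$ derived above; purely illustrative, the evaluation grid and the step size eps are arbitrary.

import numpy as np

g = lambda a: 1 / (1 + np.exp(-a))

a = np.linspace(-4, 4, 9)
eps = 1e-6
numerical = (g(a + eps) - g(a - eps)) / (2 * eps)     # central difference
analytical = g(a) * (1 - g(a))
print(np.allclose(numerical, analytical, atol=1e-8))  # True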
  • 99. Machine learning – ANN – Optimization What if g is non-differentiable? For instance $g(a) = \mathrm{ReLU}(a) = \max(a, 0)$. • ReLU is continuous and almost everywhere differentiable: $\frac{\partial \max(a, 0)}{\partial a} = 1$ if $a > 0$, $0$ if $a < 0$, undefined at $a = 0$. • Gradient descent can handle this case with a simple modification: take $\frac{\partial \max(a, 0)}{\partial a} = 1$ if $a > 0$ and $0$ if $a \leq 0$. • When g is convex, this is called a sub-gradient, and gradient descent is called sub-gradient descent. 65
  • 100. Machine learning – ANN – Optimization Let's try to learn the x-or function (setup, with a ReLU hidden layer and a linear output):
import numpy as np
# Training set
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
d = np.array([0, +1, +1, 0]).T
# Initialization for a 2-layer feedforward network
b1 = np.random.rand(2, 1)
W1 = np.random.rand(2, 2)
b2 = np.random.rand(1, 1)
W2 = np.random.rand(1, 2)
# Activation functions and their derivatives
def g1(a): return a * (a > 0)   # ReLU
def g1p(a): return 1 * (a > 0)
def g2(a): return a             # Linear
def g2p(a): return 1
66
  • 101. Machine learning – ANN – Optimization Let's try to learn the x-or function (same setup with a logistic output instead of a linear one):
def g2(a): return 1 / (1 + np.exp(-a))   # Logistic
def g2p(a): return g2(a) * (1 - g2(a))
66
  • 102. Machine learning – ANN – Optimization Let's try to learn the x-or function (training loop):
gamma = .01   # step parameter (learning rate)
for t in range(0, 10000):
    # Forward phase
    a1 = W1.dot(x)
    h1 = g1(a1)
    a2 = W2.dot(h1)
    y = g2(a2)
    # Error evaluation
    e = y - d
    # Backward phase
    delta2 = g2p(a2) * e
    delta1 = g1p(a1) * W2.T.dot(delta2)
    # Gradient update
    W2 = W2 - gamma * delta2.dot(h1.T)
    W1 = W1 - gamma * delta1.dot(x.T)
67
  • 103. Machine learning – ANN – Optimization Remark 1: dealing with the bias is similar: add the biases in the forward phase (a1 = W1.dot(x) + b1 and a2 = W2.dot(h1) + b2) and update them with the summed deltas:
b2 = b2 - gamma * delta2.sum(axis=1, keepdims=True)
b1 = b1 - gamma * delta1.sum(axis=1, keepdims=True)
67
  • 104. Machine learning – ANN – Optimization Remark 2: using cross-entropy (with the logistic output) is simple too; only the error evaluation changes:
e = -d / y + (1 - d) / (1 - y)
(A finite-difference check of the backpropagated gradients follows below.) 67
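As a sanity check on the backpropagated gradients in the loop above, one can compare them against finite differences on the same tiny network. This sketch is not from the slides; the helper loss_and_grads re-implements the forward/backward passes for the linear-output, square-error case, and the perturbed entry of W1 is chosen arbitrarily.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
d = np.array([[0, 1, 1, 0]], dtype=float)

W1, b1 = rng.random((2, 2)), rng.random((2, 1))
W2, b2 = rng.random((1, 2)), rng.random((1, 1))

g1  = lambda a: a * (a > 0)           # ReLU
g1p = lambda a: 1.0 * (a > 0)

def loss_and_grads(W1, b1, W2, b2):
    a1 = W1 @ x + b1
    h1 = g1(a1)
    y = W2 @ h1 + b2                  # linear output
    e = y - d
    loss = 0.5 * np.sum(e ** 2)
    delta2 = e                        # g2'(a2) = 1 for a linear output
    delta1 = g1p(a1) * (W2.T @ delta2)
    return loss, delta2 @ h1.T, delta1 @ x.T

loss, gW2, gW1 = loss_and_grads(W1, b1, W2, b2)

# Central finite difference on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
W1m = W1.copy(); W1m[0, 1] -= eps
num = (loss_and_grads(W1p, b1, W2, b2)[0] - loss_and_grads(W1m, b1, W2, b2)[0]) / (2 * eps)
print(num, gW1[0, 1])                 # the two values should match closely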
  • 105–132. Machine learning – ANN – Optimization Let's see how it works [Animated slides: decision maps of the network output over $(x_1, x_2) \in [0, 1]^2$ at iterations 1, 11, 21, ..., 101, 201, ..., 1001, 2001, ..., 9001, progressively learning the x-or function.] 68
  • 133. Machine learning – ANN – Optimization Comparisons for different initializations & activations: successes (global minima), shown for a linear output and for a logistic output. 69
  • 134. Machine learning – ANN – Optimization Comparisons for different initializations & activations: failures (local minima or saddle points), shown for a linear output and for a logistic output. 70
  • 135. Machine learning – ANN – Optimization The 1990s view of ANN and back-propagation Advantages • Universal approximators, • May be very accurate. Drawbacks • Black box models (very difficult to interpret). • The learning time does not scale well • it was very slow in networks with multiple hidden layers. • It got stuck at local optima or saddle points • these can be nevertheless often surprisingly good. • All global minima do not have the same quality (generalization error) • strong variations in prediction errors on different testing datasets. 71
  • 137. Machine learning – Support Vector Machine Support Vector Machine (Source: Lucas Masuch & Vincent Lepetit) 72
  • 138. Machine learning – Support Vector Machine Support Vector Machine (Vapnik, 1995) • Very successful methods in the mid-90’s. • Developed for binary classification. • Clever type of perceptron. • Based on two major ideas 1 large margin, 2 kernel trick. Many NNs researchers switched to SVMs in the 1990s because they (used to) work better. 73
  • 139. Machine learning – Support Vector Machine Shattering points with oriented hyperplanes • Goal: build hyperplanes that separate points in two classes. • Question: which is the best separating line? Remember, hyperplane: ⟨w, x⟩ + b = 0 74
  • 140–143. Machine learning – Support Vector Machine Classification margin • Signed distance from an example x to the separator: r = (⟨x, w⟩ + b) / ‖w‖ • Examples closest to the hyperplane are the support vectors. • The margin ρ is the distance between the separator and the support vector(s). • Is this one good? Consider two testing samples: where are they assigned? • This separator may have a large generalization error. 75
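  As a quick illustration of the signed-distance and margin formulas above, here is a minimal NumPy sketch; the separator (w, b) and the sample points are made up for the example:

    import numpy as np

    # Hypothetical separator parameters (illustration only)
    w = np.array([2.0, -1.0])
    b = 0.5

    # A few sample points
    X = np.array([[1.0, 1.0],
                  [0.0, 2.0],
                  [2.0, 0.0]])

    # Signed distance r = (<x, w> + b) / ||w|| for every sample
    r = (X @ w + b) / np.linalg.norm(w)
    print("signed distances:", r)

    # The margin of this separator on these points is the smallest
    # absolute distance, attained by the support vector(s)
    print("margin:", np.abs(r).min())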
  • 144. Machine learning – Support Vector Machine Largest margin = Support Vector Machine • Maximizing the margin reduces generalization error (see PAC learning). • Then, there are necessarily support vectors on both sides of the hyperplane. • Only support vectors matter ⇒ other examples can be discarded. [Figure: poor separation vs. optimal separation] 76
  • 145. Machine learning – Support Vector Machine Training • As for neural networks, the parameters w and b are obtained by optimizing a loss function (here, the margin). • What is the formula for the margin? • Training dataset: feature vectors xi, targeted classes di ∈ {−1, +1}. • Assume the training samples to be linearly separable, i.e., there exist w and b such that ⟨xi, w⟩ + b ≥ +1 if di = +1 and ⟨xi, w⟩ + b ≤ −1 if di = −1, i.e., di(⟨xi, w⟩ + b) ≥ 1. Then: |⟨xi, w⟩ + b| ≥ 1 77
  • 146. Machine learning – Support Vector Machine Training • For support vectors, the inequality can be forced to be an equality: |⟨xi, w⟩ + b| = 1 • Then, the distance between any support vector and the hyperplane is |r| = |⟨xi, w⟩ + b| / ‖w‖ = 1 / ‖w‖ • So, the margin is (by definition) ρ = |r| = 1 / ‖w‖ • Maximizing the margin is then equivalent to minimizing ‖w‖², provided the hyperplane (w, b) separates the data. 78
  • 147. Machine learning – Support Vector Machine Training • Therefore, the problem can be recast as min over (w, b) of ½‖w‖² subject to di(⟨xi, w⟩ + b) ≥ 1, for all i. ⇒ Quadratic (convex) optimization problem subject to linear constraints, ⇒ No local minima! Only a single global one. Equivalent formulation: min over (w, b) of E(w, b), where E(w, b) = ½‖w‖² if di(⟨xi, w⟩ + b) ≥ 1 for all i, and +∞ otherwise. 79
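  The hard-margin problem above is a small quadratic program, so it can be handed to a generic constrained solver. A minimal sketch on a made-up separable toy dataset, using SciPy's SLSQP method (an illustrative choice, not part of the course material):

    import numpy as np
    from scipy.optimize import minimize

    # Made-up linearly separable dataset: two points per class
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    d = np.array([+1, +1, -1, -1])

    def objective(p):
        w = p[:2]                      # p = (w1, w2, b)
        return 0.5 * np.dot(w, w)      # (1/2) ||w||^2

    # One inequality constraint per sample: d_i (<x_i, w> + b) - 1 >= 0
    constraints = [{'type': 'ineq',
                    'fun': (lambda p, xi=xi, di=di: di * (xi @ p[:2] + p[2]) - 1)}
                   for xi, di in zip(X, d)]

    res = minimize(objective, x0=np.zeros(3), method='SLSQP',
                   constraints=constraints)
    w, b = res.x[:2], res.x[2]
    print("w =", w, "b =", b, "margin =", 1 / np.linalg.norm(w))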
  • 148. Machine learning – Support Vector Machine Training min over (w, b) of E(w, b), where E(w, b) = ½‖w‖² if di(⟨xi, w⟩ + b) ≥ 1 for all i, and +∞ otherwise. • We can use the technique of Lagrange multipliers: 1 Introduce a Lagrange multiplier αi ≥ 0 for each inequality constraint. 2 Let α = (α1, α2, . . .)ᵀ be the vector of Lagrange multipliers. 3 Define the (primal) function as LP(w, b, α) = ½‖w‖² − Σi αi (di(⟨xi, w⟩ + b) − 1) 4 Solve the saddle-point problem E(w, b) = max over α of LP(w, b, α) ⇒ min over (w, b) of E(w, b) = min over (w, b) max over α of LP(w, b, α) 80
  • 149. Machine learning – Support Vector Machine Training • When the original objective function is convex (and only then), we can interchange the minimization and maximization: min over (w, b) max over α of LP(w, b, α) = max over α min over (w, b) of LP(w, b, α) • The following is called the dual problem: LD(α) = min over (w, b) of LP(w, b, α), with LP(w, b, α) = ½⟨w, w⟩ − Σi αi (di(⟨xi, w⟩ + b) − 1) • For a fixed α, the minimum is achieved by simultaneously canceling the gradients with respect to w and b: ∇w LP(w, b, α) = 0 ⇒ w − Σi αi di xi = 0 ⇒ w = Σi αi di xi and ∂LP(w, b, α)/∂b = 0 ⇒ Σi αi di = 0 81
  • 150. Machine learning – Support Vector Machine Training The dual is obtained by plugging these equations into LP(w, b, α): LD(α) = ½⟨w, w⟩ − Σi αi (di(⟨xi, w⟩ + b) − 1) = ½⟨Σi αi di xi, Σj αj dj xj⟩ − Σi αi di (⟨xi, Σj αj dj xj⟩ + b) + Σi αi = ½ Σi Σj αi αj di dj ⟨xi, xj⟩ − Σi Σj αi αj di dj ⟨xi, xj⟩ − b Σi αi di (which is 0) + Σi αi = Σi αi − ½ Σi Σj αi αj di dj ⟨xi, xj⟩ 82
  • 151. Machine learning – Support Vector Machine SVM training algorithm 1 Maximize the dual (e.g., using projected coordinate ascent): max over α of LD(α) = Σi αi − ½ Σi Σj αi αj di dj ⟨xi, xj⟩ subject to αi ≥ 0 and Σi αi di = 0, for all i. 2 Each non-zero αi indicates that the corresponding xi is a support vector. 3 Deduce the parameters from these support vectors (s.v.): w = Σi∈s.v. αi di xi and b = di − Σj∈s.v. αj dj ⟨xi, xj⟩, for any s.v. i. • The dual is in practice more efficient to solve than the original problem, and it allows identifying the support vectors. • It only depends on the dot products ⟨xi, xj⟩ between all training points. 83
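  To make steps 1–3 concrete, here is a hedged sketch that maximizes the dual on the same kind of made-up toy dataset with a generic solver (again SciPy's SLSQP; the dataset and the support-vector tolerance are illustrative choices), then recovers w and b from the support vectors:

    import numpy as np
    from scipy.optimize import minimize

    # Made-up linearly separable toy data
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    d = np.array([+1.0, +1.0, -1.0, -1.0])
    G = (d[:, None] * d[None, :]) * (X @ X.T)   # G_ij = d_i d_j <x_i, x_j>

    def neg_dual(alpha):
        # Negated dual -L_D(alpha), since the solver minimizes
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    res = minimize(neg_dual, x0=np.zeros(len(d)), method='SLSQP',
                   bounds=[(0, None)] * len(d),                          # alpha_i >= 0
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ d}]) # sum_i alpha_i d_i = 0
    alpha = res.x

    sv = alpha > 1e-6                  # non-zero multipliers: support vectors
    w = (alpha[sv] * d[sv]) @ X[sv]    # w = sum_i alpha_i d_i x_i
    i = np.flatnonzero(sv)[0]
    b = d[i] - (alpha[sv] * d[sv]) @ (X[sv] @ X[i])   # b from any support vector
    print("alpha =", alpha, "\nw =", w, " b =", b)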
  • 152. Machine learning – Support Vector Machine SVM classification algorithm • Given the parameters of the hyperplane w and b, the SVM classifies a new sample x according to the sign of y = ⟨w, x⟩ + b (same as for the perceptron). • But (very important), this can be reformulated as y = Σi∈s.v. αi di ⟨x, xi⟩ + b, which only depends on the dot products ⟨x, xi⟩ between x and the support vectors – we will return to this later. 84
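  A self-contained sketch of this support-vector form of the decision function, with made-up support vectors, labels and multipliers (illustration only):

    import numpy as np

    # Hypothetical support vectors, their labels and multipliers
    X_sv = np.array([[2.0, 2.0], [-2.0, -2.0]])
    d_sv = np.array([+1.0, -1.0])
    a_sv = np.array([0.0625, 0.0625])
    b = 0.0

    def svm_decision(x):
        # y = sum_i alpha_i d_i <x, x_i> + b   (support vectors only)
        return (a_sv * d_sv) @ (X_sv @ x) + b

    x_new = np.array([1.0, 3.0])
    y = svm_decision(x_new)
    print("decision value:", y, "-> class", +1 if y >= 0 else -1)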
  • 153–154. Machine learning – Support Vector Machine Non-separable case What if the training set is not linearly separable? The optimization problem does not admit a solution. Solution: allow the system to make errors εi. 85
  • 155. Machine learning – Support Vector Machine Non-separable case – Soft-margin SVM • Just relax the constraints by permitting errors to some extent: min over (w, b, ε) of ½‖w‖² + C Σi εi subject to di(⟨xi, w⟩ + b) ≥ 1 − εi and εi ≥ 0, for all i. • The quantities εi are called slack variables. • The parameter C > 0 is chosen by the user and controls overfitting. • The dual problem becomes max over α of LD(α) = Σi αi − ½ Σi Σj αi αj di dj ⟨xi, xj⟩ subject to 0 ≤ αi ≤ C and Σi αi di = 0, for all i, and does not depend on the slack variables εi. 86
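  In practice these programs are rarely solved by hand; for instance, scikit-learn's SVC exposes the soft-margin trade-off directly through its C parameter. A minimal usage sketch on made-up, non-separable toy data:

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-class data that are *not* linearly separable
    X = np.array([[0, 0], [1, 1], [2, 2], [0, 2], [2, 0], [1, 0.9]])
    y = np.array([-1, -1, -1, +1, +1, +1])

    # Linear soft-margin SVM; a small C tolerates more slack (larger margin),
    # a large C penalizes training errors more heavily (risk of overfitting)
    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    print("support vectors:\n", clf.support_vectors_)
    print("predictions:", clf.predict([[0.5, 1.5], [1.5, 0.5]]))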
  • 156. Machine learning – Support Vector Machine Linear SVM – Overview • The SVM finds the separating hyperplane maximizing the margin. • Soft-margin SVM: trade-off between a large margin and errors. • These optimal hyperplanes are defined only in terms of support vectors. • The Lagrangian formulation allows identifying the support vectors as those with non-zero Lagrange multipliers αi. • The training and classification algorithms both depend only on dot products between data points. So far, the SVM is a linear separator (the best in some sense). But it cannot solve non-linear problems (such as XOR), as multi-layer neural networks can. 87
  • 157. Machine learning – Support Vector Machine Non-linear SVM – Overview What happens if the separator is non-linear? 88
  • 158. Machine learning – Support Vector Machine Non-linear SVM – Map to higher dimensions Here: (x1, x2) → (x1², x2², √2 x1x2) The input space (original feature space) can always be mapped to some higher-dimensional feature space where the training data set is linearly separable, via some (non-linear) transformation x → ϕ(x). 89
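  For this particular map, the dot product in the lifted space reduces to the square of the original dot product, a first hint of the kernel trick introduced below. A quick numerical check with arbitrary vectors:

    import numpy as np

    def phi(x):
        # Explicit lift (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(xp))   # dot product in the lifted space
    print((x @ xp) ** 2)      # equals <x, x'>^2 computed in the input space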
  • 159. Machine learning – Support Vector Machine Non-linear SVM – Map to higher dimensions Replace all occurrences of x by ϕ(x), for the training step LD(α) = Σi αi − ½ Σi Σj αi αj di dj ⟨ϕ(xi), ϕ(xj)⟩ and the classification step y = Σi∈s.v. αi di ⟨ϕ(x), ϕ(xi)⟩ + b • May require a feature space of really high (even infinite) dimension. • In this case, manipulating ϕ(x) might be tough, impossible, or lead to intense computation: complexity depends on the feature space dimension. 90
  • 160. Machine learning – Support Vector Machine Non-linear SVM – Kernel trick • SVMs do not care about feature vectors ϕ(x). • They rely only on dot products between them. Define K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ • K is called a kernel function: a dot product in a higher-dimensional space. • Even if ϕ(x) is tough to manipulate, K(x, x′) can be rather simple. • Mercer’s theorem: if K is continuous and symmetric positive semi-definite (i.e., the matrix K = (K(xi, xj))i,j is symmetric positive semi-definite for all finite sequences x1, . . . , xn), then there exists a mapping ϕ such that K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩. • We do not even need to know ϕ: pick a continuous symmetric positive semi-definite kernel K and use it instead of dot products. 91
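  The positive semi-definiteness condition can be probed numerically on a finite sample by building the Gram matrix of a candidate kernel and inspecting its eigenvalues. A small sketch with the Gaussian (RBF) kernel on random points (this only checks the condition on that particular sample, of course):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))          # 20 random 2-D points
    gamma = 0.5

    # Gram matrix K_ij = exp(-gamma ||x_i - x_j||^2)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq_dists)

    eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
    print("smallest eigenvalue:", eigvals.min())   # >= 0 up to round-off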
  • 161. Machine learning – Support Vector Machine Non-linear SVM – Kernel trick • Replace all occurrences of ⟨x, x′⟩ by K(x, x′), for the training step LD(α) = Σi αi − ½ Σi Σj αi αj di dj K(xi, xj) and the classification step y = Σi∈s.v. αi di K(x, xi) + b • Python implementation available in scikit-learn. Complexity depends on the input space dimension. 92
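  Since the slide points to scikit-learn, here is a minimal sketch of a kernel SVM on XOR-like data, which no linear separator can handle; the RBF kernel and the values of gamma and C are illustrative choices:

    import numpy as np
    from sklearn.svm import SVC

    # XOR-like data: not separable by any hyperplane in the input space
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([-1, +1, +1, -1])

    clf = SVC(kernel='rbf', gamma=2.0, C=10.0).fit(X, y)
    print(clf.predict(X))                 # should recover the XOR labels
    print("support vectors per class:", clf.n_support_)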
  • 162. Machine learning – Support Vector Machine Non-linear SVM – Standard kernels • Linear: K(x, x′) = ⟨x, x′⟩ • Polynomial: K(x, x′) = (γ⟨x, x′⟩ + β)^p • Gaussian: K(x, x′) = exp(−γ‖x − x′‖²) −→ Radial basis function (RBF) network. • Sigmoid: K(x, x′) = tanh(γ⟨x, x′⟩ + β) In practice: K is usually chosen by trial and error. A linear SVM is an optimal perceptron, but what is a non-linear SVM? 93
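  For reference, the four standard kernels written as plain NumPy functions; the parameter names gamma, beta and p mirror the formulas above, and the default values are arbitrary:

    import numpy as np

    def k_linear(x, xp):
        return x @ xp

    def k_poly(x, xp, gamma=1.0, beta=1.0, p=3):
        return (gamma * (x @ xp) + beta) ** p

    def k_rbf(x, xp, gamma=1.0):
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def k_sigmoid(x, xp, gamma=1.0, beta=0.0):
        return np.tanh(gamma * (x @ xp) + beta)

    x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(k_linear(x, xp), k_poly(x, xp), k_rbf(x, xp), k_sigmoid(x, xp))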
  • 163. Machine learning – Support Vector Machine Non-linear SVMs are shallow ANNs Consider: K(x, x′) = tanh(γ⟨x, x′⟩ + β) We get: y = Σi∈s.v. αi di K(x, xi) + b ← SVM, i.e., y = Σi∈s.v. αi di tanh(γ⟨x, xi⟩ + β) + b, i.e., y = ⟨w2, g(W1 x + b1)⟩ + b2 ← Shallow ANN, where W1 = γ [x1 x2 . . .]ᵀ (rows: the support vectors), b1 = β·1 (the all-ones vector), w2 = (α1 d1, α2 d2, . . .)ᵀ (Lagrange multipliers and desired classes of the support vectors), b2 = b, and g = tanh. The number of support vectors is the number of hidden units. 94
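  A quick numerical check of this rewriting, with made-up support vectors, multipliers and kernel parameters (any values work, since the two expressions are algebraically identical):

    import numpy as np

    # Made-up support vectors, labels, multipliers and kernel parameters
    X_sv = np.array([[1.0, 2.0], [-0.5, 0.3], [2.0, -1.0]])
    d_sv = np.array([+1.0, -1.0, +1.0])
    a_sv = np.array([0.3, 0.7, 0.2])
    gamma, beta, b = 0.8, 0.1, -0.05
    x = np.array([0.4, -0.6])

    # SVM view: y = sum_i alpha_i d_i tanh(gamma <x, x_i> + beta) + b
    y_svm = np.sum(a_sv * d_sv * np.tanh(gamma * (X_sv @ x) + beta)) + b

    # Shallow-ANN view: y = <w2, tanh(W1 x + b1)> + b2
    W1, b1 = gamma * X_sv, beta * np.ones(3)
    w2, b2 = a_sv * d_sv, b
    y_ann = w2 @ np.tanh(W1 @ x + b1) + b2

    print(y_svm, y_ann)   # identical up to floating-point round-off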
  • 164. Machine learning – Support Vector Machine Difference between SVMs and ANNs • SVMs: • are based on a stronger mathematical theory, • avoid over-fitting, • are not trapped in local minima, • work well with fewer training samples. • But they are limited to binary classification → extensions to regression in (Vapnik et al., 1997) and PCA in (Schölkopf et al., 1999). • Tuning SVMs is tough: → selecting a specific kernel (feature space) and the regularization parameter C is done by trial and error. 95
  • 165. Machine learning – Support Vector Machine Similarity between SVMs and ANNs (in binary classification) • Similar formalisms: ⟨w, ϕ(x)⟩ + b ≥ 0? SVM • ϕ(x): feature vector, • ϕ: non-linear function implicitly defined by the kernel K, • UAT (universal approximation theorem): ϕ could be approximated by a shallow NN. ANN • ϕ(x): output of the last hidden layer, • ϕ: non-linear recursive function learned by backpropagation, • the output of hidden layers can be interpreted as feature vectors. 96
  • 166. Machine learning – Support Vector Machine Similarity between SVMs and ANNs (in binary classification) • Both are linear separators in a suitable feature space, • Mapping to the feature space is obtained by a non-linear transform, • For SVM, the non-linear transform is chosen by the user. Instead, ANNs learn it with backpropagation. • ANNs can also do it recursively layer by layer: → this is called a feature hierarchy, → this is the foundation of deep learning. 97
  • 167. Machine learning – Deep learning What’s next? (Source: Lucas Masuch & Vincent Lepetit) 98
  • 168. Questions? Next class: Introduction to deep learning Slides and images courtesy of • P. Gallinari • V. Lepetit • A. O. i Puig • L. Masuch • A. W. Moore • C. Hazırbaş • M. Welling 98