https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Loss Functions for Deep Learning - Javier Ruiz Hidalgo - UPC Barcelona 2018
1. [course site]
Javier Ruiz Hidalgo
javier.ruiz@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Loss functions
Day 3 Lecture 1
#DLUPC
2. Me
● Javier Ruiz Hidalgo
○ Email: j.ruiz@upc.edu
○ Office: UPC, Campus Nord, D5-008
● Teaching experience
○ Basic signal processing
○ Project based
○ Image processing & computer vision
● Research experience
○ Master's on hierarchical image representations from UEA (UK)
○ PhD on video coding from UPC (Spain)
○ Interests in image & video coding, 3D analysis and super-resolution
4. Motivation
Model
● 3-layer neural network (2 hidden layers)
● Tanh units (activation function)
● 512-512-10
● Softmax on top layer
● Cross entropy loss
Input: 28 × 28 image (784 pixels)

Layer | #Weights  | #Biases | Total
1     | 784 × 512 | 512     | 401,920
2     | 512 × 512 | 512     | 262,656
3     | 512 × 10  | 10      | 5,130
      |           |         | 669,706

How do we find good values for all 669,706 parameters?
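As a rough sketch, the model above can be written in a few lines of Keras (an assumption: the slides do not prescribe a framework); the parameter count it reports matches the table.

```python
# Hypothetical Keras version of the slide's model (framework choice is an assumption).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(512, activation="tanh", input_shape=(784,)),  # 784*512 + 512 = 401,920
    layers.Dense(512, activation="tanh"),                      # 512*512 + 512 = 262,656
    layers.Dense(10, activation="softmax"),                    # 512*10  + 10  =   5,130
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()  # 669,706 trainable parameters in total
```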
5. Outline
● Introduction
○ Definition, properties, training process
● Common types of loss functions
○ Regression
○ Classification
● Regularization
● Example
7. Definition
In a supervised deep learning context, the loss function measures the quality of a particular set of parameters based on how well the output of the network agrees with the ground-truth labels in the training data.
8. Loss function (1)
How well does our network perform on the training data?
[Diagram: a deep network maps an input, through its parameters (weights, biases), to an output; the loss compares that output with the labels (ground truth) and yields an error.]
9. Loss function (2)
● The loss function is not meant to measure the entire performance of the network against a validation/test dataset.
10. Loss function (3)
● The loss function is used to guide the training process in order to find a set of parameters that reduce the value of the loss.
11. Training process
Stochastic gradient descent
● Find a set of parameters which makes the loss as small as possible.
● Change parameters at a rate determined by the partial derivatives of the loss function:

$\theta \leftarrow \theta - \eta \, \frac{\partial L(\theta)}{\partial \theta}$ ($\eta$ is the learning rate)
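A minimal numpy sketch of this update rule (the names theta, grad and lr are assumptions, not taken from the slides):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One gradient-descent update: move parameters against the loss gradient."""
    return theta - lr * grad

# Toy usage: minimize L(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([5.0])
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta)
print(theta)  # very close to 0, the minimizer of theta^2
```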
12. Properties (1)
● Minimum (0 value) when the output of the network is equal to the ground-truth data.
● Increases in value when the output differs from the ground truth.
[Plot: loss vs. epoch, decreasing as the network output goes from different to similar to the ground truth (GT).]
14. Properties (2)
● In reality → so many parameters (on the order of millions) that the loss is not convex
15. Properties (3)
● Varies smoothly with changes in the output
○ Better gradients for gradient descent
○ Easy to compute how small changes in the parameters yield an improvement in the loss
16. Assumptions
● For backpropagation to work:
○ The loss function can be written as an average over the loss functions of individual training examples (the empirical risk):

$L = \frac{1}{n} \sum_{i=1}^{n} L_i$

○ The loss function can be written as a function of the output activations of the neural network.

Neural Networks and Deep Learning
17. Outline
● Introduction
○ Definition, properties, training process
● Common types of loss functions
○ Regression
○ Classification
● Regularization
● Example
18. Common types of loss functions (1)
● Loss functions depend on the type of task:
○ Regression: the network predicts continuous, numeric variables
■ Examples: length of fish in images, price of a house
■ Losses: absolute value, square error
19. Loss functions for regression
● L1 distance / absolute value: lighter computation (only |·|)
● L2 distance / Euclidean distance / square value: heavier computation (square)
20. Loss functions for regression
● L1 distance / absolute value: less penalty for errors > 1, more penalty for errors < 1
● L2 distance / Euclidean distance / square value: more penalty for errors > 1, less penalty for errors < 1
21. Loss functions for regression
● L1 distance / absolute value: less sensitive to outliers, which have very high $|y_i - f_\theta(x_i)|$
● L2 distance / Euclidean distance / square value: more sensitive to outliers, which have very high $|y_i - f_\theta(x_i)|$
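A small numpy sketch of this outlier behaviour (the toy data and the names y_true / y_pred are assumptions):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 10.0])  # the last sample acts as an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])

l1 = np.mean(np.abs(y_true - y_pred))     # absolute error: linear in the residual
l2 = np.mean((y_true - y_pred) ** 2)      # square error: quadratic in the residual

print(l1)  # ~1.85  -> the outlier contributes proportionally
print(l2)  # ~12.27 -> the outlier dominates the loss
```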
22. Outline
● Introduction
○ Definition, properties, training process
● Common types of loss functions
○ Regression
○ Classification
● Regularization
● Example
23. Common types of loss functions (2)
● Loss functions depend on the type of task:
○ Classification: the network predicts categorical variables (fixed number of classes)
■ Examples: classify email as spam, classify images of numbers
■ Losses: hinge loss, cross-entropy loss
24. Classification (1)
We want the network to classify the input into a fixed number of classes.
[Diagram: inputs being sorted into class “1”, class “2”, class “3”.]
25. Classification (2)
● Each input can have only one label
○ One prediction per output class
■ The network will have “k” outputs (number of classes)
[Diagram: the network maps an input to k output scores/logits, e.g. 0.1, 2 and 1 for classes “1”, “2” and “3”.]
26. Classification (3)
● How can we create a loss function to improve the scores?
○ Encode the labels (ground truth of the data) as vectors → one-hot encoding
○ Non-probabilistic interpretation → hinge loss
○ Probabilistic interpretation: need to transform the scores into a probability distribution → softmax
27. One-hot encoding
● Transform each label into a vector (with only 1 and 0)
○ Length equal to the total number of classes “k”
○ Value of 1 for the correct class and 0 elsewhere
class “1” → [1, 0, 0]
class “2” → [0, 1, 0]
class “3” → [0, 0, 1]
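A minimal one-hot encoding sketch (the helper name one_hot is an assumption):

```python
import numpy as np

def one_hot(label, k):
    """Return a length-k vector with 1 at the correct class and 0 elsewhere."""
    v = np.zeros(k)
    v[label] = 1.0
    return v

print(one_hot(1, 3))  # class "2" (0-indexed label 1) -> [0. 1. 0.]
```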
28. Softmax (1)
● Convert scores into confidences/probabilities
○ From 0.0 to 1.0
○ Probability for all classes adds to 1.0
[Diagram: the network's output scores/logits 0.1, 2, 1 become the probabilities 0.1, 0.7, 0.2.]
29. Softmax (2)
● Softmax function: $S(l_k) = \frac{e^{l_k}}{\sum_j e^{l_j}}$, where $l_k$ are the scores (logits)
[Diagram: scores/logits 0.1, 2, 1 → probabilities 0.1, 0.7, 0.2.]
Neural Networks and Deep Learning (softmax)
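A numerically stable numpy sketch of the softmax function (the max-subtraction is a common implementation detail, not something the slide shows):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max to avoid exp overflow
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([0.1, 2.0, 1.0])).round(2))  # [0.1, 0.66, 0.24]; the slide rounds to 0.1, 0.7, 0.2
```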
36. Cross-entropy loss (4)
● In general, cross-entropy loss works better than square error loss:
○ Square error loss usually gives too much emphasis to incorrect outputs.
○ With square error loss, as an output gets closer to either 0.0 or 1.0 the gradients get smaller, so the weight updates shrink and training slows down.
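For reference, a minimal cross-entropy sketch for one-hot targets (numpy; the eps guard is an implementation assumption):

```python
import numpy as np

def cross_entropy(probs, target_one_hot, eps=1e-12):
    """-sum_k y_k * log(p_k); eps guards against log(0)."""
    return -np.sum(target_one_hot * np.log(probs + eps))

print(cross_entropy(np.array([0.1, 0.7, 0.2]),
                    np.array([0.0, 1.0, 0.0])))  # -log(0.7) ~ 0.36
```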
37. Weighted cross-entropy
● Dealing with class imbalance
○ Different number of samples for different classes
○ Introduce a weighting factor (related to inverse class frequency, or treated as a hyperparameter)
[Images: classes “ant”, “flamingo” and “panda” with different numbers of samples.]
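A sketch of the weighting idea (the per-class weight vector and its values are assumptions):

```python
import numpy as np

def weighted_cross_entropy(probs, target_one_hot, class_weights, eps=1e-12):
    """Cross-entropy where each class contributes according to its weight."""
    return -np.sum(class_weights * target_one_hot * np.log(probs + eps))

# Example: up-weight a rare class (weights could come from inverse class frequency).
w = np.array([1.0, 5.0, 1.0])
print(weighted_cross_entropy(np.array([0.1, 0.7, 0.2]),
                             np.array([0.0, 1.0, 0.0]), w))  # 5 * -log(0.7) ~ 1.78
```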
38. Multi-label classification (1)
● Outputs can be matched to more than one label
○ “car”, “automobile”, “motor vehicle” can all apply to the same image of a car.
Bucak, Serhat Selcuk, Rong Jin, and Anil K. Jain. "Multi-label learning with incomplete class assignments." CVPR 2011.
39. Multi-label classification (2)
● Outputs can be matched to more than one label
○ “car”, “automobile”, “motor vehicle” can all apply to the same image of a car.
● → directly use a sigmoid at each output independently, instead of a softmax
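A sketch of the multi-label setup: independent sigmoids with a binary cross-entropy per output (the toy logits and targets are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits  = np.array([2.0, 1.5, -1.0])  # one score per label
targets = np.array([1.0, 1.0, 0.0])   # more than one label can be active

probs = sigmoid(logits)               # each output in (0, 1), independently
bce = -np.mean(targets * np.log(probs)
               + (1 - targets) * np.log(1 - probs))
print(bce)
```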
41. Outline
● Introduction
○ Definition, properties, training process
● Common types of loss functions
○ Regression
○ Classification
● Regularization
● Example
42. Regularization
● Control the capacity of the network to prevent overfitting by adding a penalty term to the loss:
○ L2-regularization (weight decay): $\lambda \sum_j w_j^2$
○ L1-regularization: $\lambda \sum_j |w_j|$
where $\lambda$ is the regularization parameter
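A sketch of adding either penalty to a data loss (the function and parameter names are assumptions):

```python
import numpy as np

def regularized_loss(data_loss, weights, reg_lambda=1e-4, kind="l2"):
    """Add an L1 or L2 penalty, scaled by the regularization parameter."""
    if kind == "l2":
        penalty = np.sum(weights ** 2)   # weight decay
    else:
        penalty = np.sum(np.abs(weights))
    return data_loss + reg_lambda * penalty
```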
43. Outline
● Introduction
○ Definition, properties, training process
● Common types of loss functions
○ Regression
○ Classification
● Regularization
● Example
44. Example
Why you should use cross-entropy instead of classification error

Network A:
Confidences S(l_k)    Ground truth    Predictions
0.1  0.2  0.7         1  0  0         0  0  1
0.3  0.4  0.3         0  1  0         0  1  0
0.3  0.3  0.4         0  0  1         0  0  1
Classification accuracy = 2/3 = 0.66, cross-entropy loss = 4.14

Network B:
Confidences S(l_k)    Ground truth    Predictions
0.3  0.3  0.4         1  0  0         0  0  1
0.1  0.7  0.2         0  1  0         0  1  0
0.1  0.2  0.7         0  0  1         0  0  1
Classification accuracy = 2/3 = 0.66, cross-entropy loss = 1.63

Both networks have the same accuracy, but cross-entropy distinguishes the quality of their confidences.
45. Example
Why you should use cross-entropy instead of classification error

a) Compute the accuracy of the predictions.

Confidences S(l_k)    Ground truth    Predictions
0.1  0.2  0.7         1  0  0         0  0  1
0.3  0.4  0.3         0  1  0         0  1  0
0.3  0.3  0.4         0  0  1         0  0  1

The class with the maximum confidence is predicted.
46. Example

a) Compute the accuracy of the predictions.

Confidences S(l_k)    Ground truth    Predictions
0.1  0.2  0.7         1  0  0         0  0  1
0.3  0.4  0.3         0  1  0         0  1  0
0.3  0.3  0.4         0  0  1         0  0  1

Class predictions and ground truth can be compared with a confusion matrix:

                Prediction
GT              Class 1   Class 2   Class 3
Class 1            0         0         1
Class 2            0         1         0
Class 3            0         0         1

Accuracy = 2/3 = 0.66
47. Example
Why you should use cross-entropy instead of classification error

a) Compute the accuracy of the predictions → classification accuracy = 2/3
b) Compute the cross-entropy loss of the predictions → cross-entropy loss = 4.14

Confidences S(l_k)    Ground truth y
0.1  0.2  0.7         1  0  0
0.3  0.4  0.3         0  1  0
0.3  0.3  0.4         0  0  1
48. Example
Why you should use cross-entropy instead of classification error

b) Compute the cross-entropy loss of the predictions.

Confidences S(l_k)    Ground truth y    -log of the confidence at the true class
0.1  0.2  0.7         1  0  0           -log(0.1) = 2.3
0.3  0.4  0.3         0  1  0           -log(0.4) = 0.92
0.3  0.3  0.4         0  0  1           -log(0.4) = 0.92
49. Example
Why you should use cross-entropy instead of classification error

b) Compute the cross-entropy loss of the predictions.

Confidences S(l_k)    Ground truth y    -log of the confidence at the true class
0.1  0.2  0.7         1  0  0           2.3
0.3  0.4  0.3         0  1  0           0.92
0.3  0.3  0.4         0  0  1           0.92

L = 2.3 + 0.92 + 0.92 = 4.14
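The worked example can be checked with a few lines of numpy (the array names are assumptions):

```python
import numpy as np

conf = np.array([[0.1, 0.2, 0.7],
                 [0.3, 0.4, 0.3],
                 [0.3, 0.3, 0.4]])
gt   = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

accuracy = np.mean(conf.argmax(axis=1) == gt.argmax(axis=1))  # 2/3
loss = -np.sum(gt * np.log(conf))                             # 2.3 + 0.92 + 0.92
print(accuracy, loss)  # 0.666..., ~4.14
```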
50. Assignment

a) Compute the accuracy of the predictions.
b) Compute the cross-entropy loss of the predictions.
c) Compare the results with those obtained in the example and discuss how accuracy and cross-entropy have captured the quality of the predictions.

Confidences S(l_k)    Ground truth
0.3  0.3  0.4         1  0  0
0.1  0.7  0.2         0  1  0
0.1  0.2  0.7         0  0  1

Why you should use cross-entropy instead of classification error
51. References
● About loss functions
● Neural Networks and Deep Learning (Michael Nielsen)
● Are loss functions all the same?
● Convolutional Neural Networks for Visual Recognition (Stanford CS231n)
● Deep Learning (Goodfellow, Bengio and Courville), MIT Press, 2016
● On Loss Functions for Deep Neural Networks in Classification