
Loss Functions for Deep Learning - Javier Ruiz Hidalgo - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and a computational perspective.



  1. 1. [course site] Javier Ruiz Hidalgo javier.ruiz@upc.edu Associate Professor Universitat Politècnica de Catalunya Technical University of Catalonia Loss functions Day 3 Lecture 1 #DLUPC
  2. 2. Me ● Javier Ruiz Hidalgo ○ Email: j.ruiz@upc.edu ○ Office: UPC, Campus Nord, D5-008 ● Teaching experience ○ Basic signal processing ○ Project based ○ Image processing & computer vision ● Research experience ○ Master's on hierarchical image representations from UEA (UK) ○ PhD on video coding from UPC (Spain) ○ Interests in image & video coding, 3D analysis and super-resolution 2
  3. 3. Acknowledgements 3 Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University Xavier Giró xavier.giro@upc.edu Associate Professor ETSETB TelecomBCN Universitat Politècnica de Catalunya
  4. 4. Motivation 4 Model ● 3-layer neural network (2 hidden layers) ● Tanh units (activation function) ● 512-512-10 ● Softmax on top layer ● Cross-entropy loss ● Input: 28 x 28 images. Parameters per layer: layer 1: 784 x 512 weights + 512 biases = 401,920; layer 2: 512 x 512 weights + 512 biases = 262,656; layer 3: 512 x 10 weights + 10 biases = 5,130; total = 669,706.
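To make the parameter count above concrete, here is a minimal sketch of such a model, assuming PyTorch (the framework choice is mine, not the slides'):

```python
# Minimal sketch of the Motivation model: a 3-layer MLP with tanh units,
# 512-512-10, trained with a softmax / cross-entropy loss (assumes PyTorch).
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),           # 28 x 28 input image -> 784-dim vector
    nn.Linear(784, 512),    # layer 1: 784 x 512 weights + 512 biases = 401,920
    nn.Tanh(),
    nn.Linear(512, 512),    # layer 2: 512 x 512 weights + 512 biases = 262,656
    nn.Tanh(),
    nn.Linear(512, 10),     # layer 3: 512 x 10 weights + 10 biases = 5,130
)

# CrossEntropyLoss applies the softmax (as a log-softmax) internally,
# so the network itself outputs raw scores (logits).
loss_fn = nn.CrossEntropyLoss()

print(sum(p.numel() for p in model.parameters()))  # 669706, matching the slide
```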
  5. 5. Outline ● Introduction ○ Definition, properties, training process ● Common types of loss functions ○ Regression ○ Classification ● Regularization ● Example 5
  6. 6. Nomenclature 6 loss function = cost function = objective function = error function
  7. 7. Definition In a supervised deep learning context the loss function measures the quality of a particular set of parameters based on how well the output of the network agrees with the ground truth labels in the training data. 7
  8. 8. Loss function (1) 8 How well does our network do with the training data? [Diagram: input → Deep Network (parameters: weights, biases) → output; the output is compared against the labels (ground truth) to give the error]
  9. 9. Loss function (2) ● The loss function is not meant to measure the entire performance of the network against a validation/test dataset. 9
  10. 10. Loss function (3) ● The loss function is used to guide the training process in order to find a set of parameters that reduce the value of the loss function. 10
  11. 11. Training process Stochastic gradient descent ● Find a set of parameters which make the loss as small as possible. ● Change parameters at a rate determined by the partial derivatives of the loss function: 11
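The update rule the slide refers to can be written in standard notation (my reconstruction; θ denotes the parameters, η the learning rate and L the loss):

```latex
\theta \leftarrow \theta - \eta \,\frac{\partial \mathcal{L}}{\partial \theta}
```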
  12. 12. Properties (1) ● Minimum (0 value) when the output of the network is equal to the ground truth data. ● Increases in value as the output differs from the ground truth. 12 [Plot: loss vs. epoch, annotated from "different from GT" to "similar to GT"]
  13. 13. Properties (2) 13 ● Ideally → convex function
  14. 14. Properties (2) 14 ● In reality → so many parameters (in the order of millions) that it is not convex
  15. 15. Properties (3) 15 ● Varies smoothly with changes in the output ○ Better gradients for gradient descent ○ Easy to compute how small changes in the parameters improve the loss
  16. 16. Assumptions ● For backpropagation to work: ○ The loss function can be written as an average over the loss functions of individual training examples (the empirical risk): ○ The loss function can be written as a function of the output activations from the neural network. 16 (Neural Networks and Deep Learning)
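The averaged (empirical risk) form mentioned in the first assumption can be reconstructed in standard notation (my notation; the formula itself did not survive the transcript):

```latex
\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_i\big(f_\theta(x_i),\, y_i\big)
```

Each per-example term depends on the network only through its output fθ(xi), which is the second assumption.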
  17. 17. Outline ● Introduction ○ Definition, properties, training process ● Common types of loss functions ○ Regression ○ Classification ● Regularization ● Example 17
  18. 18. Common types of loss functions (1) ● Loss functions depend on the type of task: ○ Regression: the network predicts continuous, numeric variables ■ Example: Length of fishes in images, price of a house ■ Absolute value, square error 18
  19. 19. 19 Loss functions for regression ● L1 distance / absolute value: lighter computation (only |·|) ● L2 distance / Euclidean distance / squared error: heavier computation (square)
  20. 20. 20 Loss functions for regression ● L1 distance / absolute value: less penalty for errors > 1, more penalty for errors < 1 ● L2 distance / Euclidean distance / squared error: more penalty for errors > 1, less penalty for errors < 1
  21. 21. 21 Loss functions for regression ● L1 distance / absolute value: less sensitive to outliers, which have a very large yi − fθ(xi) ● L2 distance / Euclidean distance / squared error: more sensitive to outliers, which have a very large yi − fθ(xi)
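A small sketch of the two regression losses compared above, assuming NumPy (the helper names and example values are mine):

```python
# L1 (absolute value) vs. L2 (squared error) regression losses.
import numpy as np

def l1_loss(y_true, y_pred):
    # mean of |y_i - f_theta(x_i)| over the batch
    return np.mean(np.abs(y_true - y_pred))

def l2_loss(y_true, y_pred):
    # mean of (y_i - f_theta(x_i))^2 over the batch
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # the last target acts as an outlier
y_pred = np.array([1.1, 2.1, 2.9, 10.0])

print(l1_loss(y_true, y_pred))  # ~22.6: grows linearly with the outlier
print(l2_loss(y_true, y_pred))  # ~2025: dominated by the squared outlier term
```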
  22. 22. Outline ● Introduction ○ Definition, properties, training process ● Common types of loss functions ○ Regression ○ Classification ● Regularization ● Example 22
  23. 23. Common types of loss functions (2) ● Loss functions depend on the type of task: ○ Classification: the network predicts categorical variables (fixed number of classes) ■ Example: classify email as spam, classify images of numbers. ■ hinge loss, Cross-entropy loss 23
  24. 24. Classification (1) We want the network to classify the input into a fixed number of classes 24 class “1” class “2” class “3”
  25. 25. Classification (2) ● Each input can have only one label ○ One prediction per output class ■ The network will have "k" outputs (number of classes) 25 [Diagram: input → Network → one output score/logit per class, e.g. 0.1, 2, 1 for classes "1", "2", "3"]
  26. 26. Classification (3) ● How can we create a loss function to improve the scores? ○ Somehow write the labels (ground truth of the data) into a vector → One-hot encoding ○ Non-probabilistic interpretation → hinge loss ○ Probabilistic interpretation: need to transform the scores into a probability function → Softmax 26
  27. 27. One-hot encoding ● Transform each label into a vector (with only 1s and 0s) ○ Length equal to the total number of classes "k" ○ Value of 1 for the correct class and 0 elsewhere 27 class "1" → 1 0 0, class "2" → 0 1 0, class "3" → 0 0 1
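A minimal one-hot encoding sketch, assuming NumPy (the helper name is mine):

```python
# Turn a class index into a one-hot vector of length num_classes.
import numpy as np

def one_hot(label, num_classes):
    v = np.zeros(num_classes)
    v[label] = 1.0          # 1 for the correct class, 0 elsewhere
    return v

print(one_hot(0, 3))  # class "1" -> [1. 0. 0.]
print(one_hot(1, 3))  # class "2" -> [0. 1. 0.]
print(one_hot(2, 3))  # class "3" -> [0. 0. 1.]
```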
  28. 28. Softmax (1) ● Convert scores into confidences/probabilities ○ From 0.0 to 1.0 ○ Probabilities for all classes add up to 1.0 28 [Diagram: input → Network → scores/logits (0.1, 2, 1) → softmax → probabilities (0.1, 0.7, 0.2)]
  29. 29. Softmax (2) ● Softmax function 29 [Diagram: input → Network → scores/logits (0.1, 2, 1) → softmax → probabilities (0.1, 0.7, 0.2)] (Neural Networks and Deep Learning)
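A minimal softmax sketch, assuming NumPy, reproducing the scores used throughout these slides (the slide rounds the resulting probabilities to 0.1 / 0.7 / 0.2):

```python
# Numerically stable softmax: exponentiate the logits and normalise them.
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))   # subtracting the max avoids overflow
    return z / np.sum(z)

scores = np.array([0.1, 2.0, 1.0])        # scores / logits from the slide
print(softmax(scores))                    # ~[0.10, 0.66, 0.24]
print(softmax(scores).sum())              # probabilities add up to 1.0
```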
  31. 31. Cross-entropy loss (1) 31 [Diagram: input → Network → scores/logits lk (0.1, 2, 1) → softmax probabilities S(lk) (0.1, 0.7, 0.2), compared with the one-hot labels yk (0, 1, 0) by the loss function] Loss contribution from input xi in a training batch.
  32. 32. Cross-entropy loss (2) 32 [Same diagram: scores/logits lk (0.1, 2, 1), probabilities S(lk) (0.1, 0.7, 0.2), labels yk (0, 1, 0)] In a one-hot vector, only one yk = 1 (the rest are 0)!
  33. 33. Cross-entropy loss (2) 33 [Same diagram: scores/logits lk (0.1, 2, 1), probabilities S(lk) (0.1, 0.7, 0.2), labels yk (0, 1, 0)] Only the confidence for the GT class is considered.
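The per-example loss on these slides can be reconstructed as the standard cross-entropy, using the slides' symbols S(lk) for the softmax output and yk for the one-hot label (the formula itself did not survive the transcript):

```latex
\mathcal{L}_i = -\sum_{k} y_k \log S(l_k) = -\log S(l_{\mathrm{GT}})
```

The simplification holds because only the GT class has yk = 1.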
  34. 34. Cross-entropy loss (3) 34 -log penalizes small confidence scores for GT class
  35. 35. Cross-entropy loss (3) ● For a set of n inputs 35 [Formula annotations: labels (one-hot), softmax]
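A plausible reconstruction of the formula for a set of n inputs (note that the worked example later in the deck sums the per-example terms rather than averaging them, so the 1/n factor may not appear on the original slide):

```latex
\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k} y_{ik}\,\log S(l_{ik})
```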
  36. 36. Cross-entropy loss (4) ● In general, cross-entropy loss works better than square error loss: ○ Square error loss usually gives too much emphasis to incorrect outputs. ○ In square error loss, as the output gets closer to either 0.0 or 1.0 the gradients get smaller, the change in weights gets smaller and training is slower. 36
  37. 37. Weighted cross-entropy 37 ● Dealing with class imbalance ○ Different number of samples for different classes ○ Introduce a weighting factor (related to the inverse class frequency, or treated as a hyperparameter) [Example classes: "ant", "flamingo", "panda"]
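A sketch of weighted cross-entropy, assuming PyTorch; the class counts and the inverse-frequency weighting are illustrative assumptions, not values from the slide:

```python
# Weight each class by (roughly) its inverse frequency so that rare classes
# contribute more to the loss.
import torch
import torch.nn as nn

counts = torch.tensor([1000.0, 100.0, 10.0])   # e.g. samples of "ant", "flamingo", "panda"
weights = counts.sum() / counts                 # inverse class frequency
weights = weights / weights.sum()               # normalisation (optional design choice)

loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)                      # batch of 8 inputs, 3 classes
labels = torch.randint(0, 3, (8,))
print(loss_fn(logits, labels))
```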
  38. 38. Multi-label classification (1) ● Outputs can be matched to more than one label ○ "car", "automobile", "motor vehicle" can be applied to the same image of a car. 38 Bucak, Serhat Selcuk, Rong Jin, and Anil K. Jain. "Multi-label learning with incomplete class assignments." CVPR 2011.
  39. 39. Multi-label classification (2) ● Outputs can be matched to more than one label ○ "car", "automobile", "motor vehicle" can be applied to the same image of a car. ● → directly use a sigmoid at each output independently instead of a softmax 39
  40. 40. Multi-label classification (3) ● Cross-entropy loss for multi-label classification: 40
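A minimal multi-label sketch, assuming PyTorch: one independent sigmoid per output with a binary cross-entropy on each label (BCEWithLogitsLoss fuses the sigmoid and the loss). The logits and targets are made up for illustration:

```python
import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()

logits = torch.tensor([[2.0, 1.5, -1.0]])   # scores for e.g. "car", "automobile", "motor vehicle"
targets = torch.tensor([[1.0, 1.0, 0.0]])   # several labels can be active at the same time
print(loss_fn(logits, targets))
```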
  41. 41. Outline ● Introduction ○ Definition, properties, training process ● Common types of loss functions ○ Regression ○ Classification ● Regularization ● Example 41
  42. 42. Regularization ● Control the capacity of the network to prevent overfitting ○ L2-regularization (weight decay): ○ L1-regularization: 42 [Formula annotation: regularization parameter]
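The regularized objectives on this slide can be reconstructed in standard form (my notation: λ is the regularization parameter, wj the weights):

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{data}} + \lambda \sum_j w_j^2 \quad \text{(L2 / weight decay)}
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{data}} + \lambda \sum_j |w_j| \quad \text{(L1)}
```

In practice, L2 regularization is often applied through the optimizer (e.g. the weight_decay argument in PyTorch optimizers).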
  43. 43. Outline ● Introduction ○ Definition, properties, training process ● Common types of loss functions ○ Regression ○ Classification ● Regularization ● Example 43
  44. 44. Example 44 (Why you should use cross-entropy instead of classification error) Two cases are compared: cross-entropy loss = 1.63 vs. cross-entropy loss = 4.14, while classification accuracy = 2/3 = 0.66 in both. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4; confidences S(lk): 0.1 0.7 0.2 / 0.1 0.2 0.7. Ground Truth: 1 0 0 / 0 1 0 / 0 0 1. Predictions: 0 0 1 / 0 1 0 / 0 0 1.
  45. 45. Example 45 (Why you should use cross-entropy instead of classification error) a) Compute the accuracy of the predictions. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4. The class with the maximum confidence is predicted, giving predictions 0 0 1 / 0 1 0 / 0 0 1. Ground truth: 1 0 0 / 0 1 0 / 0 0 1.
  46. 46. Example 46 a) Compute the accuracy of the predictions. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4; predictions: 0 0 1 / 0 1 0 / 0 0 1; ground truth: 1 0 0 / 0 1 0 / 0 0 1. Class predictions and ground truth can be compared with a confusion matrix: GT class 1 → predicted class 3; GT class 2 → predicted class 2; GT class 3 → predicted class 3. Accuracy = 2 / 3 = 0.66
  47. 47. Example 47 (Why you should use cross-entropy instead of classification error) a) Compute the accuracy of the predictions → classification accuracy = 2/3. b) Compute the cross-entropy loss of the prediction → cross-entropy loss = 4.14. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4. Ground truth y: 1 0 0 / 0 1 0 / 0 0 1.
  48. 48. Example 48 (Why you should use cross-entropy instead of classification error) a) Compute the accuracy of the predictions. b) Compute the cross-entropy loss of the prediction. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4; ground truth y: 1 0 0 / 0 1 0 / 0 0 1. Taking -log of the confidence for the GT class of each input gives 2.3, 0.92 and 0.92.
  49. 49. Example 49 (Why you should use cross-entropy instead of classification error) a) Compute the accuracy of the predictions. b) Compute the cross-entropy loss of the prediction. Confidences S(lk): 0.1 0.2 0.7 / 0.3 0.4 0.3 / 0.3 0.3 0.4; ground truth y: 1 0 0 / 0 1 0 / 0 0 1. The -log terms 2.3, 0.92 and 0.92 add up to L = 4.14.
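The computations in steps a) and b) can be reproduced with a short NumPy sketch (same numbers as the slides):

```python
# Accuracy and cross-entropy of the worked example.
import numpy as np

confidences = np.array([[0.1, 0.2, 0.7],
                        [0.3, 0.4, 0.3],
                        [0.3, 0.3, 0.4]])
ground_truth = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]])

# accuracy: the predicted class is the one with the highest confidence
accuracy = (confidences.argmax(axis=1) == ground_truth.argmax(axis=1)).mean()
print(accuracy)                        # 0.666... = 2/3

# cross-entropy: -log of the confidence assigned to the GT class of each input
per_example = -np.log(confidences[ground_truth == 1])
print(per_example.round(2))            # [2.3  0.92 0.92]
print(per_example.sum().round(2))      # 4.14
```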
  50. 50. Assignment 50 (Why you should use cross-entropy instead of classification error) a) Compute the accuracy of the predictions. b) Compute the cross-entropy loss of the prediction. c) Compare the result with the one obtained in the example and discuss how accuracy and cross-entropy have captured the quality of the predictions. Confidences S(lk): 0.3 0.3 0.4 / 0.1 0.7 0.2 / 0.1 0.2 0.7. Ground truth: 1 0 0 / 0 1 0 / 0 0 1.
  51. 51. References ● About loss functions ● Neural networks and deep learning ● Are loss functions all the same? ● Convolutional neural networks for Visual Recognition ● Deep learning book, MIT Press, 2016 ● On Loss Functions for Deep Neural Networks in Classification 51
  52. 52. Thanks! Questions? 52 https://imatge.upc.edu/web/people/javier-ruiz-hidalgo
