
Backpropagation - Elisa Sayrol - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. [course site] Day 2 Lecture 2 Backpropagation Elisa Sayrol
  2. 2. Acknowledgements Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University Xavier Giro-i-Nieto xavier.giro@upc.edu
  3. 3. …in our previous lecture
  4. 4. Multilayer perceptrons. When each node in each layer is a linear combination of all inputs from the previous layer, the network is called a multilayer perceptron (MLP). The weights can be organized into matrices, and the forward pass computes the network output by applying each layer in turn.
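As a minimal sketch of this forward pass, the following NumPy snippet organizes the weights into matrices and applies one layer after another; the function name, the sigmoid non-linearity and the layer sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

def forward_mlp(x, weights, biases):
    """Forward pass of an MLP: each layer is a linear combination of the
    previous layer's outputs followed by a non-linearity (here a sigmoid)."""
    h = x
    for W, b in zip(weights, biases):
        a = W @ h + b                   # linear part
        h = 1.0 / (1.0 + np.exp(-a))    # non-linear part
    return h

# Tiny example: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward_mlp(np.array([1.0, -2.0, 0.5]), weights, biases))
```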
  5. 5. Training MLPs. With multilayer perceptrons we need to find the gradient of the loss function with respect to all the parameters of the model (W^(k), b^(k)). These can be found using the chain rule of differentiation. The calculations reveal that the gradient with respect to the parameters in layer k only depends on the error from the layer above and the output from the layer below. This means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network. This is known as the backpropagation algorithm.
  6. 6. Backpropagation algorithm • Computational graphs • Examples applying the chain rule to simple graphs • Backpropagation applied to the multilayer perceptron • Another perspective: modular backprop
  7. 7. Computational graphs. Examples from the Deep Learning Book: (a) z = x·y; (b) ŷ = σ(xᵀw + b); (c) H = max(0, XW + b), built from matmul and relu nodes; (d) ŷ = xᵀw together with a weight-decay penalty on the weights (λ Σᵢ wᵢ²).
  8. 8. Computational graphs. Applying the chain rule to computational graphs: if y = g(x) and z = f(g(x)) = f(y), then dz/dx = (dz/dy)(dy/dx). When a node y = g(x₁, x₂) has several inputs and z = f(y), the gradient flows back to each input: ∂z/∂x₁ = (∂z/∂y)(∂y/∂x₁) and ∂z/∂x₂ = (∂z/∂y)(∂y/∂x₂).
  9. 9. From computational graphs to MLP. Numerical example (from the Stanford course Convolutional Neural Networks for Visual Recognition 2017), applying the chain rule to computational graphs. Let f(x, y, z) = (x + y)·z and q = x + y, so f = q·z. The local derivatives are ∂q/∂x = 1, ∂q/∂y = 1, ∂f/∂q = z and ∂f/∂z = q; we want ∂f/∂x, ∂f/∂y and ∂f/∂z at the example point x = −2, y = 5, z = −4, where the forward pass gives q = 3 and f = −12. Backward pass: ∂f/∂f = 1; ∂f/∂q = z = −4; ∂f/∂z = q = 3; ∂f/∂x = (∂f/∂q)(∂q/∂x) = −4·1 = −4; ∂f/∂y = (∂f/∂q)(∂q/∂y) = −4·1 = −4.
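The same worked example written out as plain Python, so the forward and backward passes can be traced line by line (a sketch; the variable names are ours):

```python
# f(x, y, z) = (x + y) * z evaluated at the point used on the slide.
x, y, z = -2.0, 5.0, -4.0

# Forward pass.
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule), starting from df/df = 1.
df_df = 1.0
df_dq = z * df_df              # -4
df_dz = q * df_df              #  3
df_dx = 1.0 * df_dq            # -4, since dq/dx = 1
df_dy = 1.0 * df_dq            # -4, since dq/dy = 1

print(q, f, df_dx, df_dy, df_dz)   # 3.0 -12.0 -4.0 -4.0 3.0
```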
  10. 10. From computational graphs to MLP. Numerical example (from the Stanford course Convolutional Neural Networks for Visual Recognition), applying the chain rule to computational graphs: a sigmoid neuron f(w, x) = σ(w₀x₀ + w₁x₁ + w₂), with σ(x) = 1/(1 + e⁻ˣ) and dσ(x)/dx = ((1 + e⁻ˣ − 1)/(1 + e⁻ˣ))·(1/(1 + e⁻ˣ)) = (1 − σ(x))·σ(x). With w₀ = 2, x₀ = −1, w₁ = −3, x₁ = −2, w₂ = −3, the forward pass gives w₀x₀ = −2, w₁x₁ = 6, w₀x₀ + w₁x₁ = 4, w₀x₀ + w₁x₁ + w₂ = 1 and f = σ(1) ≈ 0.73. The backward pass starts from ∂f/∂f = 1; the sigmoid gate passes back (1 − 0.73)·0.73 ≈ 0.2, and then ∂f/∂w₀ = 0.2·x₀ = −0.2, ∂f/∂x₀ = 0.2·w₀ = 0.4, ∂f/∂w₁ = 0.2·x₁ = −0.4, ∂f/∂x₁ = 0.2·w₁ = −0.6, ∂f/∂w₂ = 0.2.
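And the sigmoid-neuron example in code; the printed gradients match the slide's values up to rounding (a sketch with our own variable names):

```python
import numpy as np

# Sigmoid neuron f(w, x) = sigma(w0*x0 + w1*x1 + w2) with the slide's values.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass.
s = w0 * x0 + w1 * x1 + w2        # -2 + 6 - 3 = 1
f = 1.0 / (1.0 + np.exp(-s))      # sigma(1) ~ 0.73

# Backward pass: the local gradient of the sigmoid is (1 - f) * f ~ 0.2.
ds = (1.0 - f) * f
dw0, dx0 = ds * x0, ds * w0       # ~ -0.2 and ~ 0.4
dw1, dx1 = ds * x1, ds * w1       # ~ -0.4 and ~ -0.6
dw2 = ds                          # ~ 0.2

print(round(f, 2), [round(g, 2) for g in (dw0, dx0, dw1, dx1, dw2)])
```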
  11. 11. Computational graphs. Gates, backward pass. In general, each gate multiplies the incoming gradient by the derivative of its function; for example, for the sigmoid σ(x) = 1/(1 + e⁻ˣ), dσ(x)/dx = (1 − σ(x))·σ(x). Sum (q = x + y, with ∂q/∂x = 1 and ∂q/∂y = 1): distributes the gradient to both branches. Product (f = q·z, with ∂f/∂q = z and ∂f/∂z = q): switches the gradient weight values, i.e. each input receives the gradient multiplied by the value of the other input. Max: routes the gradient only to the higher input branch (not sensitive to the lower branch). Add branches: branches that split in the forward pass and merge in the backward pass add their gradients.
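The three gate rules can be written as tiny backward functions that take the upstream gradient and return one gradient per input branch (a sketch, not library code):

```python
def add_backward(dout):
    # Sum gate: distributes the same gradient to both branches.
    return dout, dout

def mul_backward(dout, a, b):
    # Product gate: each input gets the gradient times the other input's value.
    return dout * b, dout * a

def max_backward(dout, a, b):
    # Max gate: routes the gradient only to the branch with the higher input.
    return (dout, 0.0) if a >= b else (0.0, dout)

print(add_backward(2.0))             # (2.0, 2.0)
print(mul_backward(2.0, 3.0, -4.0))  # (-8.0, 6.0)
print(max_backward(2.0, 1.0, 5.0))   # (0.0, 2.0)
```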
  12. 12. Computational graphs. Numerical example (from the Stanford course Convolutional Neural Networks for Visual Recognition 2017). f(x, W) = ‖W·x‖² = Σᵢ (W·x)ᵢ², with q = W·x. For W = [[0.1, 0.5], [−0.3, 0.8]] and x = [0.2, 0.4], the forward pass gives q = [0.22, 0.26] and f = 0.116 (the L2 node). Backward pass: ∂f/∂qᵢ = 2qᵢ, so ∇_q f = 2q = [0.44, 0.52]; ∇_W f = 2q·xᵀ = [[0.088, 0.176], [0.104, 0.208]]; ∇_x f = 2Wᵀ·q = [−0.112, 0.636].
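The matrix example in NumPy; the outer product and the transposed matrix reproduce the gradients ∇_W f and ∇_x f from the slide (a sketch with our own variable names):

```python
import numpy as np

# f(x, W) = ||W x||^2 with the values used on the slide.
W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass.
q = W @ x                 # [0.22, 0.26]
f = np.sum(q ** 2)        # 0.116

# Backward pass.
dq = 2 * q                # [0.44, 0.52]
dW = np.outer(dq, x)      # [[0.088, 0.176], [0.104, 0.208]]
dx = W.T @ dq             # [-0.112, 0.636]

print(f, dq, dW, dx, sep="\n")
```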
  13. 13. Backpropagation applied to an element of the MLP. For a single neuron with its linear and non-linear parts: a^(l+1) = W^(l)·h^(l) + b^(l) and h^(l+1) = g(a^(l+1)), so the local derivatives are ∂a^(l+1)/∂h^(l) = (W^(l))ᵀ and ∂h^(l+1)/∂a^(l+1) = g′(a^(l+1)).
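In code, the two local rules of that element (multiply by g′(a^(l+1)) for the non-linearity and by the transposed weight matrix for the linear part) look like this; the function, names and example values are illustrative:

```python
import numpy as np

def neuron_backward(upstream, W, a, g_prime):
    """Backward pass through one linear + non-linear element:
    a = W h + b, h_next = g(a)."""
    da = g_prime(a) * upstream   # through the non-linearity: elementwise g'(a)
    dh = W.T @ da                # through the linear part: multiply by W^T
    return da, dh

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
sigmoid_prime = lambda a: sigmoid(a) * (1.0 - sigmoid(a))

W = np.array([[1.0, -2.0], [0.5, 0.3]])
h = np.array([0.2, -0.1])
a = W @ h                        # biases omitted for brevity
print(neuron_backward(np.ones(2), W, a, sigmoid_prime))
```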
  14. 14. …Backpropagation applied to Multilayer Perceptron
  15. 15. Forward pass and backward pass through an MLP. [Figure: network with input x, weights W1, W2, W3, pre-activations a2, a3, a4, hidden activations h2, h3, output h4 and loss L, drawn once for the forward pass and once for the backward pass.] Backpropagation is applied during the backward pass.
  16. 16. Forward pass: the output layer gives the probability of each class given the input (softmax). [Same MLP figure.] Figure credit: Kevin McGuinness.
  17. 17. Forward pass: probability of each class given the input (softmax); loss function, e.g. negative log-likelihood (good for classification); regularization term (L2 norm), also known as weight decay. [Same MLP figure.] Figure credit: Kevin McGuinness.
  18. 18. Forward pass: minimize the loss (plus the regularization term) with respect to the parameters over the whole training set. [Same MLP figure.] Figure credit: Kevin McGuinness.
  19. 19. Backward pass, step 1: find the error in the top layer. [Same MLP figure, with the loss L driving the backward pass.] Figure credit: Kevin McGuinness.
  20. 20. Backward pass: 1. find the error in the top layer; 2. compute the weight updates. To simplify, we don't consider the biases. [Same MLP figure.] Figure credit: Kevin McGuinness.
  21. 21. Backward pass: 1. find the error in the top layer; 2. compute the weight updates; 3. backpropagate the error to the layer below. To simplify, we don't consider the biases. [Same MLP figure.] Figure credit: Kevin McGuinness.
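The three backward-pass steps above can be put together in a short NumPy sketch. It assumes sigmoid hidden units and a squared-error loss rather than the softmax and negative log-likelihood of the slides, and it ignores biases as the slides do; all names are illustrative.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, Ws, lr=0.1):
    # Forward pass: store the output of every layer.
    hs = [x]
    for W in Ws:
        hs.append(sigmoid(W @ hs[-1]))

    # 1. Find the error in the top layer (squared-error loss, sigmoid output).
    delta = (hs[-1] - y) * hs[-1] * (1 - hs[-1])

    for k in reversed(range(len(Ws))):
        # 2. Compute the weight updates: error from above x output from below.
        grad_W = np.outer(delta, hs[k])
        # 3. Backpropagate the error to the layer below (skip for the input).
        if k > 0:
            delta = (Ws[k].T @ delta) * hs[k] * (1 - hs[k])
        Ws[k] -= lr * grad_W
    return Ws

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
backprop_step(np.array([1.0, 0.5, -1.0]), np.array([0.0, 1.0]), Ws)
```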
  22. 22. Another perspective: modular backprop. You could use the chain rule on all the individual neurons to compute the gradients with respect to the parameters and backpropagate the error signal, but it is more convenient to use the layer abstraction and define the backpropagation algorithm in terms of three operations that layers need to be able to do. This is called modular backpropagation.
  23. 23. The layer abstraction * see change of notation *
  24. 24. Linear layer
  25. 25. ReLU layer
  26. 26. Modular backprop. Using this idea, it is possible to create many types of layers: ● linear (fully connected) layers ● activation functions (sigmoid, ReLU) ● convolutions ● pooling ● dropout. Once layers support the forward and backward operations, they can be plugged together to create more complex functions. [Diagram: Convolution, ReLU and Linear modules stacked; during the backward pass each module receives an input error, produces the gradients of its parameters, and passes an output error on to the next module.]
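A minimal sketch of this layer abstraction, assuming NumPy and our own class and method names (no particular library's API): each layer implements a forward pass and a backward pass that returns the error for the layer below, and stores the gradients of its own parameters.

```python
import numpy as np

class Linear:
    """Fully connected layer: y = W x + b."""
    def __init__(self, n_in, n_out, rng=np.random.default_rng(0)):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return self.W @ x + self.b

    def backward(self, dout):
        self.dW = np.outer(dout, self.x)  # gradients of the parameters
        self.db = dout
        return self.W.T @ dout            # error passed to the layer below

class ReLU:
    def forward(self, x):
        self.x = x
        return np.maximum(0.0, x)

    def backward(self, dout):
        return dout * (self.x > 0)        # routes the gradient only where x > 0

# Once layers support forward and backward, they can be plugged together.
layers = [Linear(3, 4), ReLU(), Linear(4, 2)]
h = np.array([1.0, -0.5, 2.0])
for layer in layers:
    h = layer.forward(h)
err = np.ones(2)                          # stand-in for the error from the loss
for layer in reversed(layers):
    err = layer.backward(err)
```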
  27. 27. Implementation notes. Caffe and Torch: libraries like Caffe and Torch implement backpropagation this way; to define a new layer, you need to create a class and define the forward and backward operations. Theano and TensorFlow: libraries like Theano and TensorFlow operate on a computational graph; to define a new layer, you only need to specify the forward operation, and autodiff is used to automatically infer the backward one. You also don't need to implement backprop manually in Theano or TensorFlow, and they use computational graph optimizations to automatically factor out common computations.
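For comparison, this is roughly what the "only specify the forward operation" style looks like with a modern TensorFlow API (GradientTape, which is eager rather than the graph mode the slide refers to); it is shown only as an illustration of automatic differentiation, not as the slide's original example.

```python
import tensorflow as tf

# Only the forward computation is written; the framework derives the backward pass.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x      # forward operation
dy_dx = tape.gradient(y, x)   # automatic differentiation: 2x + 2 = 8 at x = 3
print(dy_dx.numpy())
```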
  28. 28. Issues on backpropagation and training. Gradient descent: move each parameter θⱼ in small steps in the direction opposite to the sign of the derivative of the loss with respect to θⱼ, e.g. θ^(t) = θ^(t−1) − α^(t−1) ∇_θ ℒ(f(x; θ), y) − λ θ^(t−1). Stochastic gradient descent (SGD): estimate the gradient with one sample or, better, with a minibatch of examples. Weight decay: a regularization term that penalizes large weights and distributes values among all the parameters. Momentum: the movement direction of the parameters averages the current gradient estimate with previous ones. Several strategies have been proposed to update the weights: optimizers.
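A sketch of the update rules named on this slide (plain SGD with weight decay and classical momentum); alpha, lam and mu stand for the learning rate, the weight-decay factor and the momentum coefficient, and all names and values are illustrative.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, alpha=0.01, lam=1e-4, mu=0.9):
    # Weight decay: add lam * theta to the gradient (penalizes large weights).
    grad = grad + lam * theta
    # Momentum: average the current gradient direction with previous updates.
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity

theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
grad = np.array([0.3, -0.1])      # gradient estimated from one minibatch (SGD)
theta, velocity = sgd_momentum_step(theta, grad, velocity)
print(theta, velocity)
```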
  29. 29. Note on hyperparameters. So far we have lots of hyperparameters to choose: 1. learning rate (α) 2. weight decay (λ) 3. number of epochs 4. number of hidden layers 5. nodes in each hidden layer 6. weight initialization strategy 7. loss function 8. activation functions 9. … … more in the next class.
  30. 30. Summary • Backpropagation is applied during the backward pass while training • Computational graphs help to understand the chain rule of differentiation • The gradients of the parameters in layer k only depend on the error from the layer above and the output from the layer below; this means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network • Hyperparameters have to be chosen, and the choice is not obvious • For a "deeper" study: http://www.deeplearningbook.org/
