
Backpropagation (D1L5 Deep Learning for Speech and Language)


https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent neural networks (RNNs) will be presented and analyzed in detail to understand the potential of these state-of-the-art tools for time-series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.


Backpropagation (D1L5 Deep Learning for Speech and Language)

  1. [course site] Day 1 Lecture 5: Backpropagation in Deep Networks. Elisa Sayrol.
  2. From previous lectures: a network with L hidden layers, with a hidden pre-activation (k > 0), a hidden activation (k = 1, …, L) and an output activation (k = L + 1); the corresponding equations are sketched after the transcript. Figure credit: Hugo Larochelle, NN course.
  3. Backpropagation algorithm. The output of the network gives class scores that depend on the input and the parameters.
     • Define a loss function that quantifies our unhappiness with the scores across the training data.
     • Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
  4. Forward pass: probability of a class given an input (softmax). [Figure: feedforward network with input x, pre-activations a2–a4, hidden activations h2 and h3, output h4, weights W1–W3, and a loss node. Figure credit: Kevin McGuinness.]
  5. Forward pass: probability of a class given an input (softmax); loss function, e.g. negative log-likelihood (good for classification); regularization term (L2 norm), also known as weight decay. [Same network figure; figure credit: Kevin McGuinness.]
  6. Forward pass: minimize the loss (plus some regularization term) w.r.t. the parameters over the whole training set; a NumPy sketch of this forward pass follows the transcript. [Same network figure and annotations; figure credit: Kevin McGuinness.]
  7. Backpropagation algorithm
     • We need a way to fit the model to data: find the parameters (W(k), b(k)) of the network that (locally) minimize the loss function.
     • We can use stochastic gradient descent, or better yet, mini-batch stochastic gradient descent.
     • To do this, we need to find the gradient of the loss function with respect to all the parameters of the model (W(k), b(k)).
     • These can be found using the chain rule of differentiation.
     • The calculations reveal that the gradient w.r.t. the parameters in layer k only depends on the error from the layer above and the output from the layer below.
     • This means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network. This is known as the backpropagation algorithm. Slide credit: Kevin McGuinness
  8. Backward pass: 1. find the error in the top layer. [Same network figure as slides 4–6, with the loss gradient flowing backwards; figure credit: Kevin McGuinness.]
  9. Backward pass: 1. find the error in the top layer; 2. compute the weight updates. To simplify, we don't consider the biases. [Same network figure; figure credit: Kevin McGuinness.]
  10. Backward pass: 1. find the error in the top layer; 2. compute the weight updates; 3. backpropagate the error to the layer below (a NumPy sketch of this backward pass follows the transcript). To simplify, we don't consider the biases. [Same network figure; figure credit: Kevin McGuinness.]
  11. Optimization: stochastic gradient descent; stochastic gradient descent with momentum; stochastic gradient descent with L2 regularization (learning rate, weight decay). Update rules are sketched after the transcript. Recommended reading — Backpropagation: http://cs231n.github.io/optimization-2/; Optimization: http://sebastianruder.com/optimizing-gradient-descent/ (Sebastian Ruder's blog).
  12. "Vanishing gradients": in the backward pass you might be in the flat part of the sigmoid (or of another saturating activation such as tanh), so the derivative tends to zero and your training loss will not go down; a numeric example follows the transcript.
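
The definitions on slide 2 appear only as images in the deck; below is a sketch of the usual feedforward equations in Hugo Larochelle's notation, which the slide seems to follow (the symbols g for the hidden non-linearity, o for the output non-linearity and f for the overall network function are assumptions, not taken from the slide):

    \begin{align*}
    a^{(k)}(x) &= b^{(k)} + W^{(k)} h^{(k-1)}(x), \quad h^{(0)}(x) = x  && \text{hidden pre-activation } (k > 0) \\
    h^{(k)}(x) &= g\big(a^{(k)}(x)\big)                                  && \text{hidden activation } (k = 1, \dots, L) \\
    h^{(L+1)}(x) &= o\big(a^{(L+1)}(x)\big) = f(x)                       && \text{output activation } (k = L + 1)
    \end{align*}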
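
As a concrete companion to slides 4–6, here is a minimal NumPy sketch of the forward pass for the three-weight network in the figures (x, a2, h2, a3, h3, a4, h4 and W1–W3), with a softmax output, a negative log-likelihood loss and an L2 penalty. The ReLU hidden activation, the parameter layout and the weight_decay name are assumptions for illustration, not details from the slides:

    import numpy as np

    def softmax(a):
        # Subtract the row-wise max before exponentiating for numerical stability.
        e = np.exp(a - a.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def forward(x, params):
        # params = [(W1, b1), (W2, b2), (W3, b3)]; x has shape (batch, input_dim).
        (W1, b1), (W2, b2), (W3, b3) = params
        a2 = x @ W1 + b1           # hidden pre-activation
        h2 = np.maximum(0.0, a2)   # hidden activation (ReLU assumed here)
        a3 = h2 @ W2 + b2
        h3 = np.maximum(0.0, a3)
        a4 = h3 @ W3 + b3          # output pre-activation (class scores)
        h4 = softmax(a4)           # probability of each class given the input
        return a2, h2, a3, h3, a4, h4

    def loss(h4, y, params, weight_decay):
        # Negative log-likelihood of the correct class plus an L2 penalty (weight decay).
        n = y.shape[0]
        nll = -np.log(h4[np.arange(n), y] + 1e-12).mean()
        l2 = sum((W ** 2).sum() for W, _ in params)
        return nll + 0.5 * weight_decay * l2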
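
And a matching sketch of the backward pass from slides 7–10, following the three steps on the slides (top-layer error, weight updates, error backpropagated to the layer below). For a softmax output with a negative log-likelihood loss, the top-layer error is the predicted probabilities minus the one-hot targets; unlike the slides, the bias gradients are included here for completeness:

    def backward(x, y, cache, params, weight_decay):
        # cache is the tuple returned by forward(); y holds integer class labels.
        a2, h2, a3, h3, a4, h4 = cache
        (W1, b1), (W2, b2), (W3, b3) = params
        n = y.shape[0]

        # 1. Error in the top layer: for softmax + negative log-likelihood this is
        #    (predicted probabilities - one-hot targets) / batch size.
        delta4 = h4.copy()
        delta4[np.arange(n), y] -= 1.0
        delta4 /= n

        # 2. Weight updates: each gradient uses the error from the layer above and
        #    the output of the layer below, plus the weight-decay term.
        dW3 = h3.T @ delta4 + weight_decay * W3
        db3 = delta4.sum(axis=0)

        # 3. Backpropagate the error to the layer below through the ReLU derivative.
        delta3 = (delta4 @ W3.T) * (a3 > 0)
        dW2 = h2.T @ delta3 + weight_decay * W2
        db2 = delta3.sum(axis=0)

        delta2 = (delta3 @ W2.T) * (a2 > 0)
        dW1 = x.T @ delta2 + weight_decay * W1
        db1 = delta2.sum(axis=0)
        return [(dW1, db1), (dW2, db2), (dW3, db3)]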
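
The update rules listed on slide 11 could be sketched as follows; the function name and the default hyperparameter values are illustrative assumptions, and L2 regularization enters through the weight-decay term already folded into the gradients above:

    def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
        # One mini-batch update of all layers; lr is the learning rate.
        new_params, new_velocities = [], []
        for (W, b), (dW, db), (vW, vb) in zip(params, grads, velocities):
            vW = momentum * vW - lr * dW
            vb = momentum * vb - lr * db
            new_params.append((W + vW, b + vb))
            new_velocities.append((vW, vb))
        return new_params, new_velocities

A training loop would then call forward(), loss(), backward() and sgd_momentum_step() once per mini-batch; plain stochastic gradient descent is the special case momentum = 0.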
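
A small numeric illustration of slide 12: the sigmoid derivative is at most 0.25 and shrinks rapidly once the pre-activation sits in the flat part of the curve, which is what makes the backpropagated error vanish:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    a = np.array([0.0, 2.0, 5.0, 10.0])       # pre-activations from the steep to the flat region
    grad = sigmoid(a) * (1.0 - sigmoid(a))    # derivative of the sigmoid at those points
    print(grad)                               # approx. [0.25, 0.105, 0.0066, 0.000045]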
