
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016


Advanced Spark and TensorFlow Meetup 08-04-2016

Fundamental Algorithms of Neural Networks including Gradient Descent, Back Propagation, Auto Differentiation, Partial Derivatives, Chain Rule



  1. Backprop, Gradient Descent, and Auto Differentiation (Sam Abrahams, Memdump LLC)
  2. Link to these slides: https://goo.gl/tKOvr7
  3. YO! I am Sam Abrahams. I am a data scientist and engineer. You can find me on GitHub @samjabrahams. Buy my book: TensorFlow for Machine Intelligence.
  4. 1. Gradient Descent: Guess and Check for Adults
  5. Gradient Descent Outline ▣ Problem: fit the data ▣ Basic OLS linear regression ▣ Visualize the error curve and regression line ▣ Step through the changes one at a time
  6. Simple Start: Linear Regression [scatter plot of the data]
  7. Simple Start: Linear Regression [ordinary least squares fit line]
  8. Simple Start: Linear Regression ▣ Want to find a model that can fit our data ▣ Could do it algebraically… ▣ BUT that doesn’t generalize well
  9. Simple Start: Linear Regression ▣ Step back: what does ordinary linear regression try to do? ▣ Minimize the sum of (or average) squared error ▣ How else could we minimize it?
  10. Gradient Descent ▣ Start with a random guess ▣ Use the derivative (the gradient, when dealing with multiple variables) to get the slope of the error curve ▣ Move our parameters so that we move down the error curve
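A minimal sketch of that loop for the OLS setting the deck uses. The toy data, learning rate, and step count are my own illustration, not from the slides:

```python
import numpy as np

# Toy data: a noisy line y ≈ 2x + 1 (illustrative, not from the deck)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

w, b = rng.normal(), rng.normal()   # start with a random guess
lr = 0.01                           # learning rate

for step in range(2000):
    err = (w * x + b) - y
    # Partial derivatives of the mean squared error J = mean(err**2)
    dJ_dw = 2 * np.mean(err * x)
    dJ_db = 2 * np.mean(err)
    # Move the parameters down the error curve
    w -= lr * dJ_dw
    b -= lr * dJ_db

print(w, b)   # converges toward 2.0 and 1.0
```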
  11-23. Single Variable Cost Curve [figure sequence: cost J plotted against a single weight W. A random guess puts us partway up the curve; the slope ∂J/∂W there is negative, so we move to the right. At the new point ∂J/∂W is still negative, so we move to the right again. Each step follows the slope ∂J/∂W of the error curve down toward the minimum.]
  24. 1.5 Gradient Descent Variants: Intelligent descent into madness
  25. Gradient Descent Variants ▣ There are additional techniques that can help speed up (or otherwise improve) gradient descent ▣ The next slides describe some of them ▣ More details (and some awesome visuals) here: article by Sebastian Ruder
  26. Gradient Descent ▣ Get the true gradient with respect to all examples ▣ One step = one epoch ▣ Slow, and generally infeasible for large training sets
  27. Gradient Descent [figure]
  28. Stochastic Gradient Descent ▣ Basic idea: approximate the derivative using only one example ▣ “Online learning” ▣ Update weights after each example
  29. Stochastic Gradient Descent [figure]
  30. Mini-Batch Gradient Descent ▣ Similar idea to stochastic gradient descent ▣ Approximate the derivative with a sample batch of examples ▣ A middle ground between “true” stochastic gradient descent and full gradient descent
  31. Mini-Batch Gradient Descent [figure]
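A sketch contrasting the three update schedules from slides 26-31 on one model. The linear-model gradient, learning rate, and batch size are illustrative choices, not from the deck:

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean squared error for a linear model y ≈ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

def batch_gd(w, X, y, lr=0.01):
    # Full gradient descent: one step per epoch, using every example
    return w - lr * grad(w, X, y)

def sgd(w, X, y, lr=0.01):
    # Stochastic: one step per example, a noisy one-sample gradient estimate
    for i in np.random.permutation(len(y)):
        w = w - lr * grad(w, X[i:i+1], y[i:i+1])
    return w

def minibatch_gd(w, X, y, lr=0.01, batch_size=32):
    # Mini-batch: one step per small batch, between the two extremes
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        w = w - lr * grad(w, X[b], y[b])
    return w
```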
  32. Momentum ▣ Idea: if we see multiple gradients in a row pointing in the same direction, we should take larger steps in that direction ▣ Accumulate a “momentum” vector to speed up descent (a sketch follows the next two figures)
  33. [Figure: descent path without momentum]
  34. [Figure: descent path with momentum]
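A sketch of the classical momentum update described on slide 32; the names v (momentum vector) and mu (momentum coefficient) are conventional, not from the deck:

```python
def momentum_step(w, v, g, lr=0.01, mu=0.9):
    """One classical momentum update (sketch).

    w: parameters, v: accumulated momentum vector, g: current gradient.
    Gradients that keep pointing the same way pile up in v, so repeated
    agreement produces larger effective steps in that direction.
    """
    v = mu * v - lr * g
    return w + v, v
```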
  35. Nesterov Momentum ▣ Idea: before updating our weights, look ahead to where the accumulated momentum will carry us ▣ Adjust the update based on that “future” position
  36. Nesterov Momentum [diagram, source: lecture by Geoffrey Hinton; legend: momentum vector, gradient/correction, Nesterov steps, standard momentum steps]
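And the Nesterov variant from slides 35-36: evaluate the gradient at the look-ahead point before correcting. Again a sketch; grad_fn is an assumed callable mapping parameters to their gradient:

```python
def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update (sketch).

    Rather than evaluating the gradient at w, first jump to where the
    accumulated momentum is about to carry us, then correct from there.
    """
    lookahead = w + mu * v               # where momentum alone would take us
    v = mu * v - lr * grad_fn(lookahead)
    return w + v, v
```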
  37. AdaGrad ▣ Idea: update individual weights differently depending on how frequently they change ▣ Keeps a running tally of squared updates for each weight, and scales each new update down by a factor of that tally ▣ Downside: over long training runs the tally only grows, so eventually all updates diminish ▣ Paper on jmlr.org
  38. AdaDelta / RMSProp ▣ Two slightly different algorithms with the same concept: only keep a window over the previous n gradients when scaling updates ▣ Seeks to fix AdaGrad’s diminishing-update problem ▣ AdaDelta paper on arxiv.org
  39. Adam ▣ Adam expands on the concepts introduced with AdaDelta and RMSProp ▣ Uses decayed running estimates of both the first and second moments of the gradient ▣ Paper on arxiv.org
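The Adam update in code, a sketch of the rule from the Kingma & Ba paper; the default hyperparameters are the paper's suggestions:

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (sketch).

    m, v: decayed estimates of the first and second moments of gradient g;
    t: 1-based step count, used to bias-correct the young estimates.
    """
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```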
  40. 2. Forward & Back Propagation: The Chain Rule got the last laugh, high-school-you
  41. Beyond OLS Regression ▣ Can’t do everything with linear regression! ▣ Nor with polynomial regression… ▣ Why can’t we let the computer figure out how to model the data?
  42. Neural Networks: Idea ▣ Chain together non-linear functions ▣ Have lots of parameters that can be adjusted ▣ These “weights” determine the model function
  43. Feed-forward neural network: input layer l(1) holds x1, x2, and a +1 bias unit; hidden layers l(2) and l(3); output layer l(4). Weight matrices W(1), W(2), W(3) connect the layers, producing activations a(2), a(3), a(4) and the prediction ŷ.
  44. [Network diagram, with a legend reused on the following slides] xi: input value; ŷ: output vector; +1: bias (constant) unit; a(l): activation vector for layer l; W(l): weight matrix for layer l; z(l): input into layer l; σ: sigmoid (logistic) function; SM: Softmax function
  45-56. [The same diagram, highlighting one component per slide: Layer 1, Layer 2, Layer 3, Layer 4, the biases (constant units), the input, the weight matrices, the layer inputs z(l) = W(l-1) a(l-1) + b(l-1), the activation vectors, the sigmoid activation function, the Softmax activation function, and the output.]
  57. Forward Propagation: the input vector is passed into the network.
  58. Forward Propagation: the input is multiplied with the W(1) weight matrix and added with the layer-1 biases to calculate z(2) = W(1) x + b(1).
  59. Forward Propagation: the activation value for the second layer is calculated by passing z(2) into some function, in this case the sigmoid: a(2) = σ(z(2)).
  60. Forward Propagation: z(3) is calculated by multiplying the a(2) vector with the W(2) weight matrix and adding the layer-2 biases: z(3) = W(2) a(2) + b(2).
  61. Forward Propagation: similar to the previous layer, a(3) is calculated by passing z(3) into the sigmoid function: a(3) = σ(z(3)).
  62. Forward Propagation: z(4) is calculated by multiplying the a(3) vector with the W(3) weight matrix and adding the layer-3 biases: z(4) = W(3) a(3) + b(3).
  63. Forward Propagation: for the final layer, we calculate a(4) by passing z(4) into the Softmax function: a(4) = SM(z(4)).
  64. Forward Propagation: we then make our prediction based on the final layer’s output, ŷ = a(4).
  65. Page of Math: z(2) = W(1) x + b(1); z(3) = W(2) a(2) + b(2); z(4) = W(3) a(3) + b(3); a(2) = σ(z(2)); a(3) = σ(z(3)); a(4) = ŷ = SM(z(4))
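The Page of Math as a NumPy sketch. The layer sizes match the diagram (two inputs, two hidden layers of three sigmoid units, three Softmax outputs); the input values and weight initialization are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 3]                   # input, hidden 1, hidden 2, output
W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

x = np.array([0.5, -1.2])              # illustrative input

z2 = W[0] @ x + b[0];  a2 = sigmoid(z2)      # z(2) = W(1) x + b(1)
z3 = W[1] @ a2 + b[1]; a3 = sigmoid(z3)      # z(3) = W(2) a(2) + b(2)
z4 = W[2] @ a3 + b[2]; y_hat = softmax(z4)   # a(4) = ŷ = SM(z(4))
```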
  66. Goal: find which direction to shift the weights. How: find the partial derivatives of the cost with respect to the weight matrices. How (again): chain rule the sh*t out of this mofo
  67. DANGER: MATH
  68. Chain Rule Reminder: if y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx)
  69. Chain Rule Reminder: equivalently, (f ∘ g)′(x) = f′(g(x)) · g′(x)
  70-79. Chain rule example [worked step by step across ten slides: take a composite function and find its derivative with respect to x by first splitting it into two simpler functions, then getting the derivative of each component, and finally multiplying the pieces back together per the chain rule.]
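The concrete function in the slides' example did not survive extraction, so here is a representative stand-in worked the same way (split, differentiate the components, multiply):

```latex
\[
  y = \sigma(3x)
  \quad\longrightarrow\quad
  y = \sigma(u), \qquad u = 3x
\]
\[
  \frac{dy}{du} = \sigma(u)\bigl(1 - \sigma(u)\bigr),
  \qquad
  \frac{du}{dx} = 3
\]
\[
  \frac{dy}{dx}
  = \frac{dy}{du} \cdot \frac{du}{dx}
  = 3\,\sigma(3x)\bigl(1 - \sigma(3x)\bigr)
\]
```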
  80-85. DEEPER [the example is extended to a deeper composition: the target derivative is written as a product of partial derivatives, one per link in the chain, with adjacent ∂ terms appearing to “cancel out”. NOTE: “Cancelling out” isn’t how the math actually works. But it’s a handy way to think about it.]
  86. Back Prop. Back to backpropagation. Want: the partial derivatives of the cost with respect to each weight matrix, ∂J/∂W(l)
  87. Return of Page of Math: z(2) = W(1) x + b(1); z(3) = W(2) a(2) + b(2); z(4) = W(3) a(3) + b(3); a(2) = σ(z(2)); a(3) = σ(z(3)); a(4) = ŷ = SM(z(4))
  88. Partials, step by step: a(4) = ŷ = SM(z(4)). With cross-entropy loss, the Softmax derivative simplifies to ∂J/∂z(4) = ŷ − y
  89-98. Back Propagation / Partials, step by step [alternating diagram and derivation slides: starting from ∂J/∂z(4) at the cost, the chain rule walks backward through the network one factor at a time: through W(3) to get ∂J/∂W(3), through the layer-3 sigmoid to get ∂J/∂z(3), through W(2) to get ∂J/∂W(2), and so on back to ∂J/∂W(1).]
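Continuing the forward-pass sketch from slide 65, a NumPy sketch of the gradients these slides derive, assuming the standard Softmax-plus-cross-entropy pairing; the one-hot target is my illustration, not the deck's code:

```python
import numpy as np

# Reuses x, a2, a3, y_hat, and W from the forward-pass sketch above.
y = np.zeros(3); y[1] = 1.0            # illustrative one-hot target

d4 = y_hat - y                         # ∂J/∂z(4): Softmax + cross-entropy
dW3 = np.outer(d4, a3)                 # ∂J/∂W(3)
db3 = d4                               # ∂J/∂b(3)

d3 = (W[2].T @ d4) * a3 * (1 - a3)     # chain through W(3), then σ'(z(3))
dW2 = np.outer(d3, a2)                 # ∂J/∂W(2)
db2 = d3

d2 = (W[1].T @ d3) * a2 * (1 - a2)     # chain through W(2), then σ'(z(2))
dW1 = np.outer(d2, x)                  # ∂J/∂W(1)
db1 = d2
```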
  99. As programmers... How do we NOT do this ourselves? We’re lazy by trade.
  100. 3. Automatic Differentiation: Bringing sexy lazy back
  101. Why not hard code? ▣ Want to iterate fast! ▣ Want flexibility ▣ Want to reuse our code!
  102. Auto-Differentiation: Idea ▣ Use functions that have easy-to-compute derivatives ▣ Compose these functions to create a more complex super-model ▣ Use the chain rule to get the partial derivatives of the model
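A toy reverse-mode autodiff sketch in the spirit of these slides: each primitive knows its own local derivative, and the chain rule composes them on the backward pass. This is my illustration, not TensorFlow's implementation; real systems traverse the graph once in topological order rather than recursing along every path:

```python
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value        # computed on the forward pass
        self.parents = parents    # (parent_var, local_derivative) pairs
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Chain rule: push upstream * local derivative into each parent
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(a, b):
    # d(a+b)/da = 1, d(a+b)/db = 1
    return Var(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    # d(ab)/da = b, d(ab)/db = a
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Var(s, ((a, s * (1.0 - s)),))   # σ' = σ(1-σ): reuses the activation

# y = σ(w*x + b); the partial ∂y/∂w lands in w.grad
w, x, b = Var(2.0), Var(0.5), Var(-1.0)
y = sigmoid(add(mul(w, x), b))
y.backward()
print(w.grad)   # σ'(w*x + b) * x = 0.25 * 0.5 = 0.125
```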
  103. What makes a “good” function? ▣ Obvious stuff: differentiable (continuously and smoothly!) ▣ Simple operations: add, subtract, multiply ▣ Reuses previous computation
  104. Nice functions: sigmoid. σ(x) = 1 / (1 + exp(−x))
  105. Nice functions: sigmoid. σ′(x) = σ(x)(1 − σ(x))
  106. Nice functions: hyperbolic tangent. tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
  107. Nice functions: hyperbolic tangent. tanh′(x) = 1 − tanh²(x)
  108. Nice functions: rectified linear unit. ReLU(x) = max(0, x)
  109. Nice functions: rectified linear unit. ReLU′(x) = 1 if x > 0, else 0
  110. Nice functions: addition. f(x, y) = x + y
  111. Nice functions: addition. ∂f/∂x = ∂f/∂y = 1
  112. Nice functions: multiplication. f(x, y) = xy, so ∂f/∂x = y and ∂f/∂y = x
  113. Good news: most of these derivatives reuse the activation values we already computed, so we can store them in a cache!
  114. [Network diagram] Store activation values for backprop
  115. [Network diagram] Chain rule takes care of the rest
  116. It’s Over! Any questions? Email: sam@memdump.io; GitHub: samjabrahams; Twitter: @sabraha. Presentation template by SlidesCarnival.
  117. Neural Network terms ▣ Neuron: a unit that transforms input via an activation function and outputs the result to other neurons and/or the final result ▣ Activation function: a(l), a transformation function, typically non-linear; e.g. sigmoid, ReLU ▣ Bias unit: a trainable scalar shift, typically applied to each non-output layer (think of the y-intercept term in the linear function) ▣ Layer: a grouping of “neurons” and biases that (in general) take in values from the same previous neurons and pass values forward to the same targets ▣ Hidden layer: a layer that is neither the input layer nor the output layer ▣ Input layer: the first layer, which receives the raw input values ▣ Output layer: the final layer, whose activations form the network’s prediction
  118. Terminology used ▣ Learning rate ▣ Parameters ▣ Training step ▣ Training example ▣ Epoch vs. training time
