Backprop,
Gradient Descent,
And Auto Differentiation
Sam Abrahams, Memdump LLC
Link to these slides: https://goo.gl/tKOvr7
YO!
I am Sam Abrahams
I am a data scientist and engineer.
You can find me on GitHub @samjabrahams
Buy my book:
TensorFlow for Machine Intelligence
1.
Gradient
Descent
Guess and Check
for Adults
Gradient Descent Outline
▣ Problem: fit data
▣ Basic OLS linear regression
▣ Visualize error curve and regression line
▣ Step by step through changes
Scatter Plot
Simple Start: Linear Regression
Ordinary Least Squares Linear Regression
Simple Start: Linear Regression
▣ Want to find a model that can fit our data
▣ Could do it algebraically…
▣ BUT that doesn’t generalize well
Simple Start: Linear Regression
▣ Step back: what does ordinary linear regression
try to do?
▣ Minimize the sum of (or average) squared error
▣ How else could we minimize?
Gradient Descent
▣ Start with a random guess
▣ Use the derivative (gradient when dealing with
multiple variables) to get the slope of the error
curve
▣ Move our parameters to move down the error
curve
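The loop those bullets describe can be sketched in plain Python for one-variable least-squares regression (an illustrative sketch; the function name and hyperparameters are my own, not from the slides):

```python
def gradient_descent(xs, ys, lr=0.05, steps=500):
    """Fit y ≈ w*x + b by walking down the squared-error curve."""
    w, b = 0.0, 0.0                                  # initial guess
    n = len(xs)
    for _ in range(steps):
        # Derivatives of J = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move the parameters *down* the error curve
        w -= lr * dw
        b -= lr * db
    return w, b
```

On noiseless data generated by y = 2x + 1 this recovers the slope and intercept closely.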
Single Variable Cost Curve
[Figure series: cost J plotted against a single weight W. A random guess puts us somewhere on the curve; the slope ∂J/∂W at that point is negative, so we move W to the right. Repeating the measure-the-slope-then-step process walks the parameters down the error curve toward the minimum.]
1.5
Gradient
Descent Variants
Intelligent descent
into madness
Gradient Descent Variants
▣ There are additional techniques that can help
speed up (or otherwise improve) gradient
descent
▣ The next slides describe some of these!
▣ More details (and some awesome visuals)
here: article by Sebastian Ruder
Gradient Descent
▣ Get true gradient with respect to all examples
▣ One step = one epoch
▣ Slow and generally infeasible for large training sets
Stochastic Gradient Descent
▣ Basic idea: approximate the derivative using only one example
▣ “Online learning”
▣ Update weights after each example
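A hedged sketch of the per-example update on the same toy regression problem (variable names are illustrative):

```python
import random

def sgd(xs, ys, lr=0.02, epochs=200, seed=0):
    """Stochastic gradient descent for y ≈ w*x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                     # visit examples in random order
        for i in idx:                        # update weights after *each* example
            err = (w * xs[i] + b) - ys[i]    # derivative from one example only
            w -= lr * 2 * err * xs[i]
            b -= lr * 2 * err
    return w, b
```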
Mini-Batch Gradient Descent
▣ Similar idea to stochastic gradient descent
▣ Approximate the derivative with a sample batch of examples
▣ Middle ground between “true” stochastic gradient and full gradient descent
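The same toy problem with the gradient averaged over small batches (a minimal sketch; names and hyperparameters are my own):

```python
def minibatch_gd(xs, ys, lr=0.05, batch_size=2, epochs=200):
    """Mini-batch gradient descent for y ≈ w*x + b."""
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Approximate the full gradient with the batch average
            dw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            db = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * dw
            b -= lr * db
    return w, b
```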
Momentum
▣ Idea: if successive gradients keep pointing in the same direction, we should take bigger steps in that direction
▣ Accumulate a “momentum” vector to speed up descent
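The accumulated-velocity update can be sketched on a one-dimensional cost (my own minimal example, minimizing J(w) = w²):

```python
def minimize_with_momentum(grad_fn, w=5.0, lr=0.05, beta=0.9, steps=300):
    v = 0.0                      # accumulated "momentum" vector
    for _ in range(steps):
        # Gradients that keep the same sign build up speed;
        # sign flips damp the accumulated velocity.
        v = beta * v + lr * grad_fn(w)
        w -= v
    return w
```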
[Animations: descent trajectories without momentum vs. with momentum.]
Nesterov Momentum
▣ Idea: before updating our weights, look ahead
to where we have accumulated momentum
▣ Adjust our update based on “future”
Nesterov Momentum
Source: Lecture by Geoffrey Hinton
[Figure: standard momentum steps vs. Nesterov steps, showing the momentum vector and the gradient/correction applied at each update.]
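The look-ahead trick changes only one line relative to standard momentum (again a hedged sketch on J(w) = w²):

```python
def minimize_with_nesterov(grad_fn, w=5.0, lr=0.05, beta=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        lookahead = w - beta * v                  # peek where momentum will carry us
        v = beta * v + lr * grad_fn(lookahead)    # correct using the "future" slope
        w -= v
    return w
```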
AdaGrad
▣ Idea: update individual weights differently
depending on how frequently they change
▣ Keeps a running tally of previous updates for
each weight, and divides new updates by a
factor of the previous updates
▣ Downside: the tally only ever grows, so for long-running training the effective learning rate eventually shrinks toward zero
▣ Paper on jmlr.org
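A per-weight sketch of the update (illustrative names; the real algorithm applies this element-wise to every weight):

```python
import math

def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    accum += grad * grad                       # running tally of squared gradients
    # Divide each new update by (the root of) that history, so weights
    # that change frequently get smaller and smaller steps.
    w -= lr * grad / (math.sqrt(accum) + eps)
    return w, accum
```

Note how the denominator only grows, which is exactly the diminishing-gradient downside mentioned above.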
AdaDelta / RMSProp
▣ Two slightly different algorithms with same
concept: only keep a window of the previous
n gradients when scaling updates
▣ Seeks to reduce diminishing gradient problem
with AdaGrad
▣ AdaDelta Paper on arxiv.org
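The RMSProp flavor of the idea, as a hedged per-weight sketch:

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.1, decay=0.9, eps=1e-8):
    # Exponentially decayed average of squared gradients: effectively a
    # sliding window over recent gradients, so the scaling factor can
    # shrink again (unlike AdaGrad's ever-growing tally).
    avg_sq = decay * avg_sq + (1 - decay) * grad * grad
    w -= lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq
```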
Adam
▣ Adam expands on the concepts introduced
with AdaDelta and RMSProp
▣ Uses estimates of both the first moment (mean) and second moment (uncentered variance) of the gradients, decayed over time
▣ Paper on arxiv.org
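A per-weight sketch of the Adam update, including the bias correction for the zero-initialized moment estimates (hyperparameter names follow common convention; the example is my own):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # decayed first moment (mean)
    v = b2 * v + (1 - b2) * grad * grad      # decayed second moment
    m_hat = m / (1 - b1 ** t)                # bias-correct: both estimates
    v_hat = v / (1 - b2 ** t)                # start at zero
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```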
2.
Forward & Back
Propagation
The Chain Rule
got the last laugh,
high-school-you
Beyond OLS Regression
▣ Can’t do everything with linear regression!
▣ Nor polynomial…
▣ Why can’t we let the computer figure out how to model the data?
Neural Networks: Idea
▣ Chain together non-linear functions
▣ Have lots of parameters that can be adjusted
▣ These “weights” determine the model function
Feed forward neural network
[Diagram: a four-layer network. The input layer l(1) holds x1, x2 and a bias unit; hidden layers l(2) and l(3) each hold three sigmoid units and a bias unit; the output layer l(4) holds three softmax units producing ŷ. Weight matrices W(1), W(2), W(3) connect consecutive layers, yielding activations a(2), a(3), a(4).]
Legend:
xi: input value
ŷ: output vector
+1: bias (constant) unit
a(l): activation vector for layer l
W(l): weight matrix for layer l
z(l): input into layer l
σ: sigmoid (logistic) function
SM: Softmax function
The next slides repeat this diagram, highlighting each piece in turn: Layer 1 (input), Layer 2 and Layer 3 (hidden), Layer 4 (output), the bias units, the input values x1 and x2, the weight matrices W(1) through W(3), the layer inputs z(l) = W(l-1)a(l-1) + b(l-1), the activation vectors a(l), the sigmoid activation function, the Softmax activation function, and the output ŷ.
Forward Propagation
The input vector is passed into the network.
The input is multiplied with the W(1) weight matrix and added with the layer 1 biases to calculate z(2):
z(2) = W(1)x + b(1)
The activation value for the second layer is calculated by passing z(2) into some function, in this case the sigmoid function:
a(2) = σ(z(2))
z(3) is calculated by multiplying the a(2) vector with the W(2) weight matrix and adding the layer 2 biases:
z(3) = W(2)a(2) + b(2)
Similar to the previous layer, a(3) is calculated by passing z(3) into the sigmoid function:
a(3) = σ(z(3))
z(4) is calculated by multiplying the a(3) vector with the W(3) weight matrix and adding the layer 3 biases:
z(4) = W(3)a(3) + b(3)
For the final layer, we calculate a(4) by passing z(4) into the Softmax function:
a(4) = SM(z(4))
We then make our prediction ŷ based on the final layer’s output.
Page of Math
z(2) = W(1)x + b(1)      a(2) = σ(z(2))
z(3) = W(2)a(2) + b(2)    a(3) = σ(z(3))
z(4) = W(3)a(3) + b(3)    a(4) = ŷ = SM(z(4))
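The Page of Math translates almost line-for-line into code. Below is a pure-Python sketch (the weights would normally be trained; the shapes follow the diagram's 2-input, two 3-unit hidden layers, 3-output network):

```python
import math

def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def vadd(u, v):
    return [x + y for x, y in zip(u, v)]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-x)) for x in z]

def softmax(z):
    m = max(z)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W1, b1, W2, b2, W3, b3):
    z2 = vadd(matvec(W1, x), b1)        # z(2) = W(1)x + b(1)
    a2 = sigmoid(z2)                    # a(2) = σ(z(2))
    z3 = vadd(matvec(W2, a2), b2)       # z(3) = W(2)a(2) + b(2)
    a3 = sigmoid(z3)                    # a(3) = σ(z(3))
    z4 = vadd(matvec(W3, a3), b3)       # z(4) = W(3)a(3) + b(3)
    return softmax(z4)                  # a(4) = ŷ = SM(z(4))
```

Whatever the weights, the softmax output is a valid probability vector.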
Goal:
Find which direction to shift weights
How:
Find partial derivatives of the cost with
respect to weight matrices
How (again):
Chain rule the sh*t out of this mofo
DANGER:
MATH
Chain Rule Reminder
Chain rule example: find the derivative with respect to x. First split the expression into two composed functions, then take the derivative of each component, then multiply the pieces back together.
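The original worked example lives in the slide images; a stand-in with the same mechanics (my own example, not the slides'):

```latex
f(x) = (3x^2 + 1)^5, \quad g(u) = u^5, \quad h(x) = 3x^2 + 1 \\
g'(u) = 5u^4, \quad h'(x) = 6x \\
\frac{df}{dx} = g'(h(x))\, h'(x) = 5(3x^2 + 1)^4 \cdot 6x = 30x\,(3x^2 + 1)^4
```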
DEEPER
Want:
NOTE: “Cancelling out” isn’t how the math actually works. But it’s a handy way to think about it.
Back Prop
Back to backpropagation:
Want:
Return of Page of Math
z(2) = W(1)x + b(1)      a(2) = σ(z(2))
z(3) = W(2)a(2) + b(2)    a(3) = σ(z(3))
z(4) = W(3)a(3) + b(3)    a(4) = ŷ = SM(z(4))
Partials, step by step
a(4) = ŷ = SM(z(4))
With cross entropy loss:
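The slide's equation isn't reproduced in the text, but for softmax outputs paired with cross-entropy loss the partial derivative of the cost with respect to z(4) collapses to a famously tidy form (a standard result, stated here in the deck's notation):

```latex
J = -\sum_k y_k \log \hat{y}_k, \qquad \hat{y} = \mathrm{SM}(z^{(4)})
\;\Longrightarrow\;
\frac{\partial J}{\partial z^{(4)}} = \hat{y} - y
```

This is the error signal that then gets chained backwards through W(3), a(3), and the earlier layers.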
[Diagram: the same network, now with the cost attached after the output layer.]
Back Propagation
Want:
Partials, step by step
Back Propagation
Want:
Partials, step by step
Back Propagation
Want:
Partials, step by step
Back Propagation
Want:
Back Propagation
Want:
Back Propagation
Want:
Partials, step by step
As programmers...
How do we NOT
do this ourselves?
We’re lazy by trade.
3.
Automatic
Differentiation
Bringing sexy lazy
back
Why not hard code?
▣ Want to iterate fast!
▣ Want flexibility
▣ Want to reuse our code!
Auto-Differentiation: Idea
▣ Use functions that have easy-to-compute
derivatives
▣ Compose these functions to create a more complex super-model
▣ Use the chain rule to get partial derivatives of
the model
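A toy reverse-mode version of this idea, as a hedged Python sketch (the naive recursive `backward` is fine for this tiny example; real frameworks process nodes in topological order):

```python
class Var:
    """A toy autodiff node: tracks a value and, after backward(),
    the partial derivative of the final output w.r.t. itself."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents       # (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream * local gradient into each parent
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)
```

Each operation records only its easy local derivative; the chain rule composes them into partials of the whole model.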
What makes a “good” function?
▣ Obvious stuff: differentiable (continuously
and smoothly!)
▣ Simple operations: add, subtract, multiply
▣ Reuse previous computation
Nice functions: sigmoid
Nice functions: hyperbolic tangent
Nice functions: Rectified linear unit
Nice functions: Addition
Nice functions: Multiplication
Good news:
Most of these use activation
values! Can store in cache!
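For example (standard derivative identities, not reproduced from the slides), each derivative can be written in terms of the activation value the forward pass already computed, so nothing needs recomputing:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each derivative below takes the cached *activation* a, not the raw input.
def sigmoid_deriv(a):   # a = σ(x):   σ'(x) = σ(x) * (1 - σ(x))
    return a * (1.0 - a)

def tanh_deriv(a):      # a = tanh(x): tanh'(x) = 1 - tanh(x)^2
    return 1.0 - a * a

def relu_deriv(a):      # a = relu(x): derivative is 1 wherever the output > 0
    return 1.0 if a > 0 else 0.0
```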
Store activation values for backprop
Chain rule takes care of the rest
It’s Over!
Any questions?
Email: sam@memdump.io
GitHub: samjabrahams
Twitter: @sabraha
Presentation template by SlidesCarnival
Neural Network terms
▣ Neuron: a unit that transforms input via an activation function and outputs the result to
other neurons and/or the final result
▣ Activation function: a transformation function, typically non-linear, producing a(l). E.g. sigmoid, ReLU
▣ Bias unit: a trainable scalar shift, typically applied to each non-output layer (think
y-intercept term in the linear function)
▣ Layer: a grouping of “neurons” and biases that (in general) take in values from the same
previous neurons and pass values forwards to the same targets
▣ Hidden layer: A layer that is neither the input layer nor the output layer
▣ Input layer: the first layer, which receives the raw input values
▣ Output layer: the final layer, which produces the network’s prediction
Terminology used
▣ Learning rate
▣ Parameters
▣ Training step
▣ Training example
▣ Epoch vs training time
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016