Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016
The document discusses gradient descent and its variants, focusing on how to optimize models through iterative adjustments of parameters using derivatives. It explains the concepts of forward and backpropagation in neural networks, highlighting the importance of applying the chain rule for calculating gradients. Additionally, it provides an overview of various optimization algorithms like stochastic gradient descent, momentum methods, and adaptive learning rates used in training neural networks.
Overview of backpropagation, gradient descent, and auto-differentiation by Sam Abrahams.
Introduction to Gradient Descent: fitting data using OLS, visualizing error curves, and an iterative approach to minimize error.
Single Variable Cost Curve: understanding the relationship between parameters, cost functions, and optimization through movement along the error curve.
Exploration of variants of Gradient Descent including Stochastic, Mini-Batch, Momentum, and advanced methods like AdaGrad, RMSProp, and Adam.
Concept of Neural Networks: chaining non-linear functions, adjustable parameters (weights), and architecture representation.
Details on various neural network layers including activation functions such as sigmoid and softmax, and their roles in computation.
The step-by-step process of forward propagation in a neural network from input to output layer calculations.
Detailed explanation of backpropagation, focusing on the chain rule and the computation of derivatives in neural networks for training.
Introduction to automatic differentiation, its benefits for coding efficiency, and the use of functions that are easy to differentiate.
Final slide summarizing the presentation and providing contact information with a glossary of neural network terminology.
Simple Start: Linear Regression
▣ Want to find a model that can fit our data
▣ Could do it algebraically…
▣ BUT that doesn’t generalize well
Simple Start: Linear Regression
▣ Step back: what does ordinary linear regression
try to do?
▣ Minimize the sum of (or average) squared error
▣ How else could we minimize?
Gradient Descent
▣ Start with a random guess
▣ Use the derivative (gradient when dealing with
multiple variables) to get the slope of the error
curve
▣ Move our parameters to move down the error
curve
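The three steps above (guess, differentiate, move downhill) can be sketched in a few lines of NumPy. This is a minimal illustration on a toy linear-regression problem, not code from the talk; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

# Toy, noise-free data following y = 2x + 1.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0

w, b = 0.0, 0.0          # starting guess for the parameters
lr = 0.05                # learning rate (step size)

for _ in range(2000):
    err = (w * X + b) - y
    # Partial derivatives of the mean squared error w.r.t. w and b.
    grad_w = 2.0 * np.mean(err * X)
    grad_b = 2.0 * np.mean(err)
    # Move the parameters *down* the error curve.
    w -= lr * grad_w
    b -= lr * grad_b
```

After enough iterations, `w` and `b` approach the true values 2 and 1.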
Gradient Descent Variants
▣ There are additional techniques that can help
speed up (or otherwise improve) gradient
descent
▣ The next slides describe some of these!
▣ More details (and some awesome visuals)
here: article by Sebastian Ruder
Batch Gradient Descent
▣ Get the true gradient with respect to all examples
▣ One step = one epoch
▣ Slow and generally infeasible for large training
sets
Mini-Batch Gradient Descent
▣ Similar idea to stochastic gradient descent
▣Approximate derivative with a sample batch of
examples
▣ Middle ground between “true” stochastic
gradient and full gradient descent
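A minimal sketch of mini-batch gradient descent on a toy linear model, assuming plain NumPy. The batch size, learning rate, and data are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = 3x - 2.
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X - 2.0

w, b = 0.0, 0.0
lr, batch_size = 0.1, 16

for epoch in range(200):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        err = (w * xb + b) - yb
        # Gradient estimated from the batch only -- far cheaper per step
        # than the full-data gradient, less noisy than a single example.
        w -= lr * 2.0 * np.mean(err * xb)
        b -= lr * 2.0 * np.mean(err)
```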
Momentum
▣ Idea: if we see multiple gradients in a row pointing
in the same direction, we should increase our
learning rate
▣ Accumulate a “momentum” vector to speed up
descent
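The momentum update can be sketched as follows (a 1-D quadratic stands in for the error curve; the coefficients are illustrative, not from the talk):

```python
# Momentum sketch on f(w) = w**2, whose gradient is 2w.
# The velocity accumulates past gradients, so a run of gradients in the
# same direction builds up speed instead of taking fixed-size steps.
def momentum_descent(w=5.0, lr=0.01, mu=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        v = mu * v - lr * grad   # accumulate the "momentum" vector
        w = w + v                # move by the velocity, not the raw gradient
    return w
```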
AdaGrad
▣ Idea: update individual weights differently
depending on how frequently they change
▣ Keeps a running tally of previous squared
gradients for each weight, and divides new
updates by a factor of that tally
▣ Downside: for long-running training, the tally
keeps growing, so eventually all updates
diminish toward zero
▣ Paper on jmlr.org
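The per-weight scaling described above can be sketched like this (the test problem and hyperparameters are illustrative assumptions):

```python
import numpy as np

# AdaGrad sketch: per-parameter step sizes scaled down by the running
# sum of squared gradients, so frequently-updated weights take
# smaller and smaller steps.
def adagrad(grad_fn, w, lr=0.5, eps=1e-8, steps=500):
    cache = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        cache += g ** 2                          # running tally, never decays
        w = w - lr * g / (np.sqrt(cache) + eps)
    return w

# Quadratic bowl with very different curvature per dimension.
grad = lambda w: np.array([2.0 * w[0], 200.0 * w[1]])
w_final = adagrad(grad, np.array([3.0, 3.0]))
```

Because `cache` only grows, the effective learning rate shrinks forever — the downside noted above.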
AdaDelta / RMSProp
▣ Two slightly different algorithms with the same
concept: only keep a window of the previous
n gradients when scaling updates
▣ Seeks to reduce AdaGrad’s problem of
ever-diminishing updates
▣ AdaDelta Paper on arxiv.org
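In practice the "window" is implemented as an exponential moving average rather than a literal buffer of n gradients. A sketch of the RMSProp variant, with illustrative hyperparameters:

```python
import numpy as np

# RMSProp sketch: like AdaGrad, but the squared-gradient cache is an
# exponential moving average, so old gradients decay away instead of
# accumulating forever.
def rmsprop(grad_fn, w, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    cache = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        cache = decay * cache + (1.0 - decay) * g ** 2   # windowed average
        w = w - lr * g / (np.sqrt(cache) + eps)
    return w

grad = lambda w: 2.0 * w                  # f(w) = w**2
w_final = rmsprop(grad, np.array([3.0]))
```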
Adam
▣ Adam expands on the concepts introduced
with AdaDelta and RMSProp
▣ Uses both first-order and second-order
moments of the gradient, decayed over time
▣ Paper on arxiv.org
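A sketch of the Adam update following the paper's formulation (the test function and learning rate are illustrative choices):

```python
import numpy as np

# Adam sketch: decayed first moment (mean) and second moment
# (uncentered variance) of the gradients, with bias correction for
# the zero-initialized moment estimates.
def adam(grad_fn, w, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g          # first moment, decayed over time
        v = b2 * v + (1 - b2) * g ** 2     # second moment, decayed over time
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: 2.0 * w                   # f(w) = w**2
w_final = adam(grad, np.array([3.0]))
```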
Beyond OLS Regression
▣ Can’t do everything with linear regression!
▣ Nor polynomial…
▣ Why can’t we let the computer figure out how
to model?
Neural Networks: Idea
▣ Chain together non-linear functions
▣ Have lots of parameters that can be adjusted
▣ These “weights” determine the model function
Feed-forward neural network

[Network diagram: input layer l(1) with x1, x2 and a +1 bias unit; two hidden layers l(2) and l(3) of sigmoid (σ) units, each with a +1 bias unit; a softmax (SM) output layer l(4) producing ŷ; layers connected by weight matrices W(1), W(2), W(3) with activation vectors a(2), a(3), a(4).]

xi: input value
ŷ: output vector
+1: bias (constant) unit
a(l): activation vector for layer l
W(l): weight matrix for layer l
z(l): input into layer l, z(l) = W(l-1) a(l-1) + b(l-1)
σ: sigmoid (logistic) function
SM: Softmax function
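The forward pass implied by this notation — z(l) = W(l-1) a(l-1) + b(l-1), sigmoid hidden layers, softmax output — can be sketched as follows. The layer sizes and random weights are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

# Assumed layer sizes: 2 inputs, two hidden layers of 3 units, 3 outputs.
sizes = [2, 3, 3, 3]
W = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
b = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    a = x
    for l in range(len(W)):
        z = W[l] @ a + b[l]                          # z(l+1) = W(l) a(l) + b(l)
        a = softmax(z) if l == len(W) - 1 else sigmoid(z)
    return a                                         # ŷ

y_hat = forward(np.array([0.5, -0.3]))
```

The softmax output sums to 1, so ŷ can be read as class probabilities.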
Goal:
Find which direction to shift the weights
How:
Find partial derivatives of the cost with
respect to the weight matrices
How (again):
Chain rule the sh*t out of this mofo
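On a single sigmoid neuron the chain rule is easy to see end to end. A sketch with a squared-error cost, checked against a finite-difference derivative (all values here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Cost of one neuron: C = (sigmoid(w*x + b) - y)**2
# Chain rule: dC/dw = dC/da * da/dz * dz/dw
def grad_w(w, b, x, y):
    a = sigmoid(w * x + b)
    dC_da = 2.0 * (a - y)        # derivative of the cost w.r.t. the activation
    da_dz = a * (1.0 - a)        # sigmoid derivative
    dz_dw = x                    # derivative of z = w*x + b w.r.t. w
    return dC_da * da_dz * dz_dw

# Sanity check against a numerical (central-difference) derivative.
w, b, x, y = 0.7, -0.2, 1.5, 1.0
h = 1e-6
def cost(w_):
    return (sigmoid(w_ * x + b) - y) ** 2
numeric = (cost(w + h) - cost(w - h)) / (2 * h)
```

Backpropagation applies exactly this factorization layer by layer, reusing each layer's partials for the layers before it.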
Why not hardcode?
▣ Want to iterate fast!
▣ Want flexibility
▣ Want to reuse our code!
Auto-Differentiation: Idea
▣ Use functions that have easy-to-compute
derivatives
▣ Compose these functions to create a more
complex super-model
▣ Use the chain rule to get partial derivatives of
the model
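The idea can be made concrete with a tiny reverse-mode auto-diff class: each operation records its local derivatives, and the chain rule pushes gradients back through the composition. This is a naive sketch in the spirit of the technique, not the implementation TensorFlow actually uses:

```python
# Minimal reverse-mode auto-differentiation sketch. Each operation knows
# its own easy derivative; composing them gives derivatives of the whole
# model via the chain rule. (Naive recursion -- fine for small graphs.)
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents             # pairs of (input Var, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, grad=1.0):
        self.grad += grad                  # accumulate contributions per path
        for parent, local in self.parents:
            parent.backward(grad * local)  # chain rule

x = Var(3.0)
y = Var(4.0)
f = x * y + x          # f = x*y + x, so df/dx = y + 1 and df/dy = x
f.backward()
```

After `f.backward()`, `x.grad` is 5.0 and `y.grad` is 3.0 — no hand-coded derivative of `f` anywhere.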
What makes a “good” function?
▣ Obvious stuff: differentiable (continuously
and smoothly!)
▣ Simple operations: add, subtract, multiply
▣ Reuse previous computation
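The sigmoid is the classic example of "reuse previous computation": its derivative can be written entirely in terms of its own output, σ′(z) = σ(z)(1 − σ(z)), so the activation from the forward pass is reused in the backward pass. A quick check (the input value is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.3
a = sigmoid(z)                    # computed once in the forward pass
deriv_reused = a * (1.0 - a)      # backward pass reuses `a` -- no new exp()

# Sanity check against a finite difference.
h = 1e-6
deriv_numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```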
Neural Network terms
▣Neuron: a unit that transforms input via an activation function and outputs the result to
other neurons and/or the final result
▣ Activation function: a(l), a transformation function, typically non-linear (e.g. sigmoid, ReLU)
▣ Bias unit: a trainable scalar shift, typically applied to each non-output layer (think
y-intercept term in the linear function)
▣ Layer: a grouping of “neurons” and biases that (in general) take in values from the same
previous neurons and pass values forwards to the same targets
▣ Hidden layer: A layer that is neither the input layer nor the output layer
▣ Input layer: the first layer, which takes in the raw input values (xi)
▣ Output layer: the last layer, which produces the final output (ŷ)