www.data4sci.com
@bgoncalves
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.
• Optimization Problems have 3 distinct pieces, each covered in what follows:
• The constraints (Problem Representation)
• The function to optimize (Prediction Error)
• The optimization algorithm (Gradient Descent)
Linear Regression
• We are assuming that our functional dependence is of the form:
  f(x⃗) = w_0 + w_1 x_1 + ⋯ + w_n x_n ≡ X w⃗
• In other words, at each step, our hypothesis is:
  h_w(X) = X w⃗ ≡ ŷ
  and it imposes a Constraint on the solutions that can be found.
• We quantify how far our hypothesis is from the correct value using an Error Function:
  J_w(X, y⃗) = (1/2m) ∑_i [h_w(x^(i)) − y^(i)]²
  or, vectorially:
  J_w(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
[Figure: the data matrix X, with rows Sample 1 … Sample M and columns Feature 1 … Feature N, alongside the target value vector y]
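
A minimal NumPy sketch of the hypothesis and error function above (the names are illustrative, and X is assumed to already contain the bias column x_0 ≡ 1 introduced later in the deck):

import numpy as np

def hypothesis(X, w):
    # h_w(X) = X w
    return np.dot(X, w)

def cost(X, y, w):
    # J_w(X, y) = 1/(2m) * sum_i (h_w(x_i) - y_i)^2
    m = X.shape[0]
    residual = hypothesis(X, w) - y
    return np.sum(residual ** 2) / (2 * m)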
Linear Regression
Geometric Interpretation
• Quadratic error means that an error twice as large is penalized four times as much.
  J_w(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
Gradient Descent
• Goal: Find the minimum of J_w(X, y⃗) by varying the components of w⃗
• Intuition: Follow the slope −δ/δw⃗ J_w(X, y⃗) of the error function until convergence
• Algorithm:
• Guess w⃗^(0) (initial values of the parameters)
• Update until “convergence”:
  w_j = w_j − α δ/δw_j J_w(X, y⃗)
  where α is the step size and the gradient is:
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
[Figure: J_w(X, y⃗) plotted against w, with the negative gradient pointing downhill from the initial guess w⃗^(0)]
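
A sketch of the update loop above, using the gradient formula from this slide (the learning rate and number of iterations are illustrative choices, not values given in the deck):

import numpy as np

def gradient(X, y, w):
    # delta/delta w_j J_w(X, y) = 1/m * X^T (h_w(X) - y)
    m = X.shape[0]
    return np.dot(X.T, np.dot(X, w) - y) / m

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    # Guess w(0), then repeatedly step against the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - alpha * gradient(X, y, w)
    return w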
Geometric Interpretation
• 2D: y = w_0 + w_1 x_1
• 3D: y = w_0 + w_1 x_1 + w_2 x_2
• nD: y = X w⃗
• Add x_0 ≡ 1 to account for the intercept
• Finds the hyperplane that splits the points in two such that the errors on each side balance out
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
Learning Procedure
[Diagram: the input X^T feeds the hypothesis h_w(X) (our Constraint), producing the predicted output ŷ; the Error Function J_w(X, y⃗) compares ŷ with the observed output and drives the Learning Algorithm, which updates the hypothesis]
Learning Procedure
[Diagram: the same loop, with the hypothesis (which we can redefine) written as h_w(X) = φ(X^T w)]
Learning Procedure
[Diagram: and rewritten once more: the input X^T gives z = X^T w, the hypothesis becomes φ(z), and the rest of the loop (predicted output ŷ, observed output, Error Function J_w(X, y⃗), Learning Algorithm, Constraint) is unchanged]
Logistic Regression (Classification)
• Not actually regression, but rather Classification
• Predict the probability of an instance belonging to the given class: h_w(X) ∈ [0,1]
  (1 - part of the class, 0 - otherwise)
• Use the sigmoid/logistic function to map the weighted inputs to [0,1]:
  h_w(X) = φ(X w⃗)
  where z = X w⃗ encapsulates all the parameters and input values
Geometric Interpretation
• φ(z) ≥ 1/2: maximize the value of z for members of the class
Logistic Regression
• Hypothesis:
  h_w(X) = 1 / (1 + e^(−X w⃗))
• Error Function - Cross Entropy:
  J_w(X, y⃗) = −(1/m) [y^T log(h_w(X)) + (1 − y)^T log(1 − h_w(X))]
  measures the “distance” between two probability distributions
• Effectively treating the labels as probabilities (an instance with label = 1 has Probability 1 of belonging to the class).
• Gradient - same as Linear Regression:
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
  w_j = w_j − α δ/δw_j J_w(X, y⃗)
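
A sketch of the formulas above: the sigmoid hypothesis, the cross entropy cost, and the gradient shared with linear regression (the small eps inside the logarithms is only for numerical safety and is not part of the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(X, w):
    # h_w(X) = sigmoid(X w)
    return sigmoid(np.dot(X, w))

def cross_entropy(X, y, w, eps=1e-12):
    # J_w(X, y) = -1/m [y^T log h + (1 - y)^T log(1 - h)]
    m = X.shape[0]
    p = h(X, w)
    return -(np.dot(y, np.log(p + eps)) + np.dot(1 - y, np.log(1 - p + eps))) / m

def gradient(X, y, w):
    # Same form as for linear regression: 1/m * X^T (h_w(X) - y)
    return np.dot(X.T, h(X, w) - y) / X.shape[0]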
Iris dataset
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
Logistic Regression
Learning Procedure
[Diagram (recap): input X^T, z = X^T w, hypothesis φ(z), predicted output ŷ, observed output, Error Function J_w(X, y⃗), Learning Algorithm, Constraint]
Comparison
• Linear Regression:
  z = X w⃗
  φ(z) = z
  h_w(X) = φ(z)
  J_w(X, y⃗) = (1/2m) [h_w(X) − y⃗]²
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
• Logistic Regression:
  z = X w⃗
  φ(z) = 1 / (1 + e^(−z))
  h_w(X) = φ(z)
  J_w(X, y⃗) = −(1/m) [y^T log(h_w(X)) + (1 − y)^T log(1 − h_w(X))]
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
• In both cases we map the features to a continuous variable (z), predict based on that continuous variable (φ(z)), and compare the prediction with reality (J_w).
Learning Procedure
[Diagram: a single unit with inputs x_1 … x_N plus a constant bias input 1, weights w_0j … w_Nj, the weighted sum z_j = w^T x, and an activation function φ(z)]
Generalized Perceptron
[Diagram: the same unit as above: inputs, weights, bias, z_j = w^T x, and activation function φ(z)]
• By changing the activation function, we change the underlying algorithm (see the sketch below)
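
A sketch of such a generalized unit, with the activation function passed in as a parameter; the three activations anticipate the slides that follow, and all names are illustrative:

import numpy as np

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def unit(x, w, activation):
    # x already includes the constant bias input 1, and w includes w_0
    z = np.dot(w, x)
    return activation(z)

With activation=linear the unit performs a linear regression step; with sigmoid, a logistic regression one.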
Activation Function
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• The simplest: φ(z) = z (recovering Linear Regression)
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• Perhaps the most common: φ(z) = 1 / (1 + e^(−z)) (recovering Logistic Regression)
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate the output using the activation function (see the sketch below)
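
A sketch of these three steps for a single layer; it mirrors the forward() function shown at the end of the deck, leaving out the bias column for brevity:

import numpy as np

def layer_forward(X, W, activation):
    # 1. obtain the inputs X
    # 2. multiply the inputs by the respective weights
    z = np.dot(X, W.T)
    # 3. calculate the output using the activation function
    return activation(z)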
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
Activation Function - ReLU
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• Results in faster learning than with sigmoid
• φ(z) = max(0, z), i.e. φ(z) = z for z > 0 and 0 otherwise (recovering Stepwise Regression, next)
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions B_i(x):
  f̂(x) = ∑_i c_i B_i(x)
• The basis functions B_i(x) can be:
• Constant
• “Hinge” functions of the form: max(0, x − b) and max(0, b − x)
• Products of hinges
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
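
A sketch of the two hinge forms just listed and of summing weighted basis functions as in f̂(x) above (function names are illustrative):

import numpy as np

def hinge(x, b):
    # max(0, x - b)
    return np.maximum(0.0, x - b)

def mirror_hinge(x, b):
    # max(0, b - x)
    return np.maximum(0.0, b - x)

def mars_eval(x, coefficients, basis_functions):
    # f(x) = sum_i c_i * B_i(x)
    return sum(c * B(x) for c, B in zip(coefficients, basis_functions))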
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions:
  f̂(x) = ∑_i c_i B_i(x)
• Example fit:
  y(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one (see the sketch after the diagram below).
• But how can we propagate back the errors and update the weights?
[Diagram: two stacked units. The first takes inputs x_1 … x_N plus a bias input 1, with weights w_0j … w_Nj, computes w^T x, and outputs the activation a_j; the second takes a_1 … a_N plus a bias input 1, with weights w_0k … w_Nk, computes w^T a, and outputs a_k]
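
A sketch of that chaining, reusing the deck's own forward() function (defined again here so the snippet is self-contained; the layer sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta, X, active):
    # Same as the forward() function at the end of the deck:
    # add the bias column, multiply by the weights, apply the activation
    N = X.shape[0]
    X_ = np.concatenate((np.ones((N, 1)), X), 1)
    return active(np.dot(X_, Theta.T))

# The output of one layer is simply the input of the next one
X = np.random.rand(5, 4)        # 5 samples, 4 features (illustrative)
Theta1 = np.random.rand(3, 5)   # 3 hidden units, 4 inputs + bias
Theta2 = np.random.rand(2, 4)   # 2 outputs, 3 hidden units + bias
a1 = forward(Theta1, X, sigmoid)
a2 = forward(Theta2, a1, sigmoid)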
Stepwise Regression
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
  f̂(x) = ∑_i c_i B_i(x)
  y(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)
[Diagram: the same fit drawn as a small network, with the inputs 1 and x feeding three ReLU units whose outputs are combined by a Linear unit]
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
• Quadratic error function:
  E = (1/N) ∑_n |y_n − a_n|²
• Cross Entropy:
  J = −(1/N) ∑_n [y_n^T log a_n + (1 − y_n)^T log(1 − a_n)]
• The Cross Entropy is complementary to sigmoid activation in the output layer and improves its stability.
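
A sketch of the two losses above for a batch of network outputs a and targets y, both arrays of shape (N, outputs); the eps term is only for numerical safety:

import numpy as np

def quadratic_loss(y, a):
    # E = 1/N * sum_n |y_n - a_n|^2
    return np.sum((y - a) ** 2) / y.shape[0]

def cross_entropy_loss(y, a, eps=1e-12):
    # J = -1/N * sum_n [y_n^T log a_n + (1 - y_n)^T log(1 - a_n)]
    return -np.sum(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps)) / y.shape[0]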
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.
• Two common choices:
  L2: Ĵ_w(X) = J_w(X) + λ ∑_ij w_ij²
  “Lasso” (L1): Ĵ_w(X) = J_w(X) + λ ∑_ij |w_ij|
• Lasso helps with feature selection by driving less important weights to zero
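
A sketch of adding either penalty to a cost value already computed with one of the J_w functions above (λ is written as lam; names are illustrative):

import numpy as np

def l2_penalty(W, lam):
    # lambda * sum_ij w_ij^2
    return lam * np.sum(W ** 2)

def l1_penalty(W, lam):
    # "Lasso": lambda * sum_ij |w_ij|
    return lam * np.sum(np.abs(W))

def regularized_cost(cost, W, lam, penalty=l2_penalty):
    # J_hat = J + penalty
    return cost + penalty(W, lam)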
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas
• Update the weights
• The error at the output is a weighted average difference between the predicted output and the observed one.
• For inner layers there is no “real output”!
BackProp
• Let δ^(l) be the error at each of the L total layers
• Then, at the output:
  δ^(L) = h_w(X) − y
• And for every other layer, in reverse order:
  δ^(l) = (W^(l))^T δ^(l+1) .* φ′(z^(l))
• Until:
  δ^(1) ≡ 0
  as there’s no error on the input layer.
• And finally:
  Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
  ∂/∂w^(l)_ij J_w(X, y⃗) = (1/m) Δ^(l)_ij + λ w^(l)_ij
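
A sketch of these equations for a two-layer sigmoid network, following the weight shapes of the MNIST example later in the deck (for the sigmoid, φ′(z) = a(1 − a)); it illustrates the recursion only and is not the full implementation from the linked notebooks:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Prepend the constant input 1 to every sample
    return np.concatenate((np.ones((A.shape[0], 1)), A), 1)

def backprop(Theta1, Theta2, X, Y, lam=0.0):
    m = X.shape[0]
    # Forward pass, keeping every layer's activations
    a1 = sigmoid(np.dot(add_bias(X), Theta1.T))   # hidden activations
    a2 = sigmoid(np.dot(add_bias(a1), Theta2.T))  # network output h_w(X)
    # delta(L) = h_w(X) - y
    delta3 = a2 - Y
    # delta(l) = W^T delta(l+1) .* phi'(z(l)); drop Theta2's bias column
    delta2 = np.dot(delta3, Theta2[:, 1:]) * a1 * (1.0 - a1)
    # Accumulate the Deltas and form the gradients: 1/m * Delta + lam * Theta
    Grad2 = np.dot(delta3.T, add_bias(a1)) / m + lam * Theta2
    Grad1 = np.dot(delta2.T, add_bias(X)) / m + lam * Theta1
    return Grad1, Grad2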
A practical example - MNIST
[Figure: example digit images (an 8 and a 2) and the data matrix with rows Sample 1 … Sample N, columns Feature 1 … Feature M, plus a Label column]
yann.lecun.com/exdb/mnist/
A practical example - MNIST
yann.lecun.com/exdb/mnist/
[Figure: the data matrix X (Sample 1 … Sample M, Feature 1 … Feature N) with the Label vector y, and the network pipeline X → Θ1 → Θ2 → argmax]
• 3 layers: 1 input layer, 1 hidden layer, 1 output layer
A practical example - MNIST
• Layer sizes: 784 inputs, 50 hidden units, 10 outputs, trained on 5000 examples; argmax turns the 10 outputs into a single label
• Weight matrices (including the bias column): Θ1 is 50 × 785 (50 × 784 weights plus bias), Θ2 is 10 × 51 (10 × 50 weights plus bias)
[Diagram: X → Θ1 → Θ2 → argmax; the inputs are vectors, the weights are matrices; forward propagation flows left to right, backward propagation right to left]

import numpy as np

def sigmoid(z):
    # Logistic activation, defined earlier in the deck
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta, X, active):
    N = X.shape[0]
    # Add the bias column
    X_ = np.concatenate((np.ones((N, 1)), X), 1)
    # Multiply by the weights
    z = np.dot(X_, Theta.T)
    # Apply the activation function
    a = active(z)
    return a

def predict(Theta1, Theta2, X):
    h1 = forward(Theta1, X, sigmoid)
    h2 = forward(Theta2, h1, sigmoid)
    return np.argmax(h2, 1)
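
A hedged usage example for the functions above, with random inputs and randomly initialized weights standing in for the trained Θ1 and Θ2 (the actual training is done in the notebook linked below):

X_demo = np.random.rand(3, 784)                    # 3 fake "images"
Theta1_demo = np.random.randn(50, 785) * 0.01
Theta2_demo = np.random.randn(10, 51) * 0.01
print(predict(Theta1_demo, Theta2_demo, X_demo))   # three predicted digit labels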
Code - Simple Network
https://github.com/DataForScience/DeepLearning