Deep Learning From Scratch
www.data4sci.com · @bgoncalves
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.
• Optimization Problems have 3 distinct pieces:
• The constraints
• The function to optimize
• The optimization algorithm.
• These three pieces correspond, respectively, to the Problem Representation, the Prediction Error, and Gradient Descent, the topics covered below.
Linear Regression
• We are assuming that our functional dependence is of the form:

  f(\vec{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n \equiv X \vec{w}

• In other words, at each step, our hypothesis is:

  h_w(X) = X \vec{w} \equiv \hat{y}

and it imposes a Constraint on the solutions that can be found.
• We quantify how far our hypothesis is from the correct value using an Error Function:

  J_w(X, \vec{y}) = \frac{1}{2m} \sum_i \left[ h_w(x^{(i)}) - y^{(i)} \right]^2

or, vectorially:

  J_w(X, \vec{y}) = \frac{1}{2m} \left[ X \vec{w} - \vec{y} \right]^2

[Table: the design matrix X, with rows Sample 1 … Sample M and columns Feature 1 … Feature N, alongside the target value vector y]
Linear Regression

[Figure: scatter plot of the data with the fitted regression line]

Geometric Interpretation
Quadratic error means that an error twice as large is penalized four times as much.

  J_w(X, \vec{y}) = \frac{1}{2m} \left[ X \vec{w} - \vec{y} \right]^2
Gradient Descent
• Goal: Find the minimum of J_w(X, \vec{y}) by varying the components of \vec{w}
• Intuition: Follow the slope, -\frac{\partial}{\partial \vec{w}} J_w(X, \vec{y}), of the error function until convergence
• Algorithm:
• Guess \vec{w}^{(0)} (initial values of the parameters)
• Update until “convergence”:

  w_j = w_j - \alpha \frac{\partial}{\partial w_j} J_w(X, \vec{y})

where \alpha is the step size and the gradient is

  \nabla_w J_w(X, \vec{y}) = \frac{1}{m} X^T \cdot \left( h_w(X) - \vec{y} \right)

[Figure: the error surface J_w(X, \vec{y}) as a function of w, with the negative gradient pointing downhill toward the minimum]
Geometric Interpretation

[Figure: fitted line through the data points]

2D: y = w_0 + w_1 x_1
3D: y = w_0 + w_1 x_1 + w_2 x_2
nD: y = X \vec{w}

Add x_0 \equiv 1 to account for the intercept.
Gradient descent finds the hyperplane that splits the points in two such that the errors on each side balance out.
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
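As a minimal illustration of the update rule above (a sketch only, with invented names, not the notebook's actual code), gradient descent for linear regression in NumPy:

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iter=1000):
    # Assumes X already includes the bias column x0 = 1
    m, n = X.shape
    w = np.zeros(n)                  # initial guess w(0)
    for _ in range(n_iter):
        h = X @ w                    # hypothesis h_w(X) = Xw
        grad = X.T @ (h - y) / m     # (1/m) X^T (h_w(X) - y)
        w -= alpha * grad            # w <- w - alpha * gradient
    return w

# Usage: recover y = 1 + 2x from noiseless toy data
X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 1 + 2 * X[:, 1]
w = gradient_descent(X, y, alpha=0.5, n_iter=2000)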
Learning Procedure

[Diagram: the input X^T feeds the hypothesis h_w(X), producing the predicted output \hat{y}; the Error Function J_w(X, \vec{y}) compares it with the observed output y, and the Learning Algorithm adjusts the hypothesis subject to the Constraint]

Which we can redefine as h_w(X) = \phi(X^T w), and then rewrite with z = X^T w, so that the hypothesis becomes simply \phi(z).
Logistic Regression (Classification)
• Not actually regression, but rather Classification
• Predict the probability of an instance belonging to the given class: h_w(X) \in [0, 1], where the label is 1 if the instance is part of the class and 0 otherwise
• Use the sigmoid/logistic function to map the weighted inputs to [0, 1]:

  h_w(X) = \phi(X \vec{w})

where z = X \vec{w} encapsulates all the parameters and input values.
Geometric Interpretation

[Figure: the sigmoid curve; an instance is assigned to the class when \phi(z) \ge \frac{1}{2}, so training maximizes the value of z for members of the class]
Logistic Regression
• Error Function - Cross Entropy, which measures the “distance” between two probability distributions:

  J_w(X, \vec{y}) = -\frac{1}{m} \left[ y^T \log(h_w(X)) + (1 - y)^T \log(1 - h_w(X)) \right]

with the hypothesis

  h_w(X) = \frac{1}{1 + e^{-X \vec{w}}}

• Effectively treating the labels as probabilities (an instance with label = 1 has Probability 1 of belonging to the class).
• Gradient - same form as Linear Regression:

  w_j = w_j - \alpha \frac{\partial}{\partial w_j} J_w(X, \vec{y}), \qquad \nabla_w J_w(X, \vec{y}) = \frac{1}{m} X^T \cdot \left( h_w(X) - \vec{y} \right)
Iris dataset

[Figure: samples from the Iris dataset, features and class labels]
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
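In the same spirit, a hedged sketch (names invented, not the repository code): only the activation changes relative to linear regression, while the gradient step keeps the same form.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_iter=1000):
    # X includes the bias column; y holds 0/1 labels
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        h = sigmoid(X @ w)               # h_w(X) = phi(Xw)
        w -= alpha * X.T @ (h - y) / m   # same gradient form as linear regression
    return w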
Logistic Regression

[Figure: logistic regression fit and decision boundary on the Iris dataset]
Learning Procedure

[Diagram, as before: input X^T, intermediate z = X^T w, hypothesis \phi(z), predicted output \hat{y}, compared with the observed output y through the Error Function J_w(X, \vec{y}) under the Constraint]
Comparison
• Map features to a continuous variable (both cases): z = X \vec{w}
• Predict based on the continuous variable: h_w(X) = \phi(z), where
  Linear Regression: \phi(z) = z
  Logistic Regression: \phi(z) = \frac{1}{1 + e^{-z}}
• Compare prediction with reality:
  Linear Regression: J_w(X, \vec{y}) = \frac{1}{2m} \left[ h_w(X) - \vec{y} \right]^2
  Logistic Regression: J_w(X, \vec{y}) = -\frac{1}{m} \left[ y^T \log(h_w(X)) + (1 - y)^T \log(1 - h_w(X)) \right]
• The gradient has the same form in both cases:
  \frac{\partial}{\partial w_j} J_w(X, \vec{y}) = \frac{1}{m} X^T \cdot \left( h_w(X) - \vec{y} \right)
Learning Procedure

[Diagram: a perceptron with inputs x_1 … x_N, a bias input 1 with weight w_{0j}, weights w_{1j} … w_{Nj}, weighted sum z_j = w^T x, and activation function \phi(z)]
Generalized Perceptron

[Diagram: the same perceptron, with inputs, weights, bias, and activation function]

By changing the activation function, we change the underlying algorithm.
Activation Function
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract
representation of the data
• The simplest
  \phi(z) = z
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract
representation of the data
• The simplest
Linear Regression
  \phi(z) = z
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract
representation of the data
• Perhaps the most common
  \phi(z) = \frac{1}{1 + e^{-z}}
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract
representation of the data
• Perhaps the most common
  \phi(z) = \frac{1}{1 + e^{-z}}
Logistic Regression
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
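The steps above translate almost line for line into code; a minimal sketch (the helper name is assumed, not the notebook's actual function):

import numpy as np

def perceptron_forward(x, w, activation):
    x_ = np.concatenate(([1.0], x))  # obtain the inputs, prepending the bias x0 = 1
    z = np.dot(w, x_)                # multiply the inputs by the respective weights
    return activation(z)             # calculate output using the activation function

# Usage: a single sigmoid unit with 3 inputs and all weights set to 1
sigmoid = lambda z: 1 / (1 + np.exp(-z))
out = perceptron_forward(np.array([0.5, -1.0, 2.0]), np.ones(4), sigmoid)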
Activation Function - ReLU
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• Results in faster learning than with sigmoid

  \phi(z) = \max(0, z)
Activation Function - ReLU

  \phi(z) = \max(0, z)

(Same properties as above.) With ReLU activations, the resulting model corresponds to Stepwise Regression.
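For reference, the three activations seen so far, as one-line NumPy sketches:

import numpy as np

linear  = lambda z: z                     # phi(z) = z: linear regression
sigmoid = lambda z: 1 / (1 + np.exp(-z))  # phi(z) = 1/(1+e^-z): logistic regression
relu    = lambda z: np.maximum(0, z)      # phi(z) = max(0, z): stepwise regression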
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions B_i(x):

  \hat{f}(x) = \sum_i c_i B_i(x)

• The basis functions B_i(x) can be:
• Constant
• “Hinge” functions of the form \max(0, x - b) and \max(0, b - x)
• Products of hinges

https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
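A tiny sketch of a hinge basis and a hand-rolled MARS-style prediction (the helper and the coefficients here are invented for illustration):

import numpy as np

def hinge(x, b, direction=+1):
    # max(0, x - b) when direction=+1, max(0, b - x) when direction=-1
    return np.maximum(0, direction * (x - b))

# Toy model: constant plus two hinges meeting at the knot b = 0.5
x = np.linspace(0, 1, 5)
y_hat = 1.0 + 2.0 * hinge(x, 0.5) - 0.5 * hinge(x, 0.5, direction=-1)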
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions:

  \hat{f}(x) = \sum_i c_i B_i(x)

  y(x) = 1.013 + 1.198 \max(0, x - 0.485) - 1.803 \max(0, 0.485 - x) - 1.321 \max(0, x - 0.283) - 1.609 \max(0, x - 0.640) + 1.591 \max(0, x - 0.907)

[Figure: the resulting piecewise-linear fit]

https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one (see the sketch below).
• But how can we propagate back the errors and update the weights?

[Diagram: two stacked layers; the inputs x_1 … x_N produce hidden activations a_j = \phi(w^T x), which, together with a bias unit, feed a second layer producing a_k = \phi(w^T a)]
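Chaining layers is then just function composition. A sketch under the same assumptions as before (helper names invented), vectorized over a whole layer:

import numpy as np

def layer_forward(a_in, W, activation):
    # a_in has shape (n_in,), W has shape (n_out, n_in + 1) to include the bias
    a_ = np.concatenate(([1.0], a_in))  # prepend the bias unit
    return activation(W @ a_)           # phi(W a)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
x = np.array([0.2, 0.7, -0.4])
a1 = layer_forward(x, np.random.randn(5, 4), sigmoid)    # hidden layer
a2 = layer_forward(a1, np.random.randn(2, 6), sigmoid)   # output layer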
Stepwise Regression

[Figure: the same MARS fit, annotated to show that each hinge term is a ReLU unit and the constant term is a Linear unit, so the piecewise-linear fit is a linear combination of ReLUs]

  \hat{f}(x) = \sum_i c_i B_i(x)

https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
• Quadratic error function:

  E = \frac{1}{N} \sum_n \left| y_n - a_n \right|^2

• Cross Entropy:

  J = -\frac{1}{N} \sum_n \left[ y_n^T \log a_n + (1 - y_n)^T \log(1 - a_n) \right]

The Cross Entropy is complementary to sigmoid activation in the output layer and improves its stability.
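Both losses in NumPy, as a hedged sketch (the clipping constant is an assumption, added to keep log(0) from blowing up):

import numpy as np

def quadratic_loss(y, a):
    return np.mean((y - a) ** 2)

def cross_entropy_loss(y, a, eps=1e-12):
    a = np.clip(a, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))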
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.
• Two common choices:

  L2: \hat{J}_w(X) = J_w(X) + \lambda \sum_{ij} w_{ij}^2

  “Lasso” (L1): \hat{J}_w(X) = J_w(X) + \lambda \sum_{ij} \left| w_{ij} \right|

• Lasso helps with feature selection by driving less important weights to zero.
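The two penalties as functions of a weight matrix (a sketch; lam stands for the hyperparameter \lambda above):

import numpy as np

def l2_penalty(W, lam):
    return lam * np.sum(W ** 2)     # lambda * sum_ij w_ij^2

def l1_penalty(W, lam):
    return lam * np.sum(np.abs(W))  # lambda * sum_ij |w_ij| (Lasso)

# Regularized cost: J_hat = J + penalty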
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas
• Update the weights
• The error at the output is a weighted average difference between predicted output and the
observed one.
• For inner layers there is no “real output”!
BackProp
• Let \delta^{(l)} be the error at each of the L total layers.
• Then, at the output layer:

  \delta^{(L)} = h_w(X) - y

• And for every other layer, in reverse order (\odot is the element-wise product and \phi' the derivative of the activation):

  \delta^{(l)} = \left( W^{(l)T} \delta^{(l+1)} \right) \odot \phi'(z^{(l)})

• Until \delta^{(1)} \equiv 0, as there’s no error on the input layer.
• And finally:

  \Delta^{(l)}_{ij} = \Delta^{(l)}_{ij} + a^{(l)}_j \delta^{(l+1)}_i

  \frac{\partial}{\partial w^{(l)}_{ij}} J_w(X, \vec{y}) = \frac{1}{m} \Delta^{(l)}_{ij} + \lambda w^{(l)}_{ij}
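A hedged sketch of one such update for a single hidden layer, assuming sigmoid activations (so \phi'(z) = a(1 - a)); all names are invented, and the bias handling follows the forward code shown later:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def backprop_step(X, y, W1, W2, alpha=0.1, lam=0.0):
    # One gradient step for a 1-hidden-layer network; X and y are (m, ...) arrays
    m = X.shape[0]
    X_ = np.c_[np.ones(m), X]              # bias column on the input
    a1 = sigmoid(X_ @ W1.T)                # hidden activations
    a1_ = np.c_[np.ones(m), a1]            # bias unit on the hidden layer
    a2 = sigmoid(a1_ @ W2.T)               # output activations
    d2 = a2 - y                            # delta at the output layer
    d1 = (d2 @ W2)[:, 1:] * a1 * (1 - a1)  # delta at the hidden layer (bias dropped)
    G2 = d2.T @ a1_ / m + lam * W2         # gradients, with L2 regularization
    G1 = d1.T @ X_ / m + lam * W1
    return W1 - alpha * G1, W2 - alpha * G2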
A practical example - MNIST

[Figure: handwritten digit images (e.g., an 8 and a 2), flattened into a data matrix of Samples × Features with a Label column]

yann.lecun.com/exdb/mnist/
A practical example - MNIST

[Figure: the data matrix X (Sample 1 … Sample M by Feature 1 … Feature N) and the label vector y feeding the network X → Θ1 → Θ2 → arg max]

3 layers: 1 input layer, 1 hidden layer, 1 output layer.

yann.lecun.com/exdb/mnist/
A practical example - MNIST

Network: 784 inputs → 50 hidden units → 10 outputs (X → Θ1 → Θ2 → arg max), trained on 5000 examples. Including the bias columns, Θ1 is 50 × 785 and Θ2 is 10 × 51. Forward propagation turns matrices of inputs into matrices of activations; backward propagation updates the weight matrices.

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def forward(Theta, X, active):
    N = X.shape[0]
    # Add the bias column
    X_ = np.concatenate((np.ones((N, 1)), X), 1)
    # Multiply by the weights
    z = np.dot(X_, Theta.T)
    # Apply the activation function
    a = active(z)
    return a

def predict(Theta1, Theta2, X):
    h1 = forward(Theta1, X, sigmoid)   # hidden layer activations
    h2 = forward(Theta2, h1, sigmoid)  # output layer activations
    return np.argmax(h2, 1)            # most likely digit for each sample
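For instance, with randomly initialized weights of the shapes stated above (purely illustrative, not a trained model):

Theta1 = np.random.randn(50, 785) * 0.01  # hidden layer weights (with bias column)
Theta2 = np.random.randn(10, 51) * 0.01   # output layer weights (with bias column)
X = np.random.rand(5, 784)                # five fake "images"
labels = predict(Theta1, Theta2, X)       # array of 5 predicted digits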
Code - Simple Network
https://github.com/DataForScience/DeepLearning