www.data4sci.com
@bgoncalves
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.
• Optimization Problems have 3 distinct pieces, each covered in what follows:
• The constraints (Problem Representation)
• The function to optimize (Prediction Error)
• The optimization algorithm (Gradient Descent)
Linear Regression
• We are assuming that our functional dependence is of the form:
  f(x⃗) = w_0 + w_1 x_1 + ⋯ + w_n x_n ≡ X w⃗
• In other words, at each step, our hypothesis is:
  h_w(X) = X w⃗ ≡ ŷ
  and it imposes a Constraint on the solutions that can be found.
• We quantify how far our hypothesis is from the correct value using an Error Function:
  J_w(X, y⃗) = (1/2m) ∑_i [h_w(x^(i)) − y^(i)]²
  or, vectorially:
  J_w(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
[Figure: the data matrix X, with rows Sample 1 … Sample M and columns Feature 1 … Feature N, alongside the target value vector y]
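
A minimal NumPy sketch of the hypothesis and error function above (the names are illustrative, and X is assumed to already contain the bias column x_0 ≡ 1 introduced later in the deck):

import numpy as np

def hypothesis(X, w):
    # h_w(X) = X w
    return np.dot(X, w)

def cost(X, y, w):
    # J_w(X, y) = 1/(2m) * sum_i (h_w(x_i) - y_i)^2
    m = X.shape[0]
    residual = hypothesis(X, w) - y
    return np.sum(residual ** 2) / (2 * m)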
Linear Regression
Geometric Interpretation
• Quadratic error means that an error twice as large is penalized four times as much.
  J_w(X, y⃗) = (1/2m) [X w⃗ − y⃗]²
Gradient Descent
• Goal: Find the minimum of J_w(X, y⃗) by varying the components of w⃗
• Intuition: Follow the slope −δ/δw⃗ J_w(X, y⃗) of the error function until convergence
• Algorithm:
• Guess w⃗^(0) (initial values of the parameters)
• Update until “convergence”:
  w_j = w_j − α δ/δw_j J_w(X, y⃗)
  where α is the step size and the gradient is:
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
[Figure: J_w(X, y⃗) plotted against w, with the negative gradient pointing downhill from the initial guess w⃗^(0)]
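
A sketch of the update loop above, using the gradient formula from this slide (the learning rate and number of iterations are illustrative choices, not values given in the deck):

import numpy as np

def gradient(X, y, w):
    # delta/delta w_j J_w(X, y) = 1/m * X^T (h_w(X) - y)
    m = X.shape[0]
    return np.dot(X.T, np.dot(X, w) - y) / m

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    # Guess w(0), then repeatedly step against the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - alpha * gradient(X, y, w)
    return w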
Geometric Interpretation
• 2D: y = w_0 + w_1 x_1
• 3D: y = w_0 + w_1 x_1 + w_2 x_2
• nD: y = X w⃗
• Add x_0 ≡ 1 to account for the intercept
• Finds the hyperplane that splits the points in two such that the errors on each side balance out
Code - Linear Regression
https://github.com/DataForScience/DeepLearning
Learning Procedure
[Diagram: the input X^T feeds the hypothesis h_w(X) (our Constraint), producing the predicted output ŷ; the Error Function J_w(X, y⃗) compares ŷ with the observed output and drives the Learning Algorithm, which updates the hypothesis]
Learning Procedure
[Diagram: the same loop, with the hypothesis (which we can redefine) written as h_w(X) = φ(X^T w)]
Learning Procedure
[Diagram: and rewritten once more: the input X^T gives z = X^T w, the hypothesis becomes φ(z), and the rest of the loop (predicted output ŷ, observed output, Error Function J_w(X, y⃗), Learning Algorithm, Constraint) is unchanged]
Logistic Regression (Classification)
• Not actually regression, but rather Classification
• Predict the probability of an instance belonging to the given class: h_w(X) ∈ [0,1]
  (1 - part of the class, 0 - otherwise)
• Use the sigmoid/logistic function to map the weighted inputs to [0,1]:
  h_w(X) = φ(X w⃗)
  where z = X w⃗ encapsulates all the parameters and input values
Geometric Interpretation
• φ(z) ≥ 1/2: maximize the value of z for members of the class
Logistic Regression
• Hypothesis:
  h_w(X) = 1 / (1 + e^(−X w⃗))
• Error Function - Cross Entropy:
  J_w(X, y⃗) = −(1/m) [y^T log(h_w(X)) + (1 − y)^T log(1 − h_w(X))]
  measures the “distance” between two probability distributions
• Effectively treating the labels as probabilities (an instance with label = 1 has Probability 1 of belonging to the class).
• Gradient - same as Linear Regression:
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
  w_j = w_j − α δ/δw_j J_w(X, y⃗)
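
A sketch of the formulas above: the sigmoid hypothesis, the cross entropy cost, and the gradient shared with linear regression (the small eps inside the logarithms is only for numerical safety and is not part of the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(X, w):
    # h_w(X) = sigmoid(X w)
    return sigmoid(np.dot(X, w))

def cross_entropy(X, y, w, eps=1e-12):
    # J_w(X, y) = -1/m [y^T log h + (1 - y)^T log(1 - h)]
    m = X.shape[0]
    p = h(X, w)
    return -(np.dot(y, np.log(p + eps)) + np.dot(1 - y, np.log(1 - p + eps))) / m

def gradient(X, y, w):
    # Same form as for linear regression: 1/m * X^T (h_w(X) - y)
    return np.dot(X.T, h(X, w) - y) / X.shape[0]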
Iris dataset
Code - Logistic Regression
https://github.com/DataForScience/DeepLearning
Logistic Regression
Learning Procedure
[Diagram (recap): input X^T, z = X^T w, hypothesis φ(z), predicted output ŷ, observed output, Error Function J_w(X, y⃗), Learning Algorithm, Constraint]
Comparison
• Linear Regression:
  z = X w⃗
  φ(z) = z
  h_w(X) = φ(z)
  J_w(X, y⃗) = (1/2m) [h_w(X) − y⃗]²
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
• Logistic Regression:
  z = X w⃗
  φ(z) = 1 / (1 + e^(−z))
  h_w(X) = φ(z)
  J_w(X, y⃗) = −(1/m) [y^T log(h_w(X)) + (1 − y)^T log(1 − h_w(X))]
  δ/δw_j J_w(X, y⃗) = (1/m) X^T ⋅ (h_w(X) − y⃗)
• In both cases we map the features to a continuous variable (z), predict based on that continuous variable (φ(z)), and compare the prediction with reality (J_w).
Learning Procedure
[Diagram: a single unit with inputs x_1 … x_N plus a constant bias input 1, weights w_0j … w_Nj, the weighted sum z_j = w^T x, and an activation function φ(z)]
Generalized Perceptron
[Diagram: the same unit as above: inputs, weights, bias, z_j = w^T x, and activation function φ(z)]
• By changing the activation function, we change the underlying algorithm (see the sketch below)
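
A sketch of such a generalized unit, with the activation function passed in as a parameter; the three activations anticipate the slides that follow, and all names are illustrative:

import numpy as np

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def unit(x, w, activation):
    # x already includes the constant bias input 1, and w includes w_0
    z = np.dot(w, x)
    return activation(z)

With activation=linear the unit performs a linear regression step; with sigmoid, a logistic regression one.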
Activation Function
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
Activation Function - Linear
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• The simplest: φ(z) = z (recovering Linear Regression)
Activation Function - Sigmoid
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• Perhaps the most common: φ(z) = 1 / (1 + e^(−z)) (recovering Logistic Regression)
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate the output using the activation function (see the sketch below)
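
A sketch of these three steps for a single layer; it mirrors the forward() function shown at the end of the deck, leaving out the bias column for brevity:

import numpy as np

def layer_forward(X, W, activation):
    # 1. obtain the inputs X
    # 2. multiply the inputs by the respective weights
    z = np.dot(X, W.T)
    # 3. calculate the output using the activation function
    return activation(z)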
Code - Forward Propagation
https://github.com/DataForScience/DeepLearning
Activation Function - ReLU
• Non-Linear function
• Differentiable
• non-decreasing
• Compute new sets of features
• Each layer builds up a more abstract representation of the data
• Results in faster learning than with sigmoid
• φ(z) = max(0, z), i.e. φ(z) = z for z > 0 and 0 otherwise (recovering Stepwise Regression, next)
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions B_i(x):
  f̂(x) = ∑_i c_i B_i(x)
• The basis functions B_i(x) can be:
• Constant
• “Hinge” functions of the form: max(0, x − b) and max(0, b − x)
• Products of hinges
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
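
A sketch of the two hinge forms just listed and of summing weighted basis functions as in f̂(x) above (function names are illustrative):

import numpy as np

def hinge(x, b):
    # max(0, x - b)
    return np.maximum(0.0, x - b)

def mirror_hinge(x, b):
    # max(0, b - x)
    return np.maximum(0.0, b - x)

def mars_eval(x, coefficients, basis_functions):
    # f(x) = sum_i c_i * B_i(x)
    return sum(c * B(x) for c, B in zip(coefficients, basis_functions))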
Stepwise Regression
• Multivariate Adaptive Regression Spline (MARS) is the best known example
• Fit curves using a linear combination of basis functions:
  f̂(x) = ∑_i c_i B_i(x)
• Example fit:
  y(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate output using the activation function
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one (see the sketch after the diagram below).
• But how can we propagate back the errors and update the weights?
[Diagram: two stacked units. The first takes inputs x_1 … x_N plus a bias input 1, with weights w_0j … w_Nj, computes w^T x, and outputs the activation a_j; the second takes a_1 … a_N plus a bias input 1, with weights w_0k … w_Nk, computes w^T a, and outputs a_k]
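
A sketch of that chaining, reusing the deck's own forward() function (defined again here so the snippet is self-contained; the layer sizes are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta, X, active):
    # Same as the forward() function at the end of the deck:
    # add the bias column, multiply by the weights, apply the activation
    N = X.shape[0]
    X_ = np.concatenate((np.ones((N, 1)), X), 1)
    return active(np.dot(X_, Theta.T))

# The output of one layer is simply the input of the next one
X = np.random.rand(5, 4)        # 5 samples, 4 features (illustrative)
Theta1 = np.random.rand(3, 5)   # 3 hidden units, 4 inputs + bias
Theta2 = np.random.rand(2, 4)   # 2 outputs, 3 hidden units + bias
a1 = forward(Theta1, X, sigmoid)
a2 = forward(Theta2, a1, sigmoid)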
Stepwise Regression
https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline
  f̂(x) = ∑_i c_i B_i(x)
  y(x) = 1.013
       + 1.198 max(0, x − 0.485)
       − 1.803 max(0, 0.485 − x)
       − 1.321 max(0, x − 0.283)
       − 1.609 max(0, x − 0.640)
       + 1.591 max(0, x − 0.907)
[Diagram: the same fit drawn as a small network, with the inputs 1 and x feeding three ReLU units whose outputs are combined by a Linear unit]
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
• Quadratic error function:
  E = (1/N) ∑_n |y_n − a_n|²
• Cross Entropy:
  J = −(1/N) ∑_n [y_n^T log a_n + (1 − y_n)^T log(1 − a_n)]
• The Cross Entropy is complementary to sigmoid activation in the output layer and improves its stability.
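
A sketch of the two losses above for a batch of network outputs a and targets y, both arrays of shape (N, outputs); the eps term is only for numerical safety:

import numpy as np

def quadratic_loss(y, a):
    # E = 1/N * sum_n |y_n - a_n|^2
    return np.sum((y - a) ** 2) / y.shape[0]

def cross_entropy_loss(y, a, eps=1e-12):
    # J = -1/N * sum_n [y_n^T log a_n + (1 - y_n)^T log(1 - a_n)]
    return -np.sum(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps)) / y.shape[0]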
Regularization
• Helps keep weights relatively small by adding a penalization to the cost function.
• Two common choices:
  L2: Ĵ_w(X) = J_w(X) + λ ∑_ij w_ij²
  “Lasso” (L1): Ĵ_w(X) = J_w(X) + λ ∑_ij |w_ij|
• Lasso helps with feature selection by driving less important weights to zero
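
A sketch of adding either penalty to a cost value already computed with one of the J_w functions above (λ is written as lam; names are illustrative):

import numpy as np

def l2_penalty(W, lam):
    # lambda * sum_ij w_ij^2
    return lam * np.sum(W ** 2)

def l1_penalty(W, lam):
    # "Lasso": lambda * sum_ij |w_ij|
    return lam * np.sum(np.abs(W))

def regularized_cost(cost, W, lam, penalty=l2_penalty):
    # J_hat = J + penalty
    return cost + penalty(W, lam)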
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas
• Update the weights
• The error at the output is a weighted average difference between the predicted output and the observed one.
• For inner layers there is no “real output”!
BackProp
• Let δ^(l) be the error at each of the L total layers
• Then, at the output:
  δ^(L) = h_w(X) − y
• And for every other layer, in reverse order:
  δ^(l) = (W^(l))^T δ^(l+1) .* φ′(z^(l))
• Until:
  δ^(1) ≡ 0
  as there’s no error on the input layer.
• And finally:
  Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
  ∂/∂w^(l)_ij J_w(X, y⃗) = (1/m) Δ^(l)_ij + λ w^(l)_ij
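
A sketch of these equations for a two-layer sigmoid network, following the weight shapes of the MNIST example later in the deck (for the sigmoid, φ′(z) = a(1 − a)); it illustrates the recursion only and is not the full implementation from the linked notebooks:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Prepend the constant input 1 to every sample
    return np.concatenate((np.ones((A.shape[0], 1)), A), 1)

def backprop(Theta1, Theta2, X, Y, lam=0.0):
    m = X.shape[0]
    # Forward pass, keeping every layer's activations
    a1 = sigmoid(np.dot(add_bias(X), Theta1.T))   # hidden activations
    a2 = sigmoid(np.dot(add_bias(a1), Theta2.T))  # network output h_w(X)
    # delta(L) = h_w(X) - y
    delta3 = a2 - Y
    # delta(l) = W^T delta(l+1) .* phi'(z(l)); drop Theta2's bias column
    delta2 = np.dot(delta3, Theta2[:, 1:]) * a1 * (1.0 - a1)
    # Accumulate the Deltas and form the gradients: 1/m * Delta + lam * Theta
    Grad2 = np.dot(delta3.T, add_bias(a1)) / m + lam * Theta2
    Grad1 = np.dot(delta2.T, add_bias(X)) / m + lam * Theta1
    return Grad1, Grad2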
A practical example - MNIST
[Figure: example digit images (an 8 and a 2) and the data matrix with rows Sample 1 … Sample N, columns Feature 1 … Feature M, plus a Label column]
yann.lecun.com/exdb/mnist/
A practical example - MNIST
yann.lecun.com/exdb/mnist/
[Figure: the data matrix X (Sample 1 … Sample M, Feature 1 … Feature N) with the Label vector y, and the network pipeline X → Θ1 → Θ2 → argmax]
• 3 layers: 1 input layer, 1 hidden layer, 1 output layer
A practical example - MNIST
• Layer sizes: 784 inputs, 50 hidden units, 10 outputs, trained on 5000 examples; argmax turns the 10 outputs into a single label
• Weight matrices (including the bias column): Θ1 is 50 × 785 (50 × 784 weights plus bias), Θ2 is 10 × 51 (10 × 50 weights plus bias)
[Diagram: X → Θ1 → Θ2 → argmax; the inputs are vectors, the weights are matrices; forward propagation flows left to right, backward propagation right to left]

import numpy as np

def sigmoid(z):
    # Logistic activation, defined earlier in the deck
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta, X, active):
    N = X.shape[0]
    # Add the bias column
    X_ = np.concatenate((np.ones((N, 1)), X), 1)
    # Multiply by the weights
    z = np.dot(X_, Theta.T)
    # Apply the activation function
    a = active(z)
    return a

def predict(Theta1, Theta2, X):
    h1 = forward(Theta1, X, sigmoid)
    h2 = forward(Theta2, h1, sigmoid)
    return np.argmax(h2, 1)
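
A hedged usage example for the functions above, with random inputs and randomly initialized weights standing in for the trained Θ1 and Θ2 (the actual training is done in the notebook linked below):

X_demo = np.random.rand(3, 784)                    # 3 fake "images"
Theta1_demo = np.random.randn(50, 785) * 0.01
Theta2_demo = np.random.randn(10, 51) * 0.01
print(predict(Theta1_demo, Theta2_demo, X_demo))   # three predicted digit labels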
Code - Simple Network
https://github.com/DataForScience/DeepLearning