Neural Networks: Universal Function Approximators
- Prakhar Mishra
Agenda
● Machine Learning Refresher
○ An Example
○ Hierarchical Division
○ Split Ratio
○ Evaluation Metric
● Neural Networks
○ Inspiration
○ Computation Graph
○ Architecture
○ Hyperparameters
○ Regularization
○ Backpropagation
Machine Learning - Quick Refresher
[Figure slides: a worked example, the hierarchical division of ML, and Feature Engineering ("figure out yourself")]
Split Ratio: 70%-80% training / 30%-20% testing
Machine Learning - Evaluation Metrics
● Confusion Matrix
○ Evaluates the performance of a classification model.
● Accuracy = (TP + TN) / Total Samples
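As an illustration (not from the slides), a minimal Python sketch of the accuracy formula above, assuming a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]:

    import numpy as np

    def accuracy_from_confusion(cm: np.ndarray) -> float:
        # Correct predictions sit on the diagonal (TN and TP);
        # the denominator is the total number of samples.
        return float(np.trace(cm) / cm.sum())

    cm = np.array([[50, 10],   # TN, FP
                   [ 5, 35]])  # FN, TP
    print(accuracy_from_confusion(cm))  # (50 + 35) / 100 = 0.85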
Machine Learning - Evaluation Metrics
● Root Mean Squared Error
○ Spread of the predicted y-values about the actual y-values.
RMSE = √( (1/N) ∑ᵢ (Ŷᵢ - Yᵢ)² )
N = Total Samples, Ŷᵢ = Predicted, Yᵢ = Actual
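A minimal Python sketch (not from the slides) of the RMSE formula above:

    import numpy as np

    def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
        # Square the prediction errors, average them, and take the root.
        return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

    print(rmse(np.array([2.5, 0.0, 2.1]), np.array([3.0, -0.5, 2.0])))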
Rise of Neural Nets
Scale drives Deep Learning.
Learning from Data: Structured and Unstructured
Neural Nets - Supervised
Input                 | Output           | Application
Home Features         | Cost             | Real Estate
Ad, User Information  | Click on Ad?     | Online Advertising
Image                 | Class (1...1000) | Photo Tagging
Audio                 | Text             | Speech Recognition
English               | Chinese          | Machine Translation
Computation Graph
J(a, b, c) = 3(a + bc)
Substitution: U = b·c, V = a + U, J = 3V
[Figure: graph with b and c feeding U = b·c, a and U feeding V = a + U, and V feeding J = 3V]
Input: a = 5, b = 3, c = 2, so U = 6, V = 11, J = 33
How does J change if we change V a bit?
How does J change if we change a a bit? Path a → V → J, so ∂J/∂a = (∂J/∂V) x (∂V/∂a)
How does J change if we change b a bit? Path b → U → V → J, so ∂J/∂b = (∂J/∂V) x (∂V/∂U) x (∂U/∂b)
Forward → computes the values; Backward ← computes the derivatives.
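A minimal Python sketch (not from the slides) of the forward and backward passes over this computation graph, reproducing U = 6, V = 11, J = 33 for a = 5, b = 3, c = 2:

    def forward(a, b, c):
        U = b * c   # U = b*c
        V = a + U   # V = a + U
        J = 3 * V   # J = 3V
        return U, V, J

    def backward(a, b, c):
        U, V, J = forward(a, b, c)
        dJ_dV = 3.0                      # from J = 3V
        dV_dU, dV_da = 1.0, 1.0          # from V = a + U
        dU_db, dU_dc = c, b              # from U = b*c
        dJ_da = dJ_dV * dV_da            # chain rule along a -> V -> J
        dJ_db = dJ_dV * dV_dU * dU_db    # chain rule along b -> U -> V -> J
        dJ_dc = dJ_dV * dV_dU * dU_dc    # chain rule along c -> U -> V -> J
        return dJ_da, dJ_db, dJ_dc

    print(forward(5, 3, 2))    # (6, 11, 33)
    print(backward(5, 3, 2))   # (3.0, 6.0, 9.0)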
Architecture
[Figure: inputs i1, i2, ..., in with weights w1, ..., wn feeding a unit that applies F to X; outputs o1, ..., on; a 3-Layer NN]
F = Activation Function
X = w1*i1 + w2*i2 + ... + wn*in + b
Output = F(X)
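A minimal Python sketch (not from the slides) of a single unit computing F(X) with X = w1*i1 + ... + wn*in + b; tanh is used here only as a placeholder activation:

    import numpy as np

    def neuron(inputs: np.ndarray, weights: np.ndarray, bias: float, F=np.tanh):
        X = np.dot(weights, inputs) + bias   # weighted sum plus bias
        return F(X)                          # apply the activation function F

    i = np.array([0.5, -1.0, 2.0])
    w = np.array([0.1, 0.4, -0.2])
    print(neuron(i, w, bias=0.05))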
Hyperparameters
● There are a number of hyperparameters that can be tuned while building your neural network:
○ Number of Hidden Layers
○ Epochs
○ Loss Function
○ Optimization Function
○ Weight Initialization
○ Activation Functions
○ Batch Size
○ Learning Rate
Weight Initialization
● If the weights in a network start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful.
● If the weights in a network start too large, the signal grows as it passes through each layer until it is too massive to be useful.
● Xavier Initialization keeps the signal's scale roughly constant across layers, avoiding both extremes.
Weight Initialization
Wᵢ = √(2 / nᵢ), where nᵢ is the number of input units feeding layer i.
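A minimal Python sketch (not from the slides) of weight initialization scaled by √(2 / nᵢ), the factor shown above:

    import numpy as np

    def init_weights(n_in: int, n_out: int) -> np.ndarray:
        # Draw from a standard normal, then scale by sqrt(2 / n_in) so the
        # signal neither shrinks nor explodes as it passes through the layer.
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

    W = init_weights(256, 128)
    print(W.std())  # roughly sqrt(2 / 256) ≈ 0.088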
Loss Functions
● Binary Cross Entropy
● Categorical Cross Entropy
● Root Mean Squared Error
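A minimal Python sketch (not from the slides) of one of the losses listed above, binary cross-entropy:

    import numpy as np

    def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        eps = 1e-12                             # avoid log(0)
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return float(-np.mean(y_true * np.log(y_pred)
                              + (1 - y_true) * np.log(1 - y_pred)))

    print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))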
Optimization Functions
● Adagrad Optimizer
● Gradient Descent Optimizer
● Adam Optimizer
● Stochastic Gradient Descent Optimizer
● RMSProp Optimizer
Optimization Functions - Adam
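The Adam equations were shown as figures; as an illustration (not from the slides), a minimal Python sketch of one Adam update step with the standard defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta = np.array([0.5, -0.3])
    m = v = np.zeros_like(theta)
    theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2]), m=m, v=v, t=1)
    print(theta)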
Learning Rate
● Decaying the learning rate over time is seen to speed up the learning process/convergence.
Learning Rate - Intuition
Learning Rate - Formula
α₁ = α₀ × 1 / (1 + decay × epoch_number)
Learning Rate - Special Case
Wᵢ = Wᵢ₋₁ - α × slope
Pseudo self-adaptive on a convex curve: as the slope flattens near the minimum, the update shrinks automatically even with a fixed α.
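A minimal Python sketch of the decay schedule above, assuming the decay is applied per epoch as reconstructed:

    def decayed_lr(alpha0: float, decay: float, epoch: int) -> float:
        # alpha1 = alpha0 * 1 / (1 + decay * epoch)
        return alpha0 / (1.0 + decay * epoch)

    for epoch in range(5):
        print(epoch, round(decayed_lr(alpha0=0.1, decay=0.5, epoch=epoch), 4))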
Activation Functions
Biologically inspired by the activity of our brain, where different neurons are activated by different stimuli.
Activation Functions - Sigmoid
Activation Functions - Tanh
Activation Functions - ReLU
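A minimal Python sketch (not from the slides) of the three activation functions named above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

    def tanh(x):
        return np.tanh(x)                 # squashes to (-1, 1), zero-centered

    def relu(x):
        return np.maximum(0.0, x)         # passes positives, zeroes out negatives

    x = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(x), tanh(x), relu(x))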
Activation Functions - Standards
● In practice, Tanh outperforms Sigmoid for internal (hidden) layers.
○ Tanh outputs are centered around 0; Sigmoid outputs are centered around 0.5.
○ In ML, we tend to center our data to avoid any kind of bias behaviour, and Tanh preserves that centering.
● Rule of thumb: ReLU generally performs well for hidden layers.
● Avoid Sigmoid for hidden layers.
● Sigmoid is a good candidate for the output of a binary classification problem.
● The identity function makes no sense for hidden layers (stacked linear layers collapse; see below).
Activation Functions - ReLU or Tanh?
ReLU > Tanh: ReLU avoids the vanishing-gradient problem.
Is it the best? [No]
Activation Functions - Why?
Because f_linear(f_linear(x)) is still f_linear(x): a network of N linear layers collapses to an equivalent network with fewer (N - X) layers, so only trivial functions are learned.
Activation Functions - Why?
● To learn more advanced functions, the activation must be nonlinear.
● It should be differentiable, so it can be used with backpropagation.
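A minimal Python sketch (not from the slides) showing why nonlinearity matters: two stacked linear layers collapse into a single equivalent linear layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
    x = rng.normal(size=3)

    two_layers = W2 @ (W1 @ x + b1) + b2       # "deep" network with no activation
    W, b = W2 @ W1, W2 @ b1 + b2               # equivalent single linear layer
    one_layer = W @ x + b

    print(np.allclose(two_layers, one_layer))  # True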
Batch Size
● The batch size is the number of samples passed through the network at a time.
● Advantages
○ Your machine might not fit all the data in memory at any given instant.
○ You want your model to generalize quickly.
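A minimal Python sketch (not from the slides) of iterating over a dataset in mini-batches of a fixed batch size:

    import numpy as np

    def iterate_minibatches(X, y, batch_size, shuffle=True):
        idx = np.arange(len(X))
        if shuffle:
            np.random.shuffle(idx)                 # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

    X, y = np.random.randn(10, 3), np.random.randint(0, 2, size=10)
    for xb, yb in iterate_minibatches(X, y, batch_size=4):
        print(xb.shape, yb.shape)   # (4, 3) (4,) ... the last batch may be smaller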
Training - Prerequisites
● Derivative
● Partial Derivative
● Chain Rule
Training - Example
[Figure: Input layer X1 = 0.05, X2 = 0.10, X3 = 0.02 (Xi = input); hidden layer H1, H2, H3; output layer O1 = Y. Weights into H1: w1 = 0.15, w2 = 0.20, w3 = 0.30; weight from H1 to O1: 0.33.]
Training - Forward Propagation
Hᵢ = ∑ᵢ wᵢ·xᵢ (compact representation)
H₁ = w₁·x₁ + w₂·x₂ + w₃·x₃ (expanded representation)
H₁ = 0.15·0.05 + 0.20·0.10 + 0.30·0.02 = 0.0335
Hσ₁ = σ(0.0335)
O₁ = ∑ᵢ Hᵢ·wᵢ (compact representation)
O₁ = H₁·0.33 = 0.0335·0.33 = 0.011055
Oσ₁ = σ(0.011055)
σ(H) = 1 / (1 + e⁻ᴴ)
Error = |Y - Ŷ|, where Y is the target and Ŷ the predicted output.
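A minimal Python sketch (not from the slides) reproducing the forward-pass numbers above, H₁ = 0.0335 and O₁ = 0.011055:

    import math

    def sigmoid(h):
        return 1.0 / (1.0 + math.exp(-h))

    x = [0.05, 0.10, 0.02]          # inputs
    w_hidden = [0.15, 0.20, 0.30]   # weights into H1
    w_out = 0.33                    # weight from H1 to O1

    H1 = sum(w * xi for w, xi in zip(w_hidden, x))   # 0.0335
    O1 = H1 * w_out                                  # 0.011055
    print(H1, sigmoid(H1), O1, sigmoid(O1))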
Training - Backward Propagation
The goal is to update each of the weights in the network so that they cause the actual output to be closer to the target output.
Training - Backward Propagation
∂Error/∂w₄ = (∂Error/∂Oσ₁) x (∂Oσ₁/∂O₁) x (∂O₁/∂w₄)
∂Error/∂w₁ = (∂Error/∂Hσ₁) x (∂Hσ₁/∂H₁) x (∂H₁/∂w₁)
∂Error/∂wᵢ = partial derivative of the error w.r.t. wᵢ, with Error = |Y - Ŷ|
[Figure: the chains w₁ → H₁ → Hσ₁ and w₄ → O₁ → Oσ₁ → Error]
Training - Backward Propagation
[Figure: hidden unit H₁ feeding two outputs with weights 0.33 and 0.04, contributing errors E₀ and E₁]
E_total = E_o1 + E_o2
∂E_total/∂Hσ₁ = ∂E_o1/∂Hσ₁ + ∂E_o2/∂Hσ₁
∂E_o1/∂Hσ₁ = (∂E_o1/∂H₁) x (∂H₁/∂Hσ₁)
∂E_o2/∂Hσ₁ = (∂E_o2/∂H₁) x (∂H₁/∂Hσ₁)
∂E_total/∂w₁ = (∂E_total/∂Hσ₁) x (∂Hσ₁/∂H₁) x (∂H₁/∂w₁)
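As an illustration (not the slides' exact computation), a minimal Python sketch of one backward pass through the tiny example network, assuming a squared-error loss E = 0.5·(Y - Oσ₁)² and a made-up target Y = 1.0, applying the chain rule as written above:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    x = [0.05, 0.10, 0.02]
    w = [0.15, 0.20, 0.30]   # input -> hidden weights
    w4 = 0.33                # hidden -> output weight
    Y = 1.0                  # assumed target value (not given on the slides)

    # Forward pass
    H1 = sum(wi * xi for wi, xi in zip(w, x))
    Hs1 = sigmoid(H1)
    O1 = Hs1 * w4
    Os1 = sigmoid(O1)
    E = 0.5 * (Y - Os1) ** 2

    # Backward pass (chain rule)
    dE_dOs1 = -(Y - Os1)
    dOs1_dO1 = Os1 * (1 - Os1)    # derivative of the sigmoid
    dO1_dw4 = Hs1
    dE_dw4 = dE_dOs1 * dOs1_dO1 * dO1_dw4

    dO1_dHs1 = w4
    dHs1_dH1 = Hs1 * (1 - Hs1)
    dH1_dw1 = x[0]
    dE_dw1 = dE_dOs1 * dOs1_dO1 * dO1_dHs1 * dHs1_dH1 * dH1_dw1

    print(dE_dw4, dE_dw1)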
Works perfectly on training data?
Regularization
A technique for preventing overfitting: regularization reduces overfitting by adding a penalty to the loss function.
Regularization - Dropout
● Dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
● It avoids co-dependency amongst neurons during training.
● Units are dropped with a given probability (20%-50%) in each weight-update cycle.
● Applying dropout at each layer of the network has shown good results.
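A minimal Python sketch (not from the slides) of inverted dropout applied to a layer's activations with drop probability p during training:

    import numpy as np

    def dropout(activations: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
        if not training or p == 0.0:
            return activations                           # no dropout at test time
        mask = np.random.rand(*activations.shape) >= p   # randomly keep units
        return activations * mask / (1.0 - p)            # rescale to preserve the expected value

    a = np.ones((2, 4))
    print(dropout(a, p=0.5))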
References
● Adam Optimization
● Andrew Ng Youtube
● Siraj Raval Youtube
● Adam Optimization
● Cross Entropy
● Deep Learning Basics
● BackPropagation