Ramesh Naidu L
Joint Director
Big Data Analytics & Machine Learning
C-DAC, Bangalore
Deep Learning
Outline
Introduction
What is Neural Network?
01
Deep Neural Network
DNN and its components
02
CNN
Convolutional Neural Networks & Transfer
Learning
03
Autoencoders
Latent space networks
04
RNN
Recurrent Neural Networks &
Its variants
05
Applications of DL
Applications and future of DL
06
Introduction to Deep Learning
What is Deep Learning?
01
INTRODUCTION
WHAT WE LEARN?
Challenges…
Approach…
Deep learning isn’t magic.
But it is very good at finding patterns…to make predictions.
Andrew Ng
Input
Input space Feature space
Motorbikes
“Non”-Motorbikes
Feature
Extraction
Machine
Learning
algorithm
pixel 1
pixel
2
“wheel”
“handle”
handle
wheel
What is Machine Learning?
Andrew Ng
ML vs. DL
THE BRAIN AND DEEP LEARNING
HISTORY
Dendrites
(Input Vector)
Soma (weighted summation of positive and negative signals from dendrites)
Axon (Output) (transmits the output signal to other neurons)
axon
soma
dendrites
Image Credits: https://en.wikipedia.org/wiki/Neuron
NEURAL NETWORK (BIOLOGY)
+
dendrites
soma
axon
NEURAL NETWORK (BIOLOGY)…
+
dendrites
soma
axon
NEURAL NETWORK (BIOLOGY)…
SYNAPSE
synapse
Credits: https://en.wikipedia.org/wiki/Neuron
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK
DEEP NEURAL NETWORK
[Deep] Neural Networks
Neuron to Deep Networks
02
ALGORITHMS/NETWORKS
▪ Single Perceptron
▪ Multi Layer Perceptron (MLP)
▪ Convolutional Neural Networks (CNN and its variants)
▪ Recurrent Neural Networks (RNN, LSTM, GRU)
▪ Autoencoders (for dimensionality reduction, unsupervised)
SINGLE PERCEPTRON
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
▪ Will the weather be nice?
▪ What’s the music like?
▪ Do I have anyone to go with?
▪ Can I afford it?
▪ Do I need to write my exam?
▪ Will I like the food in the food stalls at the stadium?
Just take the first four
for simplicity
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
To make my decision, I would consider how important each factor was to me, weigh them all up, and see if the result was over a certain threshold.
If so, I would go to the match! It's a bit like the process we often use of weighing up pros and cons to make a decision.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Give some weights…and scores…
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Scores:
• weather is looking pretty good but not perfect - 3 out of 4.
• music is ok but not my favourite - 2 out of 4.
• Company - my best friend has said she’ll come with me so I know the
company will be great, - 4 out of 4.
• Money - little pricey but not completely unreasonable - 2 out of 4.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Now for the importance:
• Weather - I really want to go if it’s sunny, very important - I give it a full 4 out of 4.
• Music - I’m happy to listen to most things, not super important - 2 out of 4.
• Company - I wouldn’t mind too much going to the match on my own, so
company can have a 2 out of 4 for importance.
• Money - I’m not too worried about money, so I can give it just a 1 out of 4.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Total Score = Sum of (Score x Weight) = 26
Threshold = 25
Score (26) > Threshold (25) → Going to the Cricket Match
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
This number is what is known as the ‘bias’
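As a minimal sketch (not from the original slides), the same decision written as a tiny perceptron in Python, using the scores, importances and threshold from the example above:

```python
# Perceptron view of the "go to the cricket match?" decision.
scores  = [3, 2, 4, 2]   # weather, music, company, money (out of 4)
weights = [4, 2, 2, 1]   # importance of each factor (out of 4)
threshold = 25           # the 'bias' acts as this threshold

weighted_sum = sum(s * w for s, w in zip(scores, weights))   # 3*4 + 2*2 + 4*2 + 2*1 = 26
decision = weighted_sum > threshold                          # 26 > 25 -> go to the match
print(weighted_sum, "go!" if decision else "stay home")
```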
WHAT HAPPENS IN A NEURON?
A general form of a Neuron/Single Perceptron
A PERCEPTRON – GENERAL FORM
• The basic unit of computation in a neural network is the neuron, often called
a node or unit.
• It receives input from some other nodes, or from an external source, and computes an output.
• Each input has an associated weight (w), which is assigned on the basis of its
relative importance
• The node applies a function f (defined below) to the weighted sum of its inputs
A PERCEPTRON…HOW IT WORKS?
1. All the inputs x are multiplied by their weights w; call the products k.
Fig: Multiplying inputs with weights for 5 inputs
A PERCEPTRON…HOW IT WORKS?
2. Add all the multiplied values and call them Weighted Sum.
Fig: Adding with Summation
WHY DO WE NEED WEIGHTS AND BIAS?
• Weights show the strength of the particular node/input.
• A bias value allows you to shift the activation function to the left or right.
WHY DO WE NEED WEIGHTS AND BIAS?
• Weights show the strength of the particular node/input.
• A bias value allows you to shift the activation function to the left or right.
Without the bias, a ReLU network is not able to deviate from zero at (0, 0).
Why do we need an activation function??
WHY DO WE NEED “ACTIVATION FUNCTION”?
For example, if we simply want to estimate the cost of a meal, we could use a neuron based on
a linear function f(z) = wx + b. Using the identity activation f(z) = z and weights equal to the
price of each item, the linear neuron would take in the number of burgers, fries, and sodas and
output the price of the meal.
The above kind of linear neuron is very easy
to compute.
However, by just having an interconnected
set of linear neurons, we won’t be able to
learn complex relationships: taking multiple
linear functions and combining them still
gives us a linear function, and we could
represent the whole network by a simple
matrix transformation.
WHY DO WE NEED “ACTIVATION FUNCTION”?
Real-world data is mostly non-linear, so this is where
activation functions come to help.
Activation functions introduce non-linearity into the
output of a neuron, and this ensures that we can
learn more complex functions by approximating
every non-linear function as a linear combination of
a large number of non-linear functions.
Also, activation functions help us to limit the
output to a certain finite value.
WHY DO WE NEED “ACTIVATION FUNCTION”?
• Their main purpose is to convert an input signal of a node in an ANN to
an output signal.
• In short, the activation functions are used to map the input between the
required values like (0, 1) or (-1, 1).
• It is also known as Transfer Function.
• They introduce non-linear properties to our Network.
• That output signal now is used as a input in the next layer in the stack.
The Activation Functions can be basically divided into 2 types-
• Linear Activation Function
• Non-linear Activation Functions
Sigmoid
tanh
ReLU
Leaky ReLU
Maxout
ELU
ACTIVATION FUNCTIONS
SIGMOID
• Sigmoid is a special case of the logistic function with
L = 1, k = 1 and x₀ = 0
• Takes a real number and
transforms it to → σ(x)∈(0,1).
• Large Negative → 0
• Large Positive → 1
Sigmoidal → S-shaped
Logistic
Sigmoid
Derivative of Sigmoid (dotted Line)
SIGMOID – DISADVANTAGES
1) Sigmoids saturate and kill gradients
• The neuron’s activation saturates at either tail (0 or 1)
• No gradient → no signal to the neuron
• Too large weights → neurons saturate → no learning
• Beyond about ±6, the gradient of the sigmoid is practically zero
(Vanishing Gradient Problem)
2) Output is not zero centred
HYPERBOLIC TANGENT FUNCTION - TANH
• It is also sigmoidal
• Mathematically,
tanh(x) = 2σ(2x)−1
Transforms data to -1 to 1
tanh(x)∈(−1,1).
• Output is zero centred
• It also saturates and kills gradient
RELU (RECTIFIED LINEAR UNIT) ACTIVATION FUNCTION
R(x) = max(0,x) ➔
R(x) = 0 if x <= 0
R(x) = x if x > 0
• The ReLU is the most used
activation function
• Range: [ 0 to infinity)
• More effective than sigmoidal
functions
• Disadvantage: any negative input
given to the ReLU activation function
turns the value into zero immediately in
the graph, which in turn affects the
resulting graph by not mapping the
negative values appropriately.
R(x) = max(0,x) ➔
R(x) = 0 if x <= 0
R(x) = x if x > 0
LEAKY RELU
Fix the “dying ReLU” problem by introducing a
small negative slope (0.01 or so):
f(x) = 0.01x, if x < 0
f(x) = x, if x ≥ 0
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
[Mass et al., 2013]
[He et al., 2015]
RANDOMIZED RELU
• Fix the problem of dying neurons
• It introduces a small slope to keep the updates alive.
• The leak helps to increase the range of the ReLU function. Usually, the value
of a is 0.01 or so.
• When a is not 0.01 then it is called Randomized ReLU.
SOFTMAX
• Transforms a real valued vector to [0, 1]
• Output of softmax is the list of probabilities of k different classes/categories
j = 1, 2, ..., K.
SOFTMAX – CLASSIFICATION OF DIGITS
• Transforms a real valued vector to [0, 1]
• Output of softmax is the list of probabilities of 10 digits
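As an illustrative sketch (not shown on the slides), a numerically stable softmax in NumPy:

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector to probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))         # approximately [0.659 0.242 0.099]
```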
WHICH ACTIVATION FUNCTION TO USE?
▪ Use ReLU, which should only be applied to the hidden layers.
▪ If our model suffers from dead neurons during training, we should use Leaky ReLU.
▪ Sigmoid and tanh should not be used nowadays due to the vanishing gradient problem,
which causes a lot of problems when training a neural network model.
▪ Softmax is used in the last layer (because it gives output probabilities).
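For reference, the activation functions discussed above can be sketched in NumPy as follows (an illustration, not a framework implementation):

```python
import numpy as np

def sigmoid(x):                # output in (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                   # zero-centred output in (-1, 1)
    return np.tanh(x)

def relu(x):                   # 0 for negative inputs, identity for positive
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):     # small slope a keeps negative inputs "alive"
    return np.where(x > 0, x, a * x)

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
```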
ACTIVATION FUNCTION CHEAT SHEET
SINGLE NEURON TO A NETWORK - NEURAL NETWORK
SINGLE NEURON TO A NETWORK
• The outputs of the neurons in one layer flow through to become the inputs of
the next layer, and so on.
A NETWORK TO A DEEP NETWORK
Add another hidden layer….
a neural network can be considered ‘deep’ if it contains more than one hidden layer...
FEEDFORWARD NEURAL NETWORK
Add another hidden layer….
a neural network can be considered ‘deep’ if it contains more than one hidden layer...
TRAINING THE NETWORK
1. Randomly initialise the network weights and biases
2. Training:
• Get a ton of labelled training data (e.g. pictures of cats labelled ‘cats’ and
pictures of other stuff also correctly labelled)
• For every piece of training data, feed it into the network
3. Testing:
• Check whether the network gets it right (given a picture labelled ‘cat’, is
the result of the network also ‘cat’ or is it ‘dog’, or something else?)
• If not, how wrong was it? Or, how right was it? (What probability did it
assign to its guess?)
4. Tuning/Improving: Nudge the weights a little to increase the probability of
the network getting the answer right.
EXAMPLE – CLASSIFY THE IMAGES – TRUMP & HILLARY
TRAINING THE NETWORK
Predict: 72% Trump, 28% Hillary
1. Two output neurons (one neuron for each class)
2. O/P: A probability of 72% that the image is “Trump” → 72% is insufficient
HOW TO INCREASE THE ACCURACY?
Predict: 72% Trump, 28% Hillary
1. Two output neurons, one for each class
2. O/P: A probability of 72% that the image is “Trump” → 72% is not sufficient
Go Backwards &
Adjust weights
HOW TO INCREASE THE ACCURACY?
▪ Measure the difference between the network’s output and the correct output using the ‘loss function’.
▪ Which Loss Function??
▪ Depends on the data
▪ Example: Mean Square Error (MSE), Categorical Cross Entropy
HOW TO INCREASE THE ACCURACY?
▪ Measure the difference between the network’s output and the correct output using the ‘loss function’.
▪ Which Loss Function??
▪ Depends on the data
▪ Example: Mean Square Error
The goal of training is to find the weights and
biases that minimizes the loss function.
Image source: firsttimeprogrammer.blogspot.co.u
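For concreteness, the two example loss functions can be written as below (a sketch; y_true and y_pred are hypothetical arrays of labels and predictions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error, e.g. for regression-style outputs."""
    return np.mean((y_true - y_pred) ** 2)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy for one-hot labels and softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

y_true = np.array([[0, 1, 0]])
y_pred = np.array([[0.2, 0.7, 0.1]])
print(mse(y_true, y_pred), categorical_cross_entropy(y_true, y_pred))
```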
HOW TO INCREASE THE ACCURACY?
▪ We can plot the loss against the weights.
▪ To do this accurately, we would need to be able to visualise tons of dimensions,
to account for the many weights and biases in the network.
▪ Because I find it difficult to visualise more than three dimensions, let’s pretend we only
need to find two weight values. We can then use the third dimension for the loss.
▪ Before training the network, the weights and biases are randomly initialised so
the loss function is likely to be high as the network will get a lot of things
wrong.
▪ Our aim is to find the lowest point of the loss function, and then see what
weight values that corresponds to. (See Next Slide).
HOW TO INCREASE THE ACCURACY?
▪ Here, we can easily see
where the lowest point is
and could happily read off
the corresponding weights
values.
▪ Unfortunately, it’s not that
easy in reality. The network
doesn’t have a nice
overview of the loss
function, it can only know
what its current loss is and
its current weights and
biases are.
Image source: firsttimeprogrammer.blogspot.co.u
HOW TO INCREASE THE ACCURACY – GRADIENT DESCENT
Image source: firsttimeprogrammer.blogspot.co.u
FEW MORE CONCEPTS…
BASIC UNITS OF NEURAL NETWORK
Input Layer: Input variables, sometimes called the visible layer.
Hidden Layers: Layers of nodes between the input and output layers. There may be
one or more of these layers.
Output Layer: A layer of nodes that produce the output variables.
Things to describe the shape and capability of a neural network:
▪ Size: The number of nodes in the model/network.
▪ Width: The number of nodes in a specific layer.
▪ Depth: The number of layers in a neural network (We don’t count input as a layer).
▪ Capacity: The type or structure of functions that can be learned by a network
configuration. Sometimes called “representational capacity“.
▪ Architecture: The specific arrangement of the layers and nodes in the network.
EPOCH, BATCH, ITERATION
EPOCH, BATCH, ITERATION
Learning Rate
Learning Rate (LR)
• Deep learning neural networks are trained using the stochastic gradient descent
algorithm.
• An optimization algorithm that estimates the error gradient for the current state of the
model using examples from the training dataset, then updates the weights of the model
using the back-propagation of errors algorithm
• Learning Rate: The amount that the weights are updated during training is referred to
as the step size or the “learning rate.”
• Specifically, the learning rate is a configurable hyperparameter used in the training of
neural networks that has a small positive value, often in the range between 0.0 and 1.0.
Learning Rate
• The learning rate controls how quickly the model is adapted to the problem.
• Smaller learning rates → Require more training epochs given the smaller changes made
to the weights each update
• Larger learning rates → Result in rapid changes and require fewer training epochs.
• We have to choose the learning rate carefully
The learning rate is perhaps the most important hyperparameter.
If you have time to tune only one hyperparameter, tune the learning rate.
Learning Rate
Learning Rate – How to find optimal LR?
Gradually increase the learning rate after each mini batch, recording the loss at each
increment.
This gradual increase can be on either a linear or exponential scale.
a) For learning rates which are too low, the loss may decrease, but at a very shallow rate.
b) When entering the optimal learning rate zone, you'll observe a quick drop in the loss
function.
c) Increasing the learning rate further will cause an increase in the loss as the parameter
updates cause the loss to "bounce around" and even diverge from the minima.
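A sketch of this learning-rate range test (train_step and batches are assumed placeholders for one training update and the mini-batches, not part of the slides):

```python
import numpy as np

def lr_range_test(train_step, batches, lr_min=1e-6, lr_max=1.0):
    """Exponentially increase the LR each mini-batch and record the loss.

    train_step(batch, lr) is an assumed helper that performs one weight
    update with the given learning rate and returns the loss.
    """
    n = len(batches)
    lrs = np.exp(np.linspace(np.log(lr_min), np.log(lr_max), n))
    losses = [train_step(batch, lr) for batch, lr in zip(batches, lrs)]
    return lrs, losses   # plot loss vs. lr and pick a value just before the loss blows up
```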
Learning Rate – How to find optimal LR?
Learning Rate – How to find optimal LR?
Reference:
https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-
learning-neural-networks/
Optimizers
1. Simple Gradient Descent
When we initialize our weights, we are at point A in
the loss landscape.
The first thing we do is to check, out of all possible
directions in the x-y plane, moving along which
direction brings about the steepest decline in the
value of the loss function.
This is the direction we have to move in.
Problems with SGD – Local Minima
Gradient descent is driven by the gradient, which will be zero at the base of any minimum.
A local minimum is called so since the value of the loss function is minimum at that point in a local region.
A global minimum is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function.
Optimization: Problems with SGD – Saddle Points
• A saddle point gets its name from the saddle of a horse, which it resembles.
• While it is a minimum in one direction (x), it is a local maximum in another direction,
and if the contour is flatter towards the x direction, GD would keep oscillating to and
fro in the y direction, giving us the illusion that we have converged to a minimum.
https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
Optimization: Problems with SGD – Local Minima
A global minimum is called so since the value of the loss function is minimum there,
globally across the entire domain of the loss function.
A Complicated Loss Landscape
Optimization: Problems with SGD
What if loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along shallow dimension, jitter along steep direction
Optimization: Problems with SGD – All at one place
Local Minima Saddle points
Poor Conditioning
2. SGD + Momentum
Try to capture some information regarding the previous updates a weight has gone through
before performing the current update.
If a weight is constantly moving in a particular direction (increasing or decreasing), it
will slowly accumulate some “momentum” in that direction.
When it faces some resistance and actually has to go the opposite way, it will still continue
going in the original direction for a while because of the accumulated momentum.
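A minimal sketch of the momentum update (the gradient function, learning rate and gamma below are illustrative values, not taken from the slides):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """One update: v accumulates past gradients, w moves along v."""
    v = gamma * v + lr * grad(w)   # momentum remembers the previous direction
    w = w - v
    return w, v

# toy example: minimise f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=lambda w: 2 * w)
print(w)   # close to 0
```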
2. SGD + Momentum
Example:
• This is comparable to the actual momentum in
physics. Let’s consider a small valley and a ball.
• If we release the ball at one side of the valley,
it will continuously roll down and in that
process it is gaining momentum towards that
direction.
• Finally when the ball reaches the bottom, it
doesn’t stop there.
• Instead it starts rolling up on the opposite side
for a while since it has gained some
momentum even though gravity is telling it to
come down.
3. Nesterov Momentum
Instead of calculating the gradients at the current position, we calculate them from an
approximate future position.
This is because we want to calculate our gradients in a smarter way.
Just before reaching a minimum, the momentum will start reducing because we are using
gradients from a future point. This results in improved stability and fewer oscillations
while converging; furthermore, in practice it performs better than pure momentum.
3. Nesterov Momentum
Momentum full formula expanded
Calculating approximate future weight value
4. AdaGrad (Adaptive Optimization)
In these methods, the parameters like learning rate (alpha) and momentum coefficient (n)
will not stay constant throughout the training process. Instead, these values will constantly
adapt for each and every weight in the network and hence will also change along with the
weights.
The first algorithm we are going to look into is Adagrad (Adaptive Gradient).
Adagrad tries to change the learning rate (alpha) for each update.
The learning rate changes during each update in such a way that it will decrease if a
weight is being updated too much in a short amount of time and it will increase if a weight
is not being updated much.
4. AdaGrad (Adaptive Optimization)
Each weight has its own cache value, which collects the squares of the gradients till the
current point.
The cache will continue to increase in value as the training progresses. Now the new
update formula is as follows:
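In code, the AdaGrad update described above looks roughly like this (eps is the usual small constant for numerical stability, not shown on the slide):

```python
import numpy as np

def adagrad_step(w, cache, g, lr=0.01, eps=1e-8):
    """Each weight's cache accumulates its squared gradients, scaling down
    the learning rate for weights that have been updated a lot."""
    cache = cache + g ** 2                    # the cache only ever grows
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```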
5. RMSProp
In RMSProp the only difference lies in the cache updating strategy. In the new formula, we
introduce a new parameter, the decay rate (gamma).
gamma value is usually around 0.9 or 0.99.
Hence for each update, the square of gradients get added at a very slow rate compared to
adagrad.
This ensures that the learning rate is changing constantly based on the way the weight is
being updated, just like adagrad, but at the same time the learning rate does not decay
too quickly, hence allowing training to continue for much longer.
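The corresponding RMSProp sketch only changes how the cache is updated (gamma is the decay rate mentioned above):

```python
import numpy as np

def rmsprop_step(w, cache, g, lr=0.01, gamma=0.9, eps=1e-8):
    """Like AdaGrad, but the cache is an exponential moving average,
    so the effective learning rate does not decay towards zero."""
    cache = gamma * cache + (1 - gamma) * g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```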
6. Adam
Adam is a little like combining RMSProp with Momentum. First we calculate our m value,
which will represent the momentum at the current point.
Adam
Momentum
Update Formula
The only difference between this equation and the momentum equation is that instead of the
learning rate we multiply the current gradient by (1 - beta_1). Next we calculate the
accumulated cache, which is exactly the same as in RMSProp:
Now we can get the final
update formula:
6. Adam
• As we can observe here, we are accumulating the gradients by calculating momentum, and we
are also constantly changing the learning rate by using the cache.
• Due to these two features, Adam usually performs better than any other optimizer out
there and is usually preferred while training neural networks.
• In the Adam paper, the recommended parameters are 0.9 for beta_1, 0.999 for beta_2
and 1e-08 for epsilon.
In practice:
- Adam is a good default choice in many cases
- SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning
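Putting the two ideas together gives a sketch of the Adam update (bias correction is included as in the original paper; the hyperparameter values are the recommended defaults):

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the momentum (1st moment), v the cache (2nd moment),
    t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * g          # momentum-like accumulation
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp-like cache
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```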
BACK PROPAGATION
HOW A NEURON WORKS?
1. Each neuron computes a weighted sum of the outputs of the previous
layer (which, for the inputs, we’ll call x),
2. applies an activation function, before forwarding it on to the next
layer.
3. Each neuron in a layer has its own set of weights—so while each
neuron in a layer is looking at the same inputs, their outputs will all be
different.
HOW A NETWORK WORKS?
1. We want to find a set of weights W that minimizes the output of J(W).
2. Consider a simple two-layer network.
3. Each neuron is a function of the previous one connected to it. In other words,
if one were to change the value of w1, both “hidden 1” and “hidden 2” (and
ultimately the output) neurons would change.
HOW A NETWORK WORKS?
we can mathematically formulate the output as an extensive composite
function:
INTRODUCE ERROR
we can mathematically formulate the output as an extensive composite
function:
J is a function of → input, weights, activation function and output
HOW TO MINIMIZE THE ERROR – BACK PROPAGATION
The error derivative can be used in gradient descent (as discussed
earlier) to iteratively improve upon the weight.
We compute the error derivatives w.r.t. every other weight in the
network and apply gradient descent in the same way.
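As a concrete sketch, the forward and backward pass for a tiny two-layer network with sigmoid hidden units and an MSE loss (the data, layer sizes and learning rate are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # toy inputs and targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.1

for step in range(100):
    # forward pass
    h = sigmoid(x @ W1 + b1)              # hidden layer
    y_hat = h @ W2 + b2                   # linear output
    loss = np.mean((y_hat - y) ** 2)

    # backward pass: chain rule from the loss back to every weight
    d_yhat = 2 * (y_hat - y) / len(x)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_h = (d_yhat @ W2.T) * h * (1 - h)   # sigmoid derivative
    dW1, db1 = x.T @ d_h, d_h.sum(axis=0)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```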
BACK PROPAGATION
Find out the best “w1” by reducing the error from the last layer to first
layer….
BACK PROPAGATION
How does Backpropagation work?
Refer to the document – backprop.docx
Convolutional Neural Networks
CNN, Transfer Learning etc.
03
Fast-forward to today: ConvNets are everywhere
Classification Retrieval
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Fast-forward To Today: ConvNets Are Everywhere
Figures copyright Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, 2015. Reproduced with permission.
[Faster R-CNN: Ren, He, Girshick, Sun 2015]
Detection Segmentation
[Farabet et al.,
2012]
Figures copyright Clement Farabet, 2012. Reproduced with permission.
CONVOLUTIONAL NEURAL NETWORK
Credits: Adam
CNN – STEP 1: BREAK THE IMAGE INTO OVERLAPPING IMAGE TILES
CNN - STEP 2: FEED EACH IMAGE TILE INTO A SMALL NEURAL NETWORK
CNN - STEP 3: SAVE THE RESULTS FROM EACH TILE INTO A NEW ARRAY
CNN - STEP 4: DOWN SAMPLING
The result of Step 3 was an array that maps out which parts of the original image are the
most interesting. But that array is still pretty big.
CNN - STEP 4: DOWN SAMPLING…
To reduce the size of the array, we down sample it using an algorithm called max pooling. It
sounds fancy, but it isn’t at all!
We’ll just look at each 2x2 square of the array and keep the biggest number:
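A minimal 2x2 max-pooling sketch in NumPy (assumes the array height and width are even):

```python
import numpy as np

def max_pool_2x2(a):
    """Keep the largest value in every non-overlapping 2x2 square."""
    h, w = a.shape
    return a[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 2],
              [0, 1, 3, 4]])
print(max_pool_2x2(a))   # [[4 2]
                         #  [5 4]]
```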
CNN – 5 STEP PIPELINE
So from start to finish, our whole five-step pipeline looks like this:
CNN – Layers/Components
▪ Convolution Layer
▪ Pooling Layer
▪ Fully Connected Layer
HOW CNN WORKS?
X or O
CNN
A two-dimensional
array of pixels
CNN
X
CNN
O
CNN
X
CNN
O
?
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 1 -1 -1
-1 -1 -1 1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
?
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 X -1 -1 -1 -1 X X -1
-1 X X -1 -1 X X -1 -1
-1 -1 X 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 X -1 -1
-1 -1 X X -1 -1 X X -1
-1 X X -1 -1 -1 -1 X -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 1 -1 -1
-1 -1 -1 1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
x
=
=
=
Convolution
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
Features match pieces of the image
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
math match
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 1 1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 1 1
-1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 1 1
-1 1 1
0.55
1 1 -1
1 1 1
-1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
=
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1 -1 -1
-1 1 -1
-1 -1 1
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
=
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
=
=
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
One image becomes a stack of filtered images
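The filtering step behind these maps (slide a 3x3 feature over the image, multiply element-wise and average) can be sketched as follows; values such as 0.77, 0.55 and 1.00 in the maps above come out of exactly this computation:

```python
import numpy as np

def filter_image(image, feature):
    """Slide `feature` over `image`; at each position multiply element-wise
    and take the mean, as in the X/O example (outputs lie in [-1, 1])."""
    fh, fw = feature.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]
            out[i, j] = np.mean(patch * feature)
    return out
```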
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
1 -1 -1
-1 1 -1
-1 -1 1
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
Pooling
1.00
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
A stack of images becomes a stack of smaller images
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
Change everything negative to zero → ReLU
ReLU
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77
0.77 0
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0 1.00 0 0.33 0 0.11 0
0.11 0 1.00 0 0.11 0 0.55
0.33 0.33 0 0.55 0 0.33 0.33
0.55 0 0.11 0 1.00 0 0.11
0 0.11 0 0.33 0 1.00 0
0.33 0 0.55 0.33 0.11 0 0.77
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0 1.00 0 0.33 0 0.11 0
0.11 0 1.00 0 0.11 0 0.55
0.33 0.33 0 0.55 0 0.33 0.33
0.55 0 0.11 0 1.00 0 0.11
0 0.11 0 0.33 0 1.00 0
0.33 0 0.55 0.33 0.11 0 0.77
0.33 0 0.11 0 0.11 0 0.33
0 0.55 0 0.33 0 0.55 0
0.11 0 0.55 0 0.55 0 0.11
0 0.33 0 1.00 0 0.33 0
0.11 0 0.55 0 0.55 0 0.11
0 0.55 0 0.33 0 0.55 0
0.33 0 0.11 0 0.11 0 0.33
0.33 0 0.55 0.33 0.11 0 0.77
0 0.11 0 0.33 0 1.00 0
0.55 0 0.11 0 1.00 0 0.11
0.33 0.33 0 0.55 0 0.33 0.33
0.11 0 1.00 0 0.11 0 0.55
0 1.00 0 0.33 0 0.11 0
0.77 0 0.11 0.33 0.55 0 0.33
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1.00 0.55
0.55 1.00
0.55 1.00
1.00 0.55
1.00 0.55
0.55 0.55
Fully Connected Layer
1.00 0.55
0.55 1.00
0.55 1.00
1.00 0.55
1.00 0.55
0.55 0.55
1.00
0.55
0.55
1.00
1.00
0.55
0.55
0.55
0.55
1.00
1.00
0.55
X
O
1.00
0.55
0.55
1.00
1.00
0.55
0.55
0.55
0.55
1.00
1.00
0.55
Feature Vector
X
O
0.55
1.00
1.00
0.55
0.55
0.55
0.55
0.55
1.00
0.55
0.55
1.00
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
X
O
TRANSFER LEARNING
Transfer Learning
Know how to ride a motorbike ⮫ Learn how to ride a car
Pre-trained models as feature extractors
Transfer Learning with CNNs
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
FC-1000
FC-4096
FC-4096
1. Train on Imagenet
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
2. Small Dataset (C classes)
FC-C
FC-4096
FC-4096
Freeze these
Reinitialize
this and train
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
Conv-512
Conv-512
MaxPool
FC-C
FC-4096
FC-4096
Train these
With bigger
dataset, train
more layers
Freeze these
Lower learning
rate when
finetuning; 1/10 of
original LR is
good starting
point
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
3. Bigger dataset
CNN Architectures
CNN Architectures (Pre-Trained Models)
Examples: (Available in Keras)
▪ Xception
▪ VGG16
▪ VGG19
▪ ResNet, ResNetV2
▪ InceptionV3
▪ InceptionResNetV2
▪ MobileNet, MobileNetV2
▪ DenseNet
▪ NASNet
Find them @
https://keras.io/applications/
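A sketch of the freeze-and-retrain recipe from the previous slides using Keras (VGG16 as the frozen feature extractor; the new head and the number of classes C are placeholders, not values from the slides):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

C = 10  # number of classes in the new, smaller dataset (placeholder)

# 1. Load the convolutional layers pre-trained on ImageNet, without the FC head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained features

# 2. Reinitialise and train only a new classifier on top
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(C, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(...)  # with a bigger dataset, unfreeze more layers and lower the LR
```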
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
(N – F)/S + 1
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Output volume [55x55x96]
Q: What is the total number of parameters in this layer?
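For reference, the parameter count follows directly from the filter shape: each of the 96 filters spans 11x11x3 and has one bias.

```python
filters, fh, fw, depth = 96, 11, 11, 3
params = filters * (fh * fw * depth + 1)   # +1 bias per filter
print(params)                              # 34,944 parameters
```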
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
...
Case Study: AlexNet
[Krizhevsky et al. 2012]
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class
scores)
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
Case Study: AlexNet
[Krizhevsky et al. 2012]
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class
scores)
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year: Lin et al / Sanchez & Perronnin (shallow), Krizhevsky et al (AlexNet, 8 layers), Zeiler & Fergus (8 layers), Simonyan & Zisserman (VGG, 19 layers), Szegedy et al (GoogLeNet, 22 layers), He et al (ResNet, 152 layers), Shao et al, Hu et al (SENet), Russakovsky et al]
First CNN-based winner: AlexNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
ZFNet: Improved
hyperparameters
over AlexNet
ZFNet [Zeiler and Fergus,
2013]
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
Deeper Networks
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Small filters, Deeper networks
8 layers (AlexNet)
-> 16 and 19 layers (VGG16, VGG19)
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
11.7% top 5 error in ILSVRC’13 (ZFNet)
-> 7.3% top 5 error in ILSVRC’14
AlexNet VGG16 VGG19
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note:
Most memory is
in early CONV
Most params
are in late FC
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
VGG16
Common names
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
Deeper Networks
Case Study: GoogLeNet
[Szegedy et al., 2014]
Inception module
- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters!
12x less than AlexNet
- ILSVRC’14 classification winner
(6.7% top 5 error)
Deeper networks, with computational
efficiency
Case Study: GoogLeNet
[Szegedy et al., 2014]
“Inception module”: design a
good local network topology
(network within a network) and
then stack these modules on
top of each other
Inception module
Case Study: GoogLeNet
[Szegedy et al., 2014]
Naive Inception module
Apply parallel filter operations on
the input from previous layer:
- Multiple receptive field sizes
for convolution (1x1, 3x3,
5x5)
- Pooling operation (3x3)
Concatenate all filter outputs
together depth-wise
Case Study: GoogLeNet
[Szegedy et al., 2014]
Naive Inception module
Apply parallel filter operations on
the input from previous layer:
- Multiple receptive field sizes
for convolution (1x1, 3x3,
5x5)
- Pooling operation (3x3)
Concatenate all filter outputs
together depth-wise
Q: What is the problem with this?
[Hint: Computational complexity]
Case Study: GoogLeNet
[Szegedy et al., 2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q1: What is the output size of
the 1x1 conv, with 128 filters?
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q1: What is the output size of
the 1x1 conv, with 128 filters?
28x28x128
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q2: What are the output sizes
of all different filter operations?
28x28x128
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q2: What are the output sizes
of all different filter operations?
28x28x128
28x28x192
28x28x96
28x28x256
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Naive Inception module
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops
Very expensive compute
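The 854M figure is the sum of (output size x filter size x input depth) for the three convolutions, e.g.:

```python
h = w = 28
in_depth = 256
ops = (h * w * 128 * 1 * 1 * in_depth      # 1x1 conv, 128 filters
       + h * w * 192 * 3 * 3 * in_depth    # 3x3 conv, 192 filters
       + h * w * 96 * 5 * 5 * in_depth)    # 5x5 conv, 96 filters
print(f"{ops / 1e6:.0f}M multiply-accumulate ops")   # ~854M
```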
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Chart: ILSVRC winners by year, as above]
“Revolution of Depth”
Case Study: ResNet
[He et al., 2015]
Very deep networks using residual
connections
- 152-layer model for ImageNet
- ILSVRC’15 classification winner
(3.57% top 5 error)
- Swept all classification and
detection competitions in
ILSVRC’15 and COCO’15!
[Residual block diagram: input X passes through two layers to give F(x); the identity skip adds X, so the block outputs F(x) + x, followed by relu]
Case Study: ResNet
[He et al., 2015]
[Diagrams: “plain” layers fit H(x) directly; the residual block fits F(x) and outputs F(x) + x via an identity skip connection]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a
desired underlying mapping
relu
X
Residual block
relu
Case Study: ResNet
[He et al., 2015]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a
desired underlying mapping
X
identity
F(x) + x
F(x)
relu
relu
X
Residual block
X
“Plain” layers
H(x)
Use layers to
fit residual
F(x) = H(x) - x
instead of
H(x) directly
H(x) = F(x) + x
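As an illustration, a minimal Keras sketch of one residual block (an assumption on my side: F(x) is two 3x3 convolutions with a channel count matching the input; this is not the exact ResNet-152 block):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Return relu(F(x) + x), where F is two 3x3 convs (the learned residual)."""
    shortcut = x                                        # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])                     # F(x) + x
    return layers.ReLU()(y)

# Usage on a feature map whose channel count matches `filters`:
inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = residual_block(inputs, filters=64)
block = tf.keras.Model(inputs, outputs)
```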
Comparing complexity... (figures from "An Analysis of Deep Neural Network Models for Practical Applications", 2017; copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017, reproduced with permission)
- Inception-v4: ResNet + Inception!
- VGG: highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, still memory heavy, lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy
Improving ResNets... network ensembling
"Good Practices for Deep Feature Fusion" [Shao et al. 2016]
- Multi-scale ensembling of Inception, Inception-ResNet, ResNet, and Wide ResNet models
- ILSVRC'16 classification winner
Improving ResNets... adaptive feature map reweighting
Squeeze-and-Excitation Networks (SENet) [Hu et al. 2017]
- Add a "feature recalibration" module that learns to adaptively reweight feature maps
- Global information (global avg. pooling layer) + 2 FC layers used to determine feature map weights
- ILSVRC'17 classification winner (using ResNeXt-152 as a base architecture)
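A minimal sketch of such a squeeze-and-excitation block in Keras (the reduction ratio of 16 is a common choice I am assuming; it is not given on the slide):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Reweight each feature map using global information + 2 FC layers."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # "squeeze": global context
    s = layers.Dense(channels // ratio, activation="relu")(s)   # FC layer 1
    s = layers.Dense(channels, activation="sigmoid")(s)         # FC layer 2 -> per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                            # "excitation": recalibrate the maps

# Usage on a feature map:
inputs = tf.keras.Input(shape=(28, 28, 64))
recalibrated = se_block(inputs)
```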
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners – completion of the challenge:
the annual ImageNet competition is no longer held after 2017; it has moved to Kaggle.
Summary: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- SENet
- NiN (Network in Network)
- Wide ResNet
- ResNeXt
- DenseNet
- FractalNet
- MobileNets
- NASNet
DATA AUGMENTATION
Why data augmentation? How do we reduce overfitting?
Summary: Data Augmentation
- Augmentation helps to prevent overfitting
- It makes the network invariant to certain transformations: translations, flips, etc.
- Can be done on-the-fly (see the sketch below)
- Can be used to learn image representations when no labeled datasets are available.
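A minimal sketch of on-the-fly augmentation with Keras preprocessing layers (assuming a recent TensorFlow; the flip, shift, and rotation ranges are illustrative, not taken from the slides):

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),      # random horizontal flips
    layers.RandomTranslation(0.1, 0.1),   # small random shifts
    layers.RandomRotation(0.05),          # slight random rotations
])

# Applied inside the input pipeline, so every epoch sees new random variants:
# dataset = dataset.map(lambda x, y: (augment(x, training=True), y))
```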
Recurrent Neural Networks
RNN, LSTM, GRU and BiLSTM
04
WHAT IS THE NEXT SCENE?
SHORTCOMINGS OF NORMAL NEURAL NETWORKS
For example, imagine you want to classify what kind of event is happening
at every point in a movie.
It’s unclear how a traditional neural network could use its reasoning
about previous events in the film to inform later ones.
RNN
RNNs address this issue: they have loops in them, which allow information to persist.
A chunk of neural network, A, looks at some input xt and outputs a value ht.
A loop allows information to be passed from one step of the network to the next.
RNN
▪ In traditional Networks, all inputs are independent of each other.
▪ RNN will make use of sequential information.
▪ If you want to predict the next word in a sentence you better know which
words came before it (A sequence of words → Predict the next word).
RNN…
▪ The most straightforward example is perhaps a time series of numbers,
where the task is to predict the next value given previous values.
▪ The input to the RNN at every time step is the current value as well as a state
vector which represents what the network has "seen" at time steps before.
▪ This state vector is the encoded memory of the RNN, initially set to zero (see the sketch below).
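A minimal Keras sketch of this setup (the window length, hidden size, and single-feature input are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW = 20  # number of past time steps fed to the network (assumed)

model = tf.keras.Sequential([
    # The recurrent layer's hidden state plays the role of the state vector,
    # initialized to zero and updated at every time step.
    layers.SimpleRNN(32, input_shape=(WINDOW, 1)),
    layers.Dense(1),  # predicted next value
])
model.compile(optimizer="adam", loss="mse")
```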
RNN…
▪ A recurrent neural network can be thought of as multiple copies of the same
network, each passing a message to a successor.
RNN – VARIOUS TYPES
(1) Processing without RNN, from fixed-sized input to fixed-sized output
Example: Image classification
RNN – VARIOUS TYPES
(2) Sequence output
Example: Image captioning
Takes an image and outputs a sentence of words.
Single Image → Sequence of words (Caption)
RNN – IMAGE CAPTIONING
RNN – VARIOUS TYPES
(3) Sequence input to a single output
Example: Sentiment Prediction
Takes a sequence of words (a sentence) and outputs a word (the sentiment).
Sentence (sequence of words) → Sentiment (word)
RNN – VARIOUS TYPES
(4) Sequence input and sequence output
Example: Machine Translation
An RNN reads a sentence in English and then outputs a sentence in French.
RNN – MACHINE TRANSLATION
Sequence of words (German) → Sequence of words (English)
The output only starts after we have seen the complete input, because the first word of
the translated sentence may require information captured from the complete
input sequence.
RNN – VARIOUS TYPES
(5) Synced sequence input and output
Example: Video Classification
Label each frame of the video
RNN – VARIOUS TYPES
RNN – notation: x = input, w = weights, a = activation, b = output
RNN – DISADVANTAGES
Inherently sequential
RNN – DISADVANTAGES
Can't persist long-term dependencies
Consider predicting the last word in "the clouds are in the sky":
we don't need any further context – it's pretty obvious the word is going to
be "sky". In such cases, where the gap between the relevant information and the place
that it's needed is small, RNNs can learn to use the past information.
But RNNs don't have memory cells:
the influence of x0 and x1 is lost by the time we reach x3.
RNN – DISADVANTAGES
Can't persist long-term dependencies
"I grew up in Bangalore… I speak fluent ? "
Ans: Kannada
Recent information ("I speak") suggests that the next word is probably the name of
a language.
But which language?
We need the context of Bangalore, from further back.
It's entirely possible for the gap between the relevant information and the point
where it is needed to become very large.
If the gap grows, RNNs cannot learn to connect the information.
LSTM – LONG SHORT-TERM MEMORY
▪ RNNs cannot capture long-term dependencies.
"I grew up in France… I speak fluent French."
We need the context of France… so we go for a new variant of RNN, i.e. the LSTM.
GATED RECURRENT UNITS (GRU)
BIDIRECTIONAL RNN (e.g. BiLSTM)
▪ The output at time t may not only depend on the previous elements in the
sequence, but also on future elements.
▪ Example: to predict a missing word in a sequence you want to look at both the left
and the right context.
▪ Two RNNs stacked on top of each other.
▪ Output computed based on the hidden state of both RNNs (see the sketch below).
▪ Huang Zhiheng, Xu Wei, Yu Kai. Bidirectional LSTM Conditional Random Field Models for Sequence Tagging.
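A minimal Keras sketch of a bidirectional LSTM sequence tagger in the spirit of the cited paper (vocabulary size, sequence length, embedding size, and tag count are illustrative assumptions, and the CRF layer from the paper is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_TAGS = 10_000, 50, 10   # hypothetical sizes

inputs = tf.keras.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, 64)(inputs)                          # word embeddings
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # left + right context
outputs = layers.Dense(NUM_TAGS, activation="softmax")(x)             # one tag per time step
tagger = tf.keras.Model(inputs, outputs)
tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```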
Autoencoders
Intro, Denoising AE, Convolutional AE
05
UNSUPERVISED LEARNING
Unsupervised Learning
Data: X (no labels!)
Goal: Learn the structure of the data
(learn correlations between features)
PCA – PRINCIPAL COMPONENT ANALYSIS
- Statistical approach for data
compression and visualization
- Invented by Karl Pearson in
1901
- Weakness: linear components
only.
TRADITIONAL AUTOENCODER
▪ Unlike PCA, we can now use activation functions to achieve non-linearity.
▪ It has been shown that an AE without activation functions achieves only the PCA capacity.
LATENT SPACE REPRESENTATION
(of a quality or state) existing but not yet developed or manifest;
hidden or concealed.
LATENT SPACE REPRESENTATION
Original → Latent Representation
In this fairly simple example, let's say our original data is a 5 x 5 x 1 image. We will
set the latent space size to 3 x 1, which means that the compressed data point is a vector
with 3 dimensions.
Now every 5x5x1 (25 numbers) data point is uniquely identified by just 3 numbers after compression.
LATENT SPACE REPRESENTATION
But you may be wondering, what are “similar” images and why does reducing the size of
the data make similar images “closer together” in space?
If we look at the three images above, two chairs of different color and a table, we will
easily say that the two images of the chair are more similar while the table is the most
different in shape.
LATENT SPACE REPRESENTATION
But what makes the other two chairs "alike"?
Of course we will consider related features such as angles, characteristics of shapes, etc.;
each such feature will be considered a data point represented in the latent space.
Irrelevant information, such as the colour of the seat, is ignored when mapping to the latent
space – in other words, we remove that information.
As a result, when we reduce the dimensionality, the images of the two chairs
become less different and more similar.
AE - SIMPLE IDEA
- Given data x (no labels) we would like to learn the functions f (encoder) and g (decoder) where:
f(x) = act(Wx + b) = z
g(z) = act(W'z + b') = x̂
such that h(x) = g(f(x)) = x̂,
where h is an approximation of the identity function.
(z is some latent representation or code, "act" is a non-linearity such as the sigmoid,
and x̂ is x's reconstruction.)
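A minimal dense-autoencoder sketch of this f/g formulation in Keras (the 784-dimensional input and 32-dimensional code are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

x_in = tf.keras.Input(shape=(784,))
z = layers.Dense(32, activation="sigmoid")(x_in)     # f(x) = act(Wx + b) = z
x_hat = layers.Dense(784, activation="sigmoid")(z)   # g(z) = act(W'z + b') = x_hat
dense_ae = tf.keras.Model(x_in, x_hat)               # h(x) = g(f(x)), x_hat is x's reconstruction
dense_ae.compile(optimizer="adam", loss="mse")
```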
CONVOLUTIONAL AE
Input: (28, 28, 1)
Encoder:
- Conv 1: 16 filters @ (3,3,1), same padding → C1 (28, 28, 16)
- MaxPool 1: (2,2), same → (14, 14, 16)
- Conv 2: 8 filters @ (3,3,16), same → C2 (14, 14, 8)
- MaxPool 2: (2,2), same → (7, 7, 8)
- Conv 3: 8 filters @ (3,3,8), same → C3 (7, 7, 8)
- MaxPool 3: (2,2), same → (4, 4, 8) = hidden code
Decoder:
- DConv 1: 8 filters @ (3,3,8), same → (4, 4, 8)
- UpSample 1: (2,2) → (8, 8, 8)
- DConv 2: 8 filters @ (3,3,8), same → (8, 8, 8)
- UpSample 2: (2,2) → (16, 16, 8)
- DConv 3: 16 filters @ (3,3,8), valid → (14, 14, 16)
- UpSample 3: (2,2) → (28, 28, 16)
- DConv 4: 1 filter @ (5,5), same → Output (28, 28, 1)
* Input values are normalized.
* All conv layers use relu activations except for the last conv, which uses sigmoid.
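A Keras sketch of this architecture (a minimal sketch assuming tf.keras; the kernel depth of the final 5x5 convolution is taken to match its 16-channel input):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(28, 28, 1))

# Encoder
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inp)    # (28,28,16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                       # (14,14,16)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (14,14,8)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                       # (7,7,8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (7,7,8)
code = layers.MaxPooling2D((2, 2), padding="same")(x)                    # (4,4,8) hidden code

# Decoder
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(code)    # (4,4,8)
x = layers.UpSampling2D((2, 2))(x)                                       # (8,8,8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (8,8,8)
x = layers.UpSampling2D((2, 2))(x)                                       # (16,16,8)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="valid")(x)     # (14,14,16)
x = layers.UpSampling2D((2, 2))(x)                                       # (28,28,16)
out = layers.Conv2D(1, (5, 5), activation="sigmoid", padding="same")(x)  # (28,28,1)

autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```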
DENOISING AUTOENCODERS
Intuition:
- We still aim to encode the input and NOT to mimic the identity function.
- We try to undo the effect of a corruption process stochastically applied to the input.
Noisy input → Encoder → latent space representation → Decoder → denoised input.
A more robust model.
DENOISING AUTOENCODERS
Use Case:
- Extract a robust representation for a NN classifier:
noisy input → Encoder → latent space representation.
DENOISING AUTOENCODERS
Instead of trying to mimic the identity function by minimizing
L(x, g(f(x))),
where L is some loss function, a DAE instead minimizes
L(x, g(f(x̃))),
where x̃ is a copy of x that has been corrupted by some form of noise.
DENOISING AUTOENCODERS - PROCESS
1. Take some input x and apply noise to obtain x̃.
2. Feed x̃ through the DAE to encode and decode: g(f(x̃)) = x̂.
3. Compare the reconstruction x̂ with the original x.
DENOISING CONVOLUTIONAL AE – KERAS
- 50 epochs
- Noise factor 0.5
- 92% accuracy on the validation set (see the sketch below)
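A minimal sketch of the corresponding training setup (assumptions on my side: MNIST via tf.keras.datasets, Gaussian corruption with the stated noise factor of 0.5, and the `autoencoder` model sketched in the convolutional AE section above):

```python
import numpy as np
import tensorflow as tf

(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0   # normalize to [0,1], add channel dim
x_test = x_test.astype("float32")[..., None] / 255.0

noise_factor = 0.5
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0.0, 1.0)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0.0, 1.0)

# Inputs are the noisy images, targets are the clean originals.
autoencoder.fit(x_train_noisy, x_train, epochs=50, batch_size=128,
                validation_data=(x_test_noisy, x_test))
```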
DL Tools / Frameworks
Keras, Tensorflow
06
MACHINE LEARNING TOOLS/FRAMEWORKS
Caffe is a deep learning framework made with expression, speed, and modularity in
mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well
as community contributors and is popular for computer vision.
Caffe supports cuDNN v5 for GPU acceleration.
Supported interfaces: C, C++, Python.
Caffe2 is a deep learning framework enabling simple and flexible deep learning. Built
on the original Caffe, Caffe2 is designed with expression, speed, and modularity in
mind, allowing for a more flexible way to organize computation.
Caffe2 supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: C++, Python
MACHINE LEARNING TOOLS/FRAMEWORKS
The Microsoft Cognitive Toolkit – previously known as CNTK – is a unified deep-learning
toolkit from Microsoft Research that makes it easy to train and combine popular
model types across multiple GPUs and servers. The Microsoft Cognitive Toolkit
implements highly efficient CNN and RNN training for speech, image and text data.
The Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: Python, C++, C# and a command line interface
TensorFlow is a software library for numerical computation using data flow graphs,
developed by Google’s Machine Intelligence research organization.
TensorFlow supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: C++, Python
MACHINE LEARNING TOOLS/FRAMEWORKS
Theano is a math expression compiler that efficiently defines, optimizes, and evaluates
mathematical expressions involving multi-dimensional arrays.
Theano supports cuDNN v5 for GPU acceleration.
Supported interfaces: Python
Keras is a minimalist, highly modular neural networks library, written in Python, and
capable of running on top of either TensorFlow or Theano. Keras was developed with
a focus on enabling fast experimentation.
cuDNN version depends on the version of TensorFlow and Theano installed with
Keras.
Supported Interfaces: Python
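For flavour, a tiny model written in the Keras style described above (a minimal sketch; the 784-dimensional input and layer sizes are arbitrary assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer
    layers.Dense(10, activation="softmax"),                    # 10-class output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```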
THANK YOU!
QUESTIONS?
DeepLearning.pdf

  • 1. Ramesh Naidu L Joint Director Big Data Analytics & Machine Learning C-DAC, Bangalore Deep Learning
  • 2. Outline 2 Introduction What is Neural Network? 01 Deep Neural Network DNN and its components 02 CNN Convolutional Neural Networks & Transfer Learning 03 Autoencoders Latent space networks 04 RNN Recurrent Neural Networks & Its variants 05 Applications of DL Applications and future of DL 06
  • 3. Introduction to Deep Learning What is Deep Learning? 01
  • 6. Deep learning isn’t magic. But, it is very good at finding patterns…to make predictions.
  • 7. Andrew Ng Input Input space Feature space Motorbikes “Non”-Motorbikes Feature Extraction Machine Learning algorithm pixel 1 pixel 2 “wheel” “handle” handle wheel What is Machine Learning?
  • 9. THE BRAIN AND DEEP LEARNING HISTORY
  • 10. Dendrites (InputVector) Soma (weighted Summation Function of positive and Negative signals from Dendrites) Axon (Output) (Transmits the output Signal to other neurons) axon soma dendrites Image Credits: https://en.wikipedia.org/wiki/Neuron NEURAL NETWORK (BIOLOGY)
  • 22. [Deep] Neural Networks Neuron to Deep Networks 02
  • 23. ALGORITHMS/NETWORKS ▪ Single Perceptron ▪ Multi Layer Perceptron (MLP) ▪ Convolution Neural Networks (CNN and its variants) ▪ Recurrent Neural Networks (RNN, LSTM, GRU) ▪ Auto encoders (For Dimension Reduction, Unsupervised))
  • 25. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? ▪ Will the weather be nice? ▪ What’s the music like? ▪ Do I have anyone to go with? ▪ Can I afford it? ▪ Do I need to write my exam? ▪ Will I like the food in the food stalls at stadium? Just take the first four for simplicity
  • 26. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? To make my decision, I would consider how important each factor was to me, weigh them all up, and see if the result was over a certain threshold. If so, I will go to the match! It’s a bit like the process we often use of weighing up pros and cons to make a decision.
  • 27. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Give some weights…and scores…
  • 28. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Scores: • weather is looking pretty good but not perfect - 3 out of 4. • music is ok but not my favourite - 2 out of 4. • Company - my best friend has said she’ll come with me so I know the company will be great, - 4 out of 4. • Money - little pricey but not completely unreasonable - 2 out of 4.
  • 29. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Now for the importance: • Weather - I really want to go if it’s sunny, very imp - I give it a full 4 out of 4. • Music - I’m happy to listen to most things, not super important - 2 out of 4. • Company - I wouldn’t mind too much going to the match on my own, so company can have a 2 out of 4 for importance. • Money - I’m not too worried about money, so I can give it just a 1 out of 4.
  • 30. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Total Score = Sum of (Score x Weight) = 26 Threshold = 25 Score (26) >Threshold (25) → Going to Cricket Match
  • 31. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? This number is what is known as the ‘bias’
  • 32. WHAT HAPPENS IN A NEURON? A general form of a Neuron/Single Perceptron
  • 33. A PERCEPTRON – GENERAL FORM • The basic unit of computation in a neural network is the neuron, often called a node or unit. • receives input from some other nodes, or from an external source and computes an output. • Each input has an associated weight (w), which is assigned on the basis of its relative importance • The node applies a function f (defined below) to the weighted sum of its inputs
  • 34. A PERCEPTRON…HOW IT WORKS? 1. All the inputs x are multiplied with their weights w. Let’s call it k. Fig: Multiplying inputs with weights for 5 inputs
  • 35. A PERCEPTRON…HOW IT WORKS? 2. Add all the multiplied values and call them Weighted Sum. Fig:Adding with Summation
  • 36. WHY DO WE NEED WEIGHTS AND BIAS? • Weights shows the strength of the particular node/input. • A bias value allows you to shift the activation function to the left or right.
  • 37. WHY DO WE NEED WEIGHTS AND BIAS? • Weights shows the strength of the particular node/input. • A bias value allows you to shift the activation function to the left or right. without the bias the ReLU-network is not able to deviate from zero at (0,0).
  • 38. Why do we need an activation function??
  • 39. WHY DO WE NEED “ACTIVATION FUNCTION”? For example, if we simply want to estimate the cost of a meal, we could use a neuron based on a linear function f(z) = wx + b. Using (z) = z and weights equal to the price of each item, the linear neuron would take in the number of burgers, fries, and sodas and output the price of the meal. The above kind of linear neuron is very easy for computation. However, by just having an interconnected set of linear neurons, we won’t be able to learn complex relationships: taking multiple linear functions and combining them still gives us a linear function, and we could represent the whole network by a simple matrix transformation.
  • 40. WHY DO WE NEED “ACTIVATION FUNCTION”? Real world data is mostly non-linear, so this is where the Activation functions comes for help. Activation functions introduce non-linearity into the output of a neuron, and this ensures that we can learn more complex functions by approximating every non-linear function as a linear combination of a large number of non-linear functions. Also, activation functions help us to limit the output to a certain finite value.
  • 41. WHY DO WE NEED “ACTIVATION FUNCTION”? • Their main purpose is to convert a input signal of a node in a A-NN to an output signal. • In short, the activation functions are used to map the input between the required values like (0, 1) or (-1, 1). • It is also known as Transfer Function. • They introduce non-linear properties to our Network. • That output signal now is used as a input in the next layer in the stack. The Activation Functions can be basically divided into 2 types- • Linear Activation Function • Non-linear Activation Functions
  • 43. SIGMOID • Sigmoid is a special case of logistic L = 1, k = 1 and xo = 0 • Takes a real number and transforms it to → σ(x)∈(0,1). • Large Negative → 0 • Large Positive → 1 Sigmoidal → S-shaped Logistic Sigmoid Derivative of Sigmoid (dotted Line)
  • 44. SIGMOID – DISADVANTAGES • Sigmoids saturate and kill gradients 1) Sigmoids saturate and kill gradients • neuron’s activation saturates at either tail of 0 or 1 No gradient → No signal to neuron Too large weights → Neurons saturate → No learning 2) Output is not zero centred After 6 or -6, gradient of sigmoid is becoming Zero (Vanishing Gradient Problem)
  • 45. HYPERBOLICTANGENT FUNCTION - TANH • It is also sigmoidal • Mathematically, tanh(x) = 2σ(2x)−1 Transforms data to -1 to 1 tanh(x)∈(−1,1). • Output is zero centred • It also saturates and kills gradient
  • 46. RELU (RECTIFIED LINEAR UNIT) ACTIVATION FUNCTION R(x) = max(0,x) ➔ R(x) = 0 if x <= 0 R(x) = x if x > 0 • The ReLU is the most used activation function • Range: [ 0 to infinity) • More effective than sigmoidal functions • Disadvantage: any negative input given to the ReLU activation function turns the value into zero immediately in the graph, which in turns affects the resulting graph by not mapping the negative values appropriately. R(x) = max(0,x) ➔ R(x) = 0 if x <= 0 R(x) = x if x > 0
  • 47. LEAKY RELU fix the “dying ReLU” problem by introducing a Small negative slope (0.01 or so) (x) = 0.01x, if (x<0) f(x) = x, if (x≥0) - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - will not “die”. [Mass et al., 2013] [He et al., 2015]
  • 48. RANDOMIZED RELU • Fix the problem of dying neurons • It introduces a small slope to keep the updates alive. • The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so. • When a is not 0.01 then it is called Randomized ReLU.
  • 49. SOFTMAX • Transforms a real valued vector to [0, 1] • Output of softmax is the list of probabilities of k different classes/categories j = 1, 2, ..., K.
  • 50. SOFTMAX – CLASSIFICATION OF DIGITS • Transforms a real valued vector to [0, 1] • Output of softmax is the list of probabilities of 10 digits
  • 51. WHICH ACTIVATION FUNCTIONTO USE? ▪ use ReLu which should only be applied to the hidden layers. ▪ And if our Model suffers form dead neurons during training we should use leaky ReLu. ▪ It’s just that Sigmoid and Tanh should not be used nowadays due to the vanishing Gradient Problem which causes a lots of problems to train a Neural Network Model. ▪ Softmax is used in the last layer (Because it gives output probabilities).
  • 53. SINGLE NEURON TO A NETWORK - NEURAL NETWORK
  • 54. SINGLE NEURONTO A NETWORK • The outputs of the neurons in one layer flow through to become the inputs of the next layer, and so on.
  • 55. A NETWORKTO A DEEP NETWORK Add another hidden layer…. a neural network can be considered ‘deep’ if it contains more hidden layers...
  • 56. FEEDFORWARD NEURAL NETWORK Add another hidden layer…. a neural network can be considered ‘deep’ if it contains more hidden layers...
  • 57. TRAININGTHE NETWORK 1. Randomly initialise the network weights and biases 2. Training: • Get a ton of labelled training data (e.g. pictures of cats labelled ‘cats’ and pictures of other stuff also correctly labelled) • For every piece of training data, feed it into the network 3. Testing: • Check whether the network gets it right (given a picture labelled ‘cat’, is the result of the network also ‘cat’ or is it ‘dog’, or something else?) • If not, how wrong was it? Or, how right was it? (What probability did it assign to its guess?) 4. Tuning /Improving: Nudge the weights a little to increase the probability of the network more probability getting the answer right.
  • 58. EXAMPLE – CLASSIFYTHE IMAGES –TRUMP & HILLORY
  • 59. TRAININGTHE NETWORK Predict 72 % Trump 28 % Hillory 1. Two output neurons (for each class one neuron) 2. O/P: A probability of 72% that the image is “Trump” → 72 % is insufficient
  • 60. HOWTO INCREASETHE ACCURACY? Predict 72 % Trump 28 % Hillory 1. Two output neurons for each class 2. O/P: A probability of 72% that the image is “Trump” → 72 % is Not sufficient Go Backwards & Adjust weights
  • 61. HOWTO INCREASETHE ACCURACY? ▪ Measure the network’s output and the correct output using the ‘loss function’. ▪ Which Loss Function?? ▪ Depends on the data ▪ Example: Mean Square Error (MSE), Categorical Cross Entropy
  • 62. HOWTO INCREASETHE ACCURACY? ▪ Measure the network’s output and the correct output using the ‘loss function’. ▪ Which Loss Function?? ▪ Depends on the data ▪ Example: Mean Square Error The goal of training is to find the weights and biases that minimizes the loss function. Image source: firsttimeprogrammer.blogspot.co.u
  • 63. HOWTO INCREASETHE ACCURACY? ▪ We can plot the loss against the weights. ▪ To do this accurately, we would need to be able to visualise tons of dimensions, to account for the many weights and biases in the network. ▪ Because I find it difficult to visualise more than three dimensions, ▪ Let’s pretend we only need to find two weight values.We can then use the third dimension for the loss. ▪ Before training the network, the weights and biases are randomly initialised so the loss function is likely to be high as the network will get a lot of things wrong. ▪ Our aim is to find the lowest point of the loss function, and then see what weight values that corresponds to. (See Next Slide).
  • 64. HOWTO INCREASETHE ACCURACY? ▪ Here, we can easily see where the lowest point is and could happily read off the corresponding weights values. ▪ Unfortunately, it’s not that easy in reality. The network doesn’t have a nice overview of the loss function, it can only know what its current loss is and its current weights and biases are. Image source: firsttimeprogrammer.blogspot.co.u
  • 65. HOWTO INCREASETHE ACCURACY – GRADIENT DESCENT Image source: firsttimeprogrammer.blogspot.co.u
  • 67. BASIC UNITS OF NEURAL NETWORK Input Layer: Input variables, sometimes called the visible layer. Hidden Layers: Layers of nodes between the input and output layers. There may be one or more of these layers. Output Layer: A layer of nodes that produce the output variables. Things to describe the shape and capability of a neural network: ▪ Size: The number of nodes in the model/network. ▪ Width: The number of nodes in a specific layer. ▪ Depth: The number of layers in a neural network (We don’t count input as a layer). ▪ Capacity: The type or structure of functions that can be learned by a network configuration. Sometimes called “representational capacity“. ▪ Architecture: The specific arrangement of the layers and nodes in the network.
  • 71. Learning Rate (LR) • Deep learning neural networks are trained using the stochastic gradient descent algorithm. • An optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm • Learning Rate: The amount that the weights are updated during training is referred to as the step size or the “learning rate.” • Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
  • 72. Learning Rate • The learning rate controls how quickly the model is adapted to the problem. • Smaller learning rates → Require more training epochs given the smaller changes made to the weights each update • Larger learning rates → Result in rapid changes and require fewer training epochs. • We have to choose the learning rate carefully The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
  • 74. Learning Rate – How to find optimal LR? Gradually increase the learning rate after each mini batch, recording the loss at each increment. This gradual increase can be on either a linear or exponential scale. a) For learning rates which are too low, the loss may decrease, but at a very shallow rate. b) When entering the optimal learning rate zone, you'll observe a quick drop in the loss function. c) Increasing the learning rate further will cause an increase in the loss as the parameter updates cause the loss to "bounce around" and even diverge from the minima.
  • 75. Learning Rate – How to find optimal LR?
  • 76. Learning Rate – How to find optimal LR? Reference: https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep- learning-neural-networks/
  • 78. 1. Simple Gradient Descent When we initialize our weights, we are at point A in the loss landscape. The first thing we do is to check, out of all possible directions in the x-y plane, moving along which direction brings about the steepest decline in the value of the loss function. This is the direction we have to move in.
  • 79. Problems with SGD – Local Minima Gradient descent is driven by the gradient, which will be zero at the base of any minima. Local minimum are called so since the value of the loss function is minimum at that point in a local region. A global minima is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function.
  • 80. Optimization: Problems with SGD – Saddle Points • A saddle point gets it's name from the saddle of a horse with which it resembles. • While it's a minima in one direction (x), it's a local maxima in another direction, and if the contour is flatter towards the x direction, GD would keep oscillating to and fro in the y - direction, and give us the illusion that we have converged to a minima. https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
  • 81. Optimization: Problems with SGD – Local Minima A global minima is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function. A Complicated Loss Landscape
  • 82. Optimization: Problems with SGD What if loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along shallow dimension, jitter along steep direction
  • 83. Optimization: Problems with SGD – All at one place Local Minima Saddle points Poor Conditioning
  • 84. 2. SGD + Momentum try to capture some information regarding the previous updates a weight has gone through before performing the current update. if a weight is constantly moving in a particular direction (increasing or decreasing), it will slowly accumulate some “momentum” in that direction. when it faces some resistance and actually has to go the opposite way, it will still continue going in the original direction for a while because of the accumulated momentum.
  • 85. 2. SGD + Momentum Example: • This is comparable to the actual momentum in physics. Let’s consider a small valley and a ball. • If we release the ball at one side of the valley, it will continuously roll down and in that process it is gaining momentum towards that direction. • Finally when the ball reaches the bottom, it doesn’t stop there. • Instead it starts rolling up on the opposite side for a while since it has gained some momentum even though gravity is telling it to come down.
  • 86. 3. Nesterov Momentum In this, what we try to do is instead of calculating the gradients at the current position we try to calculate it from an approximate future position. This is because we want to try to calculate our gradients in a smarter way. Just before reaching a minima, the momentum will start reducing before it reaches it because we are using gradients from a future point. This results in improved stability and lesser oscillations while converging, furthermore, in practice it performs better than pure momentum.
  • 87. 3. Nesterov Momentum Momentum full formula expanded Calculating approximate future weight value
  • 88. 4. AdaGrad (Adaptive Optimization) In these methods, the parameters like learning rate (alpha) and momentum coefficient (n) will not stay constant throughout the training process. Instead, these values will constantly adapt for each and every weight in the network and hence will also change along with the weights. The first algorithm we are going to look into is Adagrad (Adaptive Gradient). try to change the learning rate (alpha) for each update. The learning rate changes during each update in such a way that it will decrease if a weight is being updated too much in a short amount of time and it will increase if a weight is not being updated much.
  • 89. 4. AdaGrad (Adaptive Optimization) In these methods, the parameters like learning rate (alpha) and momentum coefficient (n) will not stay constant throughout the training process. Instead, these values will constantly adapt for each and every weight in the network and hence will also change along with the weights. The first algorithm we are going to look into is Adagrad (Adaptive Gradient). try to change the learning rate (alpha) for each update. The learning rate changes during each update in such a way that it will decrease if a weight is being updated too much in a short amount of time and it will increase if a weight is not being updated much.
  • 90. 4. AdaGrad (Adaptive Optimization) Each weight has its own cache value, which collects the squares of the gradients till the current point. The cache will continue to increase in value as the training progresses. Now the new update formula is as follows:
  • 91. 5. RMSProp In RMSProp the only difference lies in the cache updating strategy. In the new formula, we introduce a new parameter, the decay rate (gamma). gamma value is usually around 0.9 or 0.99. Hence for each update, the square of gradients get added at a very slow rate compared to adagrad. This ensures that the learning rate is changing constantly based on the way the weight is being updated, just like adagrad, but at the same time the learning rate does not decay too quickly, hence allowing training to continue for much longer.
  • 92. 6. Adam Adam is a little like combining RMSProp with Momentum. First we calculate our m value, which will represent the momentum at the current point. Adam Momentum Update Formula The only difference between this equation and the momentum equation is that instead of the learning rate we keep (1-Beta_1) to be multiplied with the current gradient. Next we will calculate the accumulated cache, which is exactly the same as it is in RMSProp: Now we can get the final update formula:
  • 93. 6. Adam • As we can observe here, we are performing accumulating the gradients by calculating momentum and also we are constantly changing the learning rate by using the cache. • Due to these two features, Adam usually performs better than any other optimizer out there and is usually preferred while training neural networks. • In the paper for Adam, the recommended parameters are 0.9 for beta_1, 0.99 for beta_2 and 1e-08 for epsilon.
  • 94. - Adam is a good default choice in many cases - SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning In practice:
  • 96. HOW A NEURON WORKS? 1. Each neuron computes a weighted sum of the outputs of the previous layer (which, for the inputs, we’ll call x), 2. applies an activation function, before forwarding it on to the next layer. 3. Each neuron in a layer has its own set of weights—so while each neuron in a layer is looking at the same inputs, their outputs will all be different.
  • 97. HOW A NETWORK WORKS? 1. we want to find a set of weights W that minimizes on the output of J(W). 2. Consider a simple double layer Network. 3. each neuron is a function of the previous one connected to it. In other words, if one were to change the value of w1, both “hidden 1” and “hidden 2” (and ultimately the output) neurons would change.
  • 98. HOW A NETWORK WORKS? we can mathematically formulate the output as an extensive composite function:
  • 99. INTRODUCE ERROR we can mathematically formulate the output as an extensive composite function: J is a function of → input, weights, activation function and output
  • 100. HOWTO MINIMIZETHE ERROR – BACK PROPAGATION The error derivative can be used in gradient descent (as discussed earlier) to iteratively improve upon the weight. We compute the error derivatives w.r.t. every other weight in the network and apply gradient descent in the same way.
  • 101. BACK PROPAGATION Find out the best “w1” by reducing the error from the last layer to first layer….
  • 102. BACK PROPAGATION How Backpropagation works? Refer the document – backprop.docx
  • 103. Convolutional Neural Networks CNN, Transfer Learning etc. 03
  • 104. Fast-forward to today: ConvNets are everywhere Classification Retrieval Figurescopyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012.Reproduced with permission.
  • 105. Fast-forwardTo Today: Convnets Are Everywhere Figurescopyright ShaoqingRen,Kaiming He,RossGirschick,JianSun,2015.Reproduced with permission. [Faster R-CNN: Ren, He, Girshick, Sun 2015] Detection Segmentation [Farabet et al., 2012] Figurescopyright ClementFarabet, 2012. Reproduced with permission.
  • 107. CNN – STEP 1: BREAKTHE IMAGE INTO OVERLAPPING IMAGETILES
  • 108. CNN - STEP 2: FEED EACH IMAGETILE INTO A SMALL NEURAL NETWORK
  • 109. CNN - STEP 3: SAVETHE RESULTS FROM EACHTILE INTO A NEW ARRAY
  • 110. CNN - STEP 4: DOWN SAMPLING The result of Step 3 was an array that maps out which parts of the original image are the most interesting. But that array is still pretty big.
  • 111. CNN - STEP 4: DOWN SAMPLING… To reduce the size of the array, we down sample it using an algorithm called max pooling. It sounds fancy, but it isn’t at all! We’ll just look at each 2x2 square of the array and keep the biggest number:
  • 112. CNN – 5 STEP PIPELINE So from start to finish, our whole five-step pipeline looks like this:
  • 113. CNN – Layers/Components ▪ Convolution Layer ▪ Pooling Layer ▪ Fully Connected Layer
  • 115. X or O CNN A two-dimensional array of pixels
  • 118. ?
  • 119. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ?
  • 120. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 X -1 -1 -1 -1 X X -1 -1 X X -1 -1 X X -1 -1 -1 -1 X 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 X -1 -1 -1 -1 X X -1 -1 X X -1 -1 X X -1 -1 -1 -1 X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  • 121. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 x
  • 122. = = =
  • 124. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1 Features match pieces of the image
  • 125. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 126. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 127. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 128. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 129. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 130.
  • 132–146. [Convolution walkthrough: a 3x3 feature is slid over every position of the 9x9 image; at each position the feature and the underlying image patch are multiplied element-wise and averaged, giving a match score between −1 and 1 (e.g. 1.00 for a perfect match, 0.55 for a partial match)]
  • 147–148. [Each feature produces a 7x7 map of match scores: a filtered image]
  • 149. One image becomes a stack of filtered images.
  • 150. [The stack of three 7x7 filtered images produced by the three features]
  • 153–158. [Max pooling walkthrough: a 2x2 window is stepped across the 7x7 filtered image and only the largest value in each window is kept]
  • 159. [Pooling applied to all three filtered images gives three smaller 4x4 maps]
  • 160. A stack of images becomes a stack of smaller images.
  • 161. Change everything negative to zero → ReLU
  • 162. ReLU
  • 163–166. [ReLU walkthrough: every negative value in a filtered image is replaced by 0 (e.g. −0.11 → 0), while positive values such as 0.77 pass through unchanged]
  • 167. [ReLU applied to all three filtered images]
  • 168. [Layers stack: the convolution, ReLU and pooling outputs of one stage become the input of the next]
  • 169. [After a second round of convolution, ReLU and pooling, the image is reduced to small 2x2 feature maps]
  • 171. [The final stack of 2x2 feature maps (values such as 1.00 and 0.55)]
  • 182. [Fully connected layer: the values from the final feature maps vote for each output class, X or O]
  • 184. Transfer Learning Know how to ride a motorbike ⮫ Learn how to ride a car
  • 186. Pre-trained models as feature extractors
  • 187. Transfer Learning with CNNs. 1. Train on ImageNet: a VGG-style stack (Conv-64 x2, MaxPool, Conv-128 x2, MaxPool, Conv-256 x2, MaxPool, Conv-512 x2, MaxPool, Conv-512 x2, MaxPool, FC-4096, FC-4096, FC-1000). 2. Small dataset (C classes): freeze the pre-trained layers, reinitialize only the final FC-C layer and train it. 3. Bigger dataset: freeze the lower layers but train more of the top layers (FC-C, FC-4096, FC-4096, and possibly the top conv block). Lower the learning rate when fine-tuning; 1/10 of the original LR is a good starting point. Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014; Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014.
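A hedged Keras sketch of this recipe (VGG16 as the frozen ImageNet base, a reinitialized head for C classes, and roughly 1/10 of the learning rate when fine-tuning; the dataset, C and the exact learning rates are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

C = 10  # number of classes in the small target dataset (assumption)

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                                   # "Freeze these"

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
outputs = layers.Dense(C, activation="softmax")(x)       # "Reinitialize this and train"
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

# With a bigger dataset: unfreeze the top conv block and fine-tune at ~1/10 of the LR.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```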
  • 189. CNN Architectures (Pre-Trained Models) Examples: (Available in Keras) ▪ Xception ▪ VGG16 ▪ VGG19 ▪ ResNet, ResNetV2 ▪ InceptionV3 ▪ InceptionResNetV2 ▪ MobileNet, MobileNetV2 ▪ DenseNet ▪ NASNet Find them @ https://keras.io/applications/
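For instance, loading one of these pre-trained models and using it as a fixed feature extractor might look like this (a sketch; "example.jpg" is a placeholder path):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d features

img = keras.preprocessing.image.load_img("example.jpg", target_size=(224, 224))
x = keras.preprocessing.image.img_to_array(img)
x = preprocess_input(np.expand_dims(x, axis=0))   # VGG-specific preprocessing
features = extractor.predict(x)                   # shape (1, 512): reusable image features
```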
  • 190. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Architecture : CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
  • 191. Case Study: AlexNet [Krizhevsky et al. 2012] Input: 227x227x3 images First layer (CONV1): 96 11x11 filters applied at stride 4 => Q: what is the output volume size? Hint: (227-11)/4+1 = 55 (N – F)/S + 1
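A quick worked check of the (N − F)/S + 1 formula, which also answers the parameter question asked on the next slide (a small calculation helper for illustration, not code from the slides):

```python
def conv_output_size(N, F, S, P=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=227, F=11, S=4))      # 55 -> output volume is 55x55x96

# CONV1 parameters: each of the 96 filters has 11*11*3 weights plus one bias
print(96 * (11 * 11 * 3 + 1))                  # 34,944 (~35K)
```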
  • 192. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Input: 227x227x3 images First layer (CONV1): 96 11x11 filters applied at stride 4 => Output volume [55x55x96] Q: What is the total number of parameters in this layer?
  • 193. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Input: 227x227x3 images After CONV1: 55x55x96 Second layer (POOL1): 3x3 filters applied at stride 2 Q: what is the output volume size? Hint: (55-3)/2+1 = 27
  • 194. Case Study: AlexNet [Krizhevsky et al. 2012] Input: 227x227x3 images After CONV1: 55x55x96 After POOL1: 27x27x96 ...
  • 195. Case Study: AlexNet [Krizhevsky et al. 2012] [4096] FC6: 4096 neurons [4096] FC7: 4096 neurons [1000] FC8: 1000 neurons (class scores) Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Full (simplified) AlexNet architecture: [227x227x3] INPUT [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 [27x27x96] MAX POOL1: 3x3 filters at stride 2 [27x27x96] NORM1: Normalization layer [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 [13x13x256] MAX POOL2: 3x3 filters at stride 2 [13x13x256] NORM2: Normalization layer [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 [6x6x256] MAX POOL3: 3x3 filters at stride 2
  • 196. Case Study: AlexNet [Krizhevsky et al. 2012] [4096] FC6: 4096 neurons [4096] FC7: 4096 neurons [1000] FC8: 1000 neurons (class scores) Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Full (simplified) AlexNet architecture: [227x227x3] INPUT [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 [27x27x96] MAX POOL1: 3x3 filters at stride 2 [27x27x96] NORM1: Normalization layer [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 [13x13x256] MAX POOL2: 3x3 filters at stride 2 [13x13x256] NORM2: Normalization layer [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 [6x6x256] MAX POOL3: 3x3 filters at stride 2 Details/Retrospectives: -first use of ReLU -used Norm layers (not common anymore) -heavy data augmentation -dropout 0.5 -batch size 128 -SGD Momentum 0.9 -Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
  • 197. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners (shallow → 8 → 8 → 19 → 22 → 152 → 152 → 152 layers): Lin et al., Sanchez & Perronnin, Krizhevsky et al. (AlexNet), Zeiler & Fergus, Simonyan & Zisserman (VGG), Szegedy et al. (GoogLeNet), He et al. (ResNet), Shao et al., Hu et al. (SENet); Russakovsky et al. AlexNet: first CNN-based winner.
  • 198. ILSVRC winners (same chart as above). ZFNet: improved hyperparameters over AlexNet.
  • 199. ZFNet [Zeiler and Fergus, 2013] AlexNet but: CONV1: change from (11x11 stride 4) to (7x7 stride 2) CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512 ImageNet top 5 error: 16.4% -> 11.7%
  • 200. ILSVRC winners (same chart as above). Deeper networks.
  • 201. Case Study: VGGNet [Simonyan and Zisserman, 2014] Small filters, Deeper networks 8 layers (AlexNet) -> 16 and 19 layers (VGG16, VGG19) Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2 11.7% top 5 error in ILSVRC’13 (ZFNet) -> 7.3% top 5 error in ILSVRC’14 AlexNet VGG16 VGG19
  • 202. VGG16 layer by layer:
    INPUT: [224x224x3] memory: 224*224*3=150K, params: 0 (not counting biases)
    CONV3-64: [224x224x64] memory: 224*224*64=3.2M, params: (3*3*3)*64 = 1,728
    CONV3-64: [224x224x64] memory: 224*224*64=3.2M, params: (3*3*64)*64 = 36,864
    POOL2: [112x112x64] memory: 112*112*64=800K, params: 0
    CONV3-128: [112x112x128] memory: 112*112*128=1.6M, params: (3*3*64)*128 = 73,728
    CONV3-128: [112x112x128] memory: 112*112*128=1.6M, params: (3*3*128)*128 = 147,456
    POOL2: [56x56x128] memory: 56*56*128=400K, params: 0
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*128)*256 = 294,912
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
    POOL2: [28x28x256] memory: 28*28*256=200K, params: 0
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*256)*512 = 1,179,648
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
    POOL2: [14x14x512] memory: 14*14*512=100K, params: 0
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    POOL2: [7x7x512] memory: 7*7*512=25K, params: 0
    FC: [1x1x4096] memory: 4096, params: 7*7*512*4096 = 102,760,448
    FC: [1x1x4096] memory: 4096, params: 4096*4096 = 16,777,216
    FC: [1x1x1000] memory: 1000, params: 4096*1000 = 4,096,000
    TOTAL memory: 24M * 4 bytes ~= 96MB / image (forward only; ~2x for backward)
    TOTAL params: 138M parameters
    Note: Most memory is in the early CONV layers; most params are in the late FC layers.
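The per-layer counts above follow from two one-line formulas: conv params = (k·k·C_in)·C_out and FC params = n_in·n_out (biases ignored, as in the slide). A small helper, for illustration only:

```python
def conv_params(k, c_in, c_out):
    return (k * k * c_in) * c_out        # weights of a k x k conv layer (no biases)

def fc_params(n_in, n_out):
    return n_in * n_out                  # weights of a fully connected layer (no biases)

print(conv_params(3, 3, 64))             # 1,728       (first CONV3-64)
print(conv_params(3, 512, 512))          # 2,359,296   (a late CONV3-512)
print(fc_params(7 * 7 * 512, 4096))      # 102,760,448 (the first FC layer dominates the total)
```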
  • 203. [Same VGG16 memory/parameter breakdown as the previous slide, annotated with the common names of the layer groups]
  • 204. ILSVRC winners (same chart as above). Deeper networks.
  • 205. Case Study: GoogLeNet [Szegedy et al., 2014] Inception module - 22 layers - Efficient “Inception” module - No FC layers - Only 5 million parameters! 12x less than AlexNet - ILSVRC’14 classification winner (6.7% top 5 error) Deeper networks, with computational efficiency
  • 206. Case Study: GoogLeNet [Szegedy et al., 2014] “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other Inception module
  • 207. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module Apply parallel filter operations on the input from previous layer: - Multiple receptive field sizes for convolution (1x1, 3x3, 5x5) - Pooling operation (3x3) Concatenate all filter outputs together depth-wise
  • 208. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module Apply parallel filter operations on the input from previous layer: - Multiple receptive field sizes for convolution (1x1, 3x3, 5x5) - Pooling operation (3x3) Concatenate all filter outputs together depth-wise Q: What is the problem with this? [Hint: Computational complexity]
  • 209. Case Study: GoogLeNet [Szegedy et al., 2014] Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256 Naive Inception module
  • 210. Case Study: GoogLeNet [Szegedy et al., 2014] Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256 Naive Inception module Q1: What is the output size of the 1x1 conv, with 128 filters?
  • 211. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module. Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256. Q1: What is the output size of the 1x1 conv, with 128 filters? → 28x28x128
  • 212–213. Q2: What are the output sizes of all the different filter operations? → 28x28x128, 28x28x192, 28x28x96, 28x28x256
  • 214–215. Q3: What is the output size after filter concatenation? → 28x28x(128+192+96+256) = 28x28x672
  • 216. Conv ops: [1x1 conv, 128] 28x28x128x1x1x256; [3x3 conv, 192] 28x28x192x3x3x256; [5x5 conv, 96] 28x28x96x5x5x256. Total: 854M ops
  • 217. Very expensive compute.
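The 854M figure can be checked directly (multiply-accumulates per conv ≈ H × W × C_out × k × k × C_in); a small calculation for illustration:

```python
H, W, C_in = 28, 28, 256                     # naive Inception module input

ops_1x1 = H * W * 128 * 1 * 1 * C_in         # ~25.7M
ops_3x3 = H * W * 192 * 3 * 3 * C_in         # ~347M
ops_5x5 = H * W * 96 * 5 * 5 * C_in          # ~482M

total = ops_1x1 + ops_3x3 + ops_5x5
print(round(total / 1e6))                    # ~854M ops -> very expensive compute
```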
  • 218. ILSVRC winners (same chart as above). “Revolution of Depth”: ResNet jumps to 152 layers.
  • 219. Case Study: ResNet [He et al., 2015] Very deep networks using residual connections. - 152-layer model for ImageNet - ILSVRC’15 classification winner (3.57% top 5 error) - Swept all classification and detection competitions in ILSVRC’15 and COCO’15! [Residual block diagram: the input X goes through layers computing F(x); the identity X is added back, giving F(x) + x, followed by ReLU]
  • 220. Case Study: ResNet [He et al., 2015] [Left: “plain” layers that try to fit H(x) directly. Right: a residual block with an identity shortcut.] Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.
  • 221. Case Study: ResNet [He et al., 2015] Use the layers to fit the residual F(x) = H(x) − x instead of H(x) directly, so that H(x) = F(x) + x.
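A hedged Keras sketch of such a residual block (the filter count and the use of batch normalization are assumptions; it also assumes the input already has `filters` channels so the identity shortcut can be added directly):

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                             # identity path
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)    # layers that learn F(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                          # H(x) = F(x) + x
    return layers.Activation("relu")(y)
```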
  • 222. An Analysis of Deep Neural Network Models for Practical Applications, 2017. Comparing complexity... Inception-v4: Resnet + Inception! Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 223. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. VGG: Highest memory, most operations Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 224. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. GoogLeNet: most efficient Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 225. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. AlexNet: Smaller compute, still memory heavy, lower accuracy Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 226. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. ResNet: Moderate efficiency depending on model, highest accuracy Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 227. ILSVRC winners (same chart as above). Network ensembling.
  • 228. Improving ResNets: “Good Practices for Deep Feature Fusion” [Shao et al. 2016] - Multi-scale ensembling of Inception, Inception-ResNet, ResNet and Wide ResNet models - ILSVRC’16 classification winner
  • 229. ILSVRC winners (same chart as above). Adaptive feature map reweighting.
  • 230. Improving ResNets: Squeeze-and-Excitation Networks (SENet) [Hu et al. 2017] - Add a “feature recalibration” module that learns to adaptively reweight feature maps - Global information (global avg. pooling layer) + 2 FC layers used to determine feature map weights - ILSVRC’17 classification winner (using ResNeXt-152 as a base architecture)
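A sketch of that “feature recalibration” module in Keras: global average pooling plus two FC layers produce one weight per feature map, which then rescales the maps. The reduction ratio r = 16 is an assumption based on the SENet paper, not stated on the slide:

```python
from tensorflow.keras import layers

def se_block(x, r=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                   # "squeeze": global information
    s = layers.Dense(channels // r, activation="relu")(s)    # FC 1
    s = layers.Dense(channels, activation="sigmoid")(s)      # FC 2: per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                         # adaptively reweight the feature maps
```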
  • 231. ILSVRC winners (same chart as above).
  • 232. ILSVRC winners (same chart as above). Completion of the challenge: the annual ImageNet competition was no longer held after 2017; it has moved to Kaggle.
  • 233. Summary: CNN Architectures Case Studies - AlexNet - VGG - GoogLeNet - ResNet Also.... - SENet - NiN (Network in Network) - Wide ResNet - ResNeXT - DenseNet - FractalNet - MobileNets - NASNet
  • 237–240. How to reduce overfitting? [series of illustrative slides]
  • 247. Summary: Data Augmentation - Augmentation helps to prevent overfitting - It makes the network invariant to certain transformations: translations, flips, etc. - Can be done on-the-fly - Can be used to learn image representations when no labeled datasets are available.
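A minimal sketch of on-the-fly augmentation in Keras (the particular transformations and their ranges are assumptions chosen for illustration):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,         # small random rotations
    width_shift_range=0.1,     # random horizontal translations
    height_shift_range=0.1,    # random vertical translations
    horizontal_flip=True,      # random left-right flips
)

# datagen.flow(x_train, y_train, batch_size=32) yields freshly augmented batches every
# epoch, so the augmentation happens on-the-fly and nothing extra is stored on disk.
```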
  • 248. Recurrent Neural Networks RNN, LSTM, GRU and BiLSTM 04
  • 249. WHAT IS THE NEXT SCENE?
  • 250. SHORTCOMINGS OF NORMAL NEURAL NETWORKS For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.
  • 251. RNN RNNs address this issue. They contain loops that allow information to persist. A chunk of neural network, A, looks at some input xt and outputs a value ht. A loop allows information to be passed from one step of the network to the next.
  • 252. RNN ▪ In traditional Networks, all inputs are independent of each other. ▪ RNN will make use of sequential information. ▪ If you want to predict the next word in a sentence you better know which words came before it (A sequence of words → Predict the next word).
  • 253. RNN… ▪ The most straightforward example is perhaps a time series of numbers, where the task is to predict the next value given previous values. ▪ The input to the RNN at every time step is the current value as well as a state vector which represents what the network has “seen” at previous time steps. ▪ This state vector is the encoded memory of the RNN, initially set to zero.
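A toy Keras sketch of exactly this setup: the RNN reads a short window of previous values (carrying its state across time steps) and predicts the next value. The sine-wave data, window length and layer sizes are assumptions made for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

series = np.sin(np.linspace(0, 20, 500))                      # a toy time series
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                                        # (samples, time steps, 1 feature)

model = keras.Sequential([
    layers.SimpleRNN(32, input_shape=(window, 1)),            # state vector carried across time steps
    layers.Dense(1),                                          # the predicted next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:1]))                                   # prediction for the first window
```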
  • 254. RNN… ▪ A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
  • 255. RNN – VARIOUS TYPES (1) Processing without RNN, from fixed-sized input to fixed-sized output Example: Image classification
  • 256. RNN – VARIOUS TYPES (2) Sequence output Example: Image captioning Takes an image and outputs a sentence of words. Single Image → Sequence of words (Caption)
  • 257–259. RNN – IMAGE CAPTIONING [illustrative examples]
  • 260. RNN – VARIOUS TYPES (3) Sequence input to a single output Example: Sentiment Prediction Takes a sequence of words (a sentence) and outputs a word (the sentiment). Sentence (sequence of words) → Sentiment (word)
  • 261. RNN – VARIOUS TYPES (4) Sequence input and sequence output Example: Machine Translation An RNN reads a sentence in English and then outputs a sentence in French.
  • 262. RNN – MACHINE TRANSLATION Sequence of words (German) → Sequence of words (English) The output only starts after we have seen the complete input, because the first word of the translated sentence may require information captured from the complete input sequence.
  • 263. RNN – VARIOUS TYPES (5) Synced sequence input and output Example: Video Classification Label each frame of the video
  • 265. RNN [diagram legend: x – input, w – weights, a – activation, b – output]
  • 267. RNN – DISADVANTAGES Can’t persist long-term dependencies. In “the clouds are in the sky,” we don’t need any further context to predict the last word – it’s pretty obvious it is going to be “sky”. In such cases, where the gap between the relevant information and the place where it’s needed is small, RNNs can learn to use the past information. But RNNs don’t have memory cells: x0 and x1 are already lost by the time we reach x3.
  • 268. RNN – DISADVANTAGES Can’t persist long-term dependencies. “I grew up in Bangalore… I speak fluent ?” Ans: Kannada. The recent information (“I speak”) suggests that the next word is probably the name of a language. But which language? For that we need the context of Bangalore, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large, and as the gap grows, RNNs become unable to learn to connect the information.
  • 269. LSTM – LONG SHORT-TERM MEMORY ▪ RNNs cannot capture long-term dependencies. “I grew up in France… I speak fluent French.” We need the context of France… so we go for a new variant of the RNN: the LSTM.
  • 270. GATED RECURRENT UNITS (GRU)
  • 271. BIDIRECTIONAL RNN (e.g. BiLSTM)  The output at time t may depend not only on the previous elements in the sequence, but also on future elements.  Example: to predict a missing word in a sequence you want to look at both the left and the right context.  Two RNNs stacked on top of each other, reading the sequence in opposite directions.  The output is computed based on the hidden states of both RNNs.  Huang Zhiheng, Xu Wei, Yu Kai. Bidirectional LSTM Conditional Random Field Models for Sequence Tagging.
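A hedged Keras sketch of a BiLSTM sequence tagger in this spirit (vocabulary size, tag count, sequence length and embedding size are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, num_tags, maxlen = 10000, 9, 50   # assumptions for illustration

model = keras.Sequential([
    layers.Embedding(vocab_size, 64, input_length=maxlen),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),          # forward + backward LSTM
    layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")),  # one tag per token
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```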
  • 272. Autoencoders Intro, Denoising AE, Convolutional AE 05
  • 273. UNSUPERVISED LEARNING Unsupervised Learning Data: X (no labels!) Goal: Learn the structure of the data (learn correlations between features)
  • 274. PCA – PRINCIPAL COMPONENT ANALYSIS - Statistical approach for data compression and visualization - Invented by Karl Pearson in 1901 - Weakness: linear components only.
  • 275. TRADITIONAL AUTOENCODER ▪ Unlike PCA, the autoencoder can use activation functions to achieve non-linearity. ▪ It has been shown that an AE without activation functions achieves the PCA capacity. [z denotes the latent code in the diagram]
  • 277. LATENT SPACE REPRESENTATION latent: (of a quality or state) existing but not yet developed or manifest; hidden or concealed.
  • 278. LATENT SPACE REPRESENTATION Original → Latent Representation In this fairly simple example, let’s say our original data set is a 5 x 5 x 1 image. We set the latent space size to 3 x 1, which means that each compressed data point is a vector with 3 dimensions. Now every 5x5x1 data point (25 numbers) is uniquely identified by just 3 numbers.
  • 279. LATENT SPACE REPRESENTATION But you may be wondering, what are “similar” images and why does reducing the size of the data make similar images “closer together” in space? If we look at the three images above, two chairs of different color and a table, we will easily say that the two images of the chair are more similar while the table is the most different in shape.
  • 280. LATENT SPACE REPRESENTATION But what makes the other two chairs “alike”? Of course we consider related features such as angles and characteristics of shape; each such feature is represented in the latent space. In the latent space, irrelevant information such as the colour of the seat is ignored, or in other words we remove that information. As a result, when we reduce the dimensionality, the images of the two chairs become less different and more similar.
  • 281. AE – SIMPLE IDEA - Given data x (no labels) we would like to learn the functions f (encoder) and g (decoder) where f(x) = act(Wx + b) = z and g(z) = act(W′z + b′) = x̂, such that h(x) = g(f(x)) = x̂, where h is an approximation of the identity function. (z is some latent representation or code, “act” is a non-linearity such as the sigmoid, and x̂ is the reconstruction of x.)
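A minimal Keras sketch of f and g as single Dense layers trained so that g(f(x)) reconstructs x (the 784-dimensional input and the 32-dimensional code z are assumptions, e.g. flattened 28x28 images):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
z = layers.Dense(32, activation="sigmoid")(inputs)        # f(x) = act(Wx + b) = z
x_hat = layers.Dense(784, activation="sigmoid")(z)        # g(z) = act(W'z + b') = x_hat

autoencoder = keras.Model(inputs, x_hat)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, ...)  # note: the input is also the target
```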
  • 282. CONVOLUTIONAL AE Input (28,28,1)
    Encoder: Conv1, 16 filters @ (3,3), same → (28,28,16); MaxPool1 (2,2), same → (14,14,16); Conv2, 8 filters @ (3,3), same → (14,14,8); MaxPool2 (2,2), same → (7,7,8); Conv3, 8 filters @ (3,3), same → (7,7,8); MaxPool3 (2,2), same → (4,4,8) = hidden code
    Decoder: DConv1, 8 filters @ (3,3), same → (4,4,8); UpSample1 (2,2) → (8,8,8); DConv2, 8 filters @ (3,3), same → (8,8,8); UpSample2 (2,2) → (16,16,8); DConv3, 16 filters @ (3,3), valid → (14,14,16); UpSample3 (2,2) → (28,28,16); DConv4, 1 filter @ (5,5), same → (28,28,1) = output
    * Input values are normalized * All of the conv layers use relu activations except for the last conv, which uses sigmoid
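A hedged Keras sketch of the convolutional autoencoder laid out above; the layer order follows the slide, though a couple of shapes (such as the channel count after the last up-sampling) are reconstructed and may differ slightly from the original figure:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inp)   # (28, 28, 16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                      # (14, 14, 16)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (14, 14, 8)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                      # (7, 7, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (7, 7, 8)
code = layers.MaxPooling2D((2, 2), padding="same")(x)                   # (4, 4, 8) hidden code

x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(code)   # (4, 4, 8)
x = layers.UpSampling2D((2, 2))(x)                                      # (8, 8, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (8, 8, 8)
x = layers.UpSampling2D((2, 2))(x)                                      # (16, 16, 8)
x = layers.Conv2D(16, (3, 3), activation="relu")(x)                     # valid -> (14, 14, 16)
x = layers.UpSampling2D((2, 2))(x)                                      # (28, 28, 16)
out = layers.Conv2D(1, (5, 5), activation="sigmoid", padding="same")(x) # (28, 28, 1)

conv_ae = keras.Model(inp, out)
conv_ae.compile(optimizer="adam", loss="binary_crossentropy")
```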
  • 283. DENOISING AUTOENCODERS Intuition: - We still aim to encode the input and NOT to mimic the identity function. - We try to undo the effect of a corruption process stochastically applied to the input. [Diagram: Noisy Input → Encoder → Latent space representation → Decoder → Denoised Input] A more robust model.
  • 285. DENOISING AUTOENCODERS Use Case: - Extract robust representation for a NN classifier. Encoder Latent space representation Noisy Input
  • 286. DENOISING AUTOENCODERS Instead of trying to mimic the identity function by minimizing L(x, g(f(x))), where L is some loss function, a DAE minimizes L(x, g(f(x̃))), where x̃ is a copy of x that has been corrupted by some form of noise.
  • 287. DENOISING AUTOENCODERS – PROCESS Take some input x and apply noise to obtain x̃
  • 288. DENOISING AUTOENCODERS – PROCESS Encode and decode the noisy input: the DAE computes g(f(x̃))
  • 289. DENOISING AUTOENCODERS – PROCESS The DAE output is the reconstruction x̂ = g(f(x̃))
  • 290. DENOISING AUTOENCODERS – PROCESS Compare the reconstruction x̂ with the original (clean) x
  • 291. DENOISING CONVOLUTIONAL AE – KERAS - 50 epochs. - Noise factor 0.5 - 92% accuracy on validation set.
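The corruption step with the noise factor 0.5 mentioned above might look like this (a sketch on MNIST; the dataset choice and batch size are assumptions, and `conv_ae` refers to the convolutional autoencoder sketched earlier):

```python
import numpy as np
from tensorflow import keras

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0          # normalize to [0, 1]
x_train = x_train[..., np.newaxis]                   # (N, 28, 28, 1)

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)

# Train on (noisy input -> clean target) pairs for the 50 epochs mentioned above:
# conv_ae.fit(x_train_noisy, x_train, epochs=50, batch_size=128)
```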
  • 292. DL Tools / Frameworks Keras, Tensorflow 06
  • 293. MACHINE LEARNING TOOLS/FRAMEWORKS Caffe is a deep learning framework made with expression, speed, and modularity in mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors, and is popular for computer vision. Caffe supports cuDNN v5 for GPU acceleration. Supported interfaces: C, C++, Python. Caffe2 is a deep learning framework enabling simple and flexible deep learning. Built on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind, allowing for a more flexible way to organize computation. Caffe2 supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python
  • 294. MACHINE LEARNING TOOLS/FRAMEWORKS The Microsoft Cognitive Toolkit — previously known as CNTK — is a unified deep-learning toolkit from Microsoft Research that makes it easy to train and combine popular model types across multiple GPUs and servers. Microsoft Cognitive Toolkit implements highly efficient CNN and RNN training for speech, image and text data. Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration. Supported interfaces: Python, C++, C# and command line interface. TensorFlow is a software library for numerical computation using data flow graphs, developed by Google’s Machine Intelligence research organization. TensorFlow supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python
  • 295. MACHINE LEARNING TOOLS/FRAMEWORKS Theano is a math expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions involving multi-dimensional arrays. Theano supports cuDNN v5 for GPU acceleration. Supported interfaces: Python. Keras is a minimalist, highly modular neural networks library, written in Python, and capable of running on top of either TensorFlow or Theano. Keras was developed with a focus on enabling fast experimentation. The cuDNN version depends on the version of TensorFlow or Theano installed with Keras. Supported interfaces: Python