Ramesh Naidu L
Joint Director
Big Data Analytics & Machine Learning
C-DAC, Bangalore
Deep Learning
Outline
Introduction
What is Neural Network?
01
Deep Neural Network
DNN and its components
02
CNN
Convolutional Neural Networks & Transfer
Learning
03
Autoencoders
Latent space networks
04
RNN
Recurrent Neural Networks &
Its variants
05
Applications of DL
Applications and future of DL
06
Introduction to Deep Learning
What is Deep Learning?
01
INTRODUCTION
WHAT WE LEARN?
Challenges…
Approach…
Deep learning isn’t magic.
But it is very good at finding patterns…to make predictions.
Andrew Ng
Input
Input space Feature space
Motorbikes
“Non”-Motorbikes
Feature
Extraction
Machine
Learning
algorithm
pixel 1
pixel
2
“wheel”
“handle”
handle
wheel
What is Machine Learning?
Andrew Ng
ML vs. DL
THE BRAIN AND DEEP LEARNING
HISTORY
Dendrites
(Input Vector)
Soma (weighted summation of positive and negative signals from dendrites)
Axon (Output) (transmits the output signal to other neurons)
axon
soma
dendrites
Image Credits: https://en.wikipedia.org/wiki/Neuron
NEURAL NETWORK (BIOLOGY)
+
dendrites
soma
axon
NEURAL NETWORK (BIOLOGY)…
+
dendrites
soma
axon
NEURAL NETWORK (BIOLOGY)…
SYNAPSE
synapse
Credits: https://en.wikipedia.org/wiki/Neuron
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
soma
Axon
Dendrites
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK (BIOLOGY)…
NEURAL NETWORK
DEEP NEURAL NETWORK
[Deep] Neural Networks
Neuron to Deep Networks
02
ALGORITHMS/NETWORKS
▪ Single Perceptron
▪ Multi Layer Perceptron (MLP)
▪ Convolutional Neural Networks (CNN and its variants)
▪ Recurrent Neural Networks (RNN, LSTM, GRU)
▪ Autoencoders (for dimensionality reduction, unsupervised)
SINGLE PERCEPTRON
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
▪ Will the weather be nice?
▪ What’s the music like?
▪ Do I have anyone to go with?
▪ Can I afford it?
▪ Do I need to write my exam?
▪ Will I like the food in the food stalls at the stadium?
Just take the first four
for simplicity
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
To make my decision, I would consider how important each factor was to me, weigh them all up, and see if the result was over a certain threshold.
If so, I would go to the match! It's a bit like the process we often use of weighing up pros and cons to make a decision.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Give some weights…and scores…
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Scores:
• weather is looking pretty good but not perfect - 3 out of 4.
• music is ok but not my favourite - 2 out of 4.
• Company - my best friend has said she’ll come with me so I know the
company will be great, - 4 out of 4.
• Money - little pricey but not completely unreasonable - 2 out of 4.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Now for the importance:
• Weather - I really want to go if it’s sunny, very important - I give it a full 4 out of 4.
• Music - I’m happy to listen to most things, not super important - 2 out of 4.
• Company - I wouldn’t mind too much going to the match on my own, so
company can have a 2 out of 4 for importance.
• Money - I’m not too worried about money, so I can give it just a 1 out of 4.
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
Total Score = Sum of (Score x Weight) = 26
Threshold = 25
Score (26) > Threshold (25) → Going to the Cricket Match
WHAT HAPPENS IN A NEURON?
Problem: Whether to go to a cricket match or not?
This number is what is known as the ‘bias’
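As a minimal sketch (not from the original slides), the same decision written as a tiny perceptron in Python, using the scores, importances and threshold from the example above:

```python
# Perceptron view of the "go to the cricket match?" decision.
scores  = [3, 2, 4, 2]   # weather, music, company, money (out of 4)
weights = [4, 2, 2, 1]   # importance of each factor (out of 4)
threshold = 25           # the 'bias' acts as this threshold

weighted_sum = sum(s * w for s, w in zip(scores, weights))   # 3*4 + 2*2 + 4*2 + 2*1 = 26
decision = weighted_sum > threshold                          # 26 > 25 -> go to the match
print(weighted_sum, "go!" if decision else "stay home")
```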
WHAT HAPPENS IN A NEURON?
A general form of a Neuron/Single Perceptron
A PERCEPTRON – GENERAL FORM
• The basic unit of computation in a neural network is the neuron, often called
a node or unit.
• It receives input from some other nodes, or from an external source, and computes an output.
• Each input has an associated weight (w), which is assigned on the basis of its
relative importance
• The node applies a function f (defined below) to the weighted sum of its inputs
A PERCEPTRON…HOW IT WORKS?
1. All the inputs x are multiplied by their weights w; call the products k.
Fig: Multiplying inputs with weights for 5 inputs
A PERCEPTRON…HOW IT WORKS?
2. Add all the multiplied values and call them Weighted Sum.
Fig: Adding with Summation
WHY DO WE NEED WEIGHTS AND BIAS?
• Weights show the strength of the particular node/input.
• A bias value allows you to shift the activation function to the left or right.
WHY DO WE NEED WEIGHTS AND BIAS?
• Weights show the strength of the particular node/input.
• A bias value allows you to shift the activation function to the left or right.
Without the bias, a ReLU network is not able to deviate from zero at (0, 0).
Why do we need an activation function??
WHY DO WE NEED “ACTIVATION FUNCTION”?
For example, if we simply want to estimate the cost of a meal, we could use a neuron based on
a linear function f(z) = wx + b. Using the identity activation f(z) = z and weights equal to the
price of each item, the linear neuron would take in the number of burgers, fries, and sodas and
output the price of the meal.
The above kind of linear neuron is very easy
to compute.
However, by just having an interconnected
set of linear neurons, we won’t be able to
learn complex relationships: taking multiple
linear functions and combining them still
gives us a linear function, and we could
represent the whole network by a simple
matrix transformation.
WHY DO WE NEED “ACTIVATION FUNCTION”?
Real-world data is mostly non-linear, so this is where
activation functions come to help.
Activation functions introduce non-linearity into the
output of a neuron, and this ensures that we can
learn more complex functions by approximating
every non-linear function as a linear combination of
a large number of non-linear functions.
Also, activation functions help us to limit the
output to a certain finite value.
WHY DO WE NEED “ACTIVATION FUNCTION”?
• Their main purpose is to convert an input signal of a node in an ANN to
an output signal.
• In short, the activation functions are used to map the input between the
required values like (0, 1) or (-1, 1).
• It is also known as Transfer Function.
• They introduce non-linear properties to our Network.
• That output signal now is used as a input in the next layer in the stack.
The Activation Functions can be basically divided into 2 types-
• Linear Activation Function
• Non-linear Activation Functions
Sigmoid
tanh
ReLU
Leaky ReLU
Maxout
ELU
ACTIVATION FUNCTIONS
SIGMOID
• Sigmoid is a special case of the logistic function with
L = 1, k = 1 and x₀ = 0
• Takes a real number and
transforms it to → σ(x)∈(0,1).
• Large Negative → 0
• Large Positive → 1
Sigmoidal → S-shaped
Logistic
Sigmoid
Derivative of Sigmoid (dotted Line)
SIGMOID – DISADVANTAGES
1) Sigmoids saturate and kill gradients
• The neuron’s activation saturates at either tail (0 or 1)
• No gradient → no signal to the neuron
• Too large weights → neurons saturate → no learning
• Beyond about ±6, the gradient of the sigmoid is practically zero
(Vanishing Gradient Problem)
2) Output is not zero centred
HYPERBOLIC TANGENT FUNCTION - TANH
• It is also sigmoidal
• Mathematically,
tanh(x) = 2σ(2x)−1
Transforms data to -1 to 1
tanh(x)∈(−1,1).
• Output is zero centred
• It also saturates and kills gradient
RELU (RECTIFIED LINEAR UNIT) ACTIVATION FUNCTION
R(x) = max(0,x) ➔
R(x) = 0 if x <= 0
R(x) = x if x > 0
• The ReLU is the most used
activation function
• Range: [ 0 to infinity)
• More effective than sigmoidal
functions
• Disadvantage: any negative input
given to the ReLU activation function
turns the value into zero immediately in
the graph, which in turn affects the
resulting graph by not mapping the
negative values appropriately.
R(x) = max(0,x) ➔
R(x) = 0 if x <= 0
R(x) = x if x > 0
LEAKY RELU
Fix the “dying ReLU” problem by introducing a
small negative slope (0.01 or so):
f(x) = 0.01x, if x < 0
f(x) = x, if x ≥ 0
- Does not saturate
- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
[Mass et al., 2013]
[He et al., 2015]
RANDOMIZED RELU
• Fix the problem of dying neurons
• It introduces a small slope to keep the updates alive.
• The leak helps to increase the range of the ReLU function. Usually, the value
of a is 0.01 or so.
• When a is not 0.01 then it is called Randomized ReLU.
SOFTMAX
• Transforms a real valued vector to [0, 1]
• Output of softmax is the list of probabilities of k different classes/categories
j = 1, 2, ..., K.
SOFTMAX – CLASSIFICATION OF DIGITS
• Transforms a real valued vector to [0, 1]
• Output of softmax is the list of probabilities of 10 digits
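As an illustrative sketch (not shown on the slides), a numerically stable softmax in NumPy:

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector to probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))         # approximately [0.659 0.242 0.099]
```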
WHICH ACTIVATION FUNCTION TO USE?
▪ Use ReLU, which should only be applied to the hidden layers.
▪ If our model suffers from dead neurons during training, we should use Leaky ReLU.
▪ Sigmoid and tanh should not be used nowadays due to the vanishing gradient problem,
which causes a lot of problems when training a neural network model.
▪ Softmax is used in the last layer (because it gives output probabilities).
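For reference, the activation functions discussed above can be sketched in NumPy as follows (an illustration, not a framework implementation):

```python
import numpy as np

def sigmoid(x):                # output in (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                   # zero-centred output in (-1, 1)
    return np.tanh(x)

def relu(x):                   # 0 for negative inputs, identity for positive
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):     # small slope a keeps negative inputs "alive"
    return np.where(x > 0, x, a * x)

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
```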
ACTIVATION FUNCTION CHEAT SHEET
SINGLE NEURON TO A NETWORK - NEURAL NETWORK
SINGLE NEURON TO A NETWORK
• The outputs of the neurons in one layer flow through to become the inputs of
the next layer, and so on.
A NETWORK TO A DEEP NETWORK
Add another hidden layer….
a neural network can be considered ‘deep’ if it contains more than one hidden layer...
FEEDFORWARD NEURAL NETWORK
Add another hidden layer….
a neural network can be considered ‘deep’ if it contains more than one hidden layer...
TRAINING THE NETWORK
1. Randomly initialise the network weights and biases
2. Training:
• Get a ton of labelled training data (e.g. pictures of cats labelled ‘cats’ and
pictures of other stuff also correctly labelled)
• For every piece of training data, feed it into the network
3. Testing:
• Check whether the network gets it right (given a picture labelled ‘cat’, is
the result of the network also ‘cat’ or is it ‘dog’, or something else?)
• If not, how wrong was it? Or, how right was it? (What probability did it
assign to its guess?)
4. Tuning/Improving: Nudge the weights a little to increase the probability of
the network getting the answer right.
EXAMPLE – CLASSIFY THE IMAGES – TRUMP & HILLARY
TRAINING THE NETWORK
Predict: 72% Trump, 28% Hillary
1. Two output neurons (one neuron for each class)
2. O/P: A probability of 72% that the image is “Trump” → 72% is insufficient
HOW TO INCREASE THE ACCURACY?
Predict: 72% Trump, 28% Hillary
1. Two output neurons, one for each class
2. O/P: A probability of 72% that the image is “Trump” → 72% is not sufficient
Go Backwards &
Adjust weights
HOW TO INCREASE THE ACCURACY?
▪ Measure the difference between the network’s output and the correct output using the ‘loss function’.
▪ Which Loss Function??
▪ Depends on the data
▪ Example: Mean Square Error (MSE), Categorical Cross Entropy
HOW TO INCREASE THE ACCURACY?
▪ Measure the difference between the network’s output and the correct output using the ‘loss function’.
▪ Which Loss Function??
▪ Depends on the data
▪ Example: Mean Square Error
The goal of training is to find the weights and
biases that minimizes the loss function.
Image source: firsttimeprogrammer.blogspot.co.u
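For concreteness, the two example loss functions can be written as below (a sketch; y_true and y_pred are hypothetical arrays of labels and predictions):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error, e.g. for regression-style outputs."""
    return np.mean((y_true - y_pred) ** 2)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy for one-hot labels and softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

y_true = np.array([[0, 1, 0]])
y_pred = np.array([[0.2, 0.7, 0.1]])
print(mse(y_true, y_pred), categorical_cross_entropy(y_true, y_pred))
```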
HOW TO INCREASE THE ACCURACY?
▪ We can plot the loss against the weights.
▪ To do this accurately, we would need to be able to visualise tons of dimensions,
to account for the many weights and biases in the network.
▪ Because I find it difficult to visualise more than three dimensions, let’s pretend we only
need to find two weight values. We can then use the third dimension for the loss.
▪ Before training the network, the weights and biases are randomly initialised so
the loss function is likely to be high as the network will get a lot of things
wrong.
▪ Our aim is to find the lowest point of the loss function, and then see what
weight values that corresponds to. (See Next Slide).
HOW TO INCREASE THE ACCURACY?
▪ Here, we can easily see
where the lowest point is
and could happily read off
the corresponding weights
values.
▪ Unfortunately, it’s not that
easy in reality. The network
doesn’t have a nice
overview of the loss
function, it can only know
what its current loss is and
its current weights and
biases are.
Image source: firsttimeprogrammer.blogspot.co.u
HOW TO INCREASE THE ACCURACY – GRADIENT DESCENT
Image source: firsttimeprogrammer.blogspot.co.u
FEW MORE CONCEPTS…
BASIC UNITS OF NEURAL NETWORK
Input Layer: Input variables, sometimes called the visible layer.
Hidden Layers: Layers of nodes between the input and output layers. There may be
one or more of these layers.
Output Layer: A layer of nodes that produce the output variables.
Things to describe the shape and capability of a neural network:
▪ Size: The number of nodes in the model/network.
▪ Width: The number of nodes in a specific layer.
▪ Depth: The number of layers in a neural network (We don’t count input as a layer).
▪ Capacity: The type or structure of functions that can be learned by a network
configuration. Sometimes called “representational capacity“.
▪ Architecture: The specific arrangement of the layers and nodes in the network.
EPOCH, BATCH, ITERATION
EPOCH, BATCH, ITERATION
Learning Rate
Learning Rate (LR)
• Deep learning neural networks are trained using the stochastic gradient descent
algorithm.
• An optimization algorithm that estimates the error gradient for the current state of the
model using examples from the training dataset, then updates the weights of the model
using the back-propagation of errors algorithm
• Learning Rate: The amount that the weights are updated during training is referred to
as the step size or the “learning rate.”
• Specifically, the learning rate is a configurable hyperparameter used in the training of
neural networks that has a small positive value, often in the range between 0.0 and 1.0.
Learning Rate
• The learning rate controls how quickly the model is adapted to the problem.
• Smaller learning rates → Require more training epochs given the smaller changes made
to the weights each update
• Larger learning rates → Result in rapid changes and require fewer training epochs.
• We have to choose the learning rate carefully
The learning rate is perhaps the most important hyperparameter.
If you have time to tune only one hyperparameter, tune the learning rate.
Learning Rate
Learning Rate – How to find optimal LR?
Gradually increase the learning rate after each mini batch, recording the loss at each
increment.
This gradual increase can be on either a linear or exponential scale.
a) For learning rates which are too low, the loss may decrease, but at a very shallow rate.
b) When entering the optimal learning rate zone, you'll observe a quick drop in the loss
function.
c) Increasing the learning rate further will cause an increase in the loss as the parameter
updates cause the loss to "bounce around" and even diverge from the minima.
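A sketch of this learning-rate range test (train_step and batches are assumed placeholders for one training update and the mini-batches, not part of the slides):

```python
import numpy as np

def lr_range_test(train_step, batches, lr_min=1e-6, lr_max=1.0):
    """Exponentially increase the LR each mini-batch and record the loss.

    train_step(batch, lr) is an assumed helper that performs one weight
    update with the given learning rate and returns the loss.
    """
    n = len(batches)
    lrs = np.exp(np.linspace(np.log(lr_min), np.log(lr_max), n))
    losses = [train_step(batch, lr) for batch, lr in zip(batches, lrs)]
    return lrs, losses   # plot loss vs. lr and pick a value just before the loss blows up
```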
Learning Rate – How to find optimal LR?
Learning Rate – How to find optimal LR?
Reference:
https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-
learning-neural-networks/
Optimizers
1. Simple Gradient Descent
When we initialize our weights, we are at point A in
the loss landscape.
The first thing we do is to check, out of all possible
directions in the x-y plane, moving along which
direction brings about the steepest decline in the
value of the loss function.
This is the direction we have to move in.
Problems with SGD – Local Minima
Gradient descent is driven by the gradient, which will be zero at the base of any minimum.
A local minimum is called so since the value of the loss function is minimum at that point in a local region.
A global minimum is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function.
Optimization: Problems with SGD – Saddle Points
• A saddle point gets its name from the saddle of a horse, which it resembles.
• While it is a minimum in one direction (x), it is a local maximum in another direction,
and if the contour is flatter towards the x direction, GD would keep oscillating to and
fro in the y direction, giving us the illusion that we have converged to a minimum.
https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
Optimization: Problems with SGD – Local Minima
A global minimum is called so since the value of the loss function is minimum there,
globally across the entire domain of the loss function.
A Complicated Loss Landscape
Optimization: Problems with SGD
What if loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along shallow dimension, jitter along steep direction
Optimization: Problems with SGD – All at one place
Local Minima Saddle points
Poor Conditioning
2. SGD + Momentum
Try to capture some information regarding the previous updates a weight has gone through
before performing the current update.
If a weight is constantly moving in a particular direction (increasing or decreasing), it
will slowly accumulate some “momentum” in that direction.
When it faces some resistance and actually has to go the opposite way, it will still continue
going in the original direction for a while because of the accumulated momentum.
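A minimal sketch of the momentum update (the gradient function, learning rate and gamma below are illustrative values, not taken from the slides):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """One update: v accumulates past gradients, w moves along v."""
    v = gamma * v + lr * grad(w)   # momentum remembers the previous direction
    w = w - v
    return w, v

# toy example: minimise f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=lambda w: 2 * w)
print(w)   # close to 0
```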
2. SGD + Momentum
Example:
• This is comparable to the actual momentum in
physics. Let’s consider a small valley and a ball.
• If we release the ball at one side of the valley,
it will continuously roll down and in that
process it is gaining momentum towards that
direction.
• Finally when the ball reaches the bottom, it
doesn’t stop there.
• Instead it starts rolling up on the opposite side
for a while since it has gained some
momentum even though gravity is telling it to
come down.
3. Nesterov Momentum
Instead of calculating the gradients at the current position, we calculate them from an
approximate future position.
This is because we want to calculate our gradients in a smarter way.
Just before reaching a minimum, the momentum will start reducing because we are using
gradients from a future point. This results in improved stability and fewer oscillations
while converging; furthermore, in practice it performs better than pure momentum.
3. Nesterov Momentum
Momentum full formula expanded
Calculating approximate future weight value
4. AdaGrad (Adaptive Optimization)
In these methods, the parameters like learning rate (alpha) and momentum coefficient (n)
will not stay constant throughout the training process. Instead, these values will constantly
adapt for each and every weight in the network and hence will also change along with the
weights.
The first algorithm we are going to look into is Adagrad (Adaptive Gradient).
Adagrad tries to change the learning rate (alpha) for each update.
The learning rate changes during each update in such a way that it will decrease if a
weight is being updated too much in a short amount of time and it will increase if a weight
is not being updated much.
4. AdaGrad (Adaptive Optimization)
Each weight has its own cache value, which collects the squares of the gradients till the
current point.
The cache will continue to increase in value as the training progresses. Now the new
update formula is as follows:
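In code, the AdaGrad update described above looks roughly like this (eps is the usual small constant for numerical stability, not shown on the slide):

```python
import numpy as np

def adagrad_step(w, cache, g, lr=0.01, eps=1e-8):
    """Each weight's cache accumulates its squared gradients, scaling down
    the learning rate for weights that have been updated a lot."""
    cache = cache + g ** 2                    # the cache only ever grows
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```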
5. RMSProp
In RMSProp the only difference lies in the cache updating strategy. In the new formula, we
introduce a new parameter, the decay rate (gamma).
gamma value is usually around 0.9 or 0.99.
Hence for each update, the square of gradients get added at a very slow rate compared to
adagrad.
This ensures that the learning rate is changing constantly based on the way the weight is
being updated, just like adagrad, but at the same time the learning rate does not decay
too quickly, hence allowing training to continue for much longer.
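The corresponding RMSProp sketch only changes how the cache is updated (gamma is the decay rate mentioned above):

```python
import numpy as np

def rmsprop_step(w, cache, g, lr=0.01, gamma=0.9, eps=1e-8):
    """Like AdaGrad, but the cache is an exponential moving average,
    so the effective learning rate does not decay towards zero."""
    cache = gamma * cache + (1 - gamma) * g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```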
6. Adam
Adam is a little like combining RMSProp with Momentum. First we calculate our m value,
which will represent the momentum at the current point.
Adam
Momentum
Update Formula
The only difference between this equation and the momentum equation is that instead of the
learning rate we multiply the current gradient by (1 - beta_1). Next we calculate the
accumulated cache, which is exactly the same as in RMSProp:
Now we can get the final
update formula:
6. Adam
• As we can observe here, we are accumulating the gradients by calculating momentum, and we
are also constantly changing the learning rate by using the cache.
• Due to these two features, Adam usually performs better than any other optimizer out
there and is usually preferred while training neural networks.
• In the Adam paper, the recommended parameters are 0.9 for beta_1, 0.999 for beta_2
and 1e-08 for epsilon.
In practice:
- Adam is a good default choice in many cases
- SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning
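Putting the two ideas together gives a sketch of the Adam update (bias correction is included as in the original paper; the hyperparameter values are the recommended defaults):

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the momentum (1st moment), v the cache (2nd moment),
    t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * g          # momentum-like accumulation
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp-like cache
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```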
BACK PROPAGATION
HOW A NEURON WORKS?
1. Each neuron computes a weighted sum of the outputs of the previous
layer (which, for the inputs, we’ll call x),
2. applies an activation function, before forwarding it on to the next
layer.
3. Each neuron in a layer has its own set of weights—so while each
neuron in a layer is looking at the same inputs, their outputs will all be
different.
HOW A NETWORK WORKS?
1. We want to find a set of weights W that minimizes the output of J(W).
2. Consider a simple two-layer network.
3. Each neuron is a function of the previous one connected to it. In other words,
if one were to change the value of w1, both “hidden 1” and “hidden 2” (and
ultimately the output) neurons would change.
HOW A NETWORK WORKS?
we can mathematically formulate the output as an extensive composite
function:
INTRODUCE ERROR
we can mathematically formulate the output as an extensive composite
function:
J is a function of → input, weights, activation function and output
HOW TO MINIMIZE THE ERROR – BACK PROPAGATION
The error derivative can be used in gradient descent (as discussed
earlier) to iteratively improve upon the weight.
We compute the error derivatives w.r.t. every other weight in the
network and apply gradient descent in the same way.
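As a concrete sketch, the forward and backward pass for a tiny two-layer network with sigmoid hidden units and an MSE loss (the data, layer sizes and learning rate are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # toy inputs and targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.1

for step in range(100):
    # forward pass
    h = sigmoid(x @ W1 + b1)              # hidden layer
    y_hat = h @ W2 + b2                   # linear output
    loss = np.mean((y_hat - y) ** 2)

    # backward pass: chain rule from the loss back to every weight
    d_yhat = 2 * (y_hat - y) / len(x)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_h = (d_yhat @ W2.T) * h * (1 - h)   # sigmoid derivative
    dW1, db1 = x.T @ d_h, d_h.sum(axis=0)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```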
BACK PROPAGATION
Find out the best “w1” by reducing the error from the last layer to first
layer….
BACK PROPAGATION
How does Backpropagation work?
Refer to the document – backprop.docx
Convolutional Neural Networks
CNN, Transfer Learning etc.
03
Fast-forward to today: ConvNets are everywhere
Classification Retrieval
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Fast-forward To Today: ConvNets Are Everywhere
Figures copyright Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, 2015. Reproduced with permission.
[Faster R-CNN: Ren, He, Girshick, Sun 2015]
Detection Segmentation
[Farabet et al.,
2012]
Figures copyright Clement Farabet, 2012. Reproduced with permission.
CONVOLUTIONAL NEURAL NETWORK
Credits: Adam
CNN – STEP 1: BREAK THE IMAGE INTO OVERLAPPING IMAGE TILES
CNN - STEP 2: FEED EACH IMAGE TILE INTO A SMALL NEURAL NETWORK
CNN - STEP 3: SAVE THE RESULTS FROM EACH TILE INTO A NEW ARRAY
CNN - STEP 4: DOWN SAMPLING
The result of Step 3 was an array that maps out which parts of the original image are the
most interesting. But that array is still pretty big.
CNN - STEP 4: DOWN SAMPLING…
To reduce the size of the array, we down sample it using an algorithm called max pooling. It
sounds fancy, but it isn’t at all!
We’ll just look at each 2x2 square of the array and keep the biggest number:
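A minimal 2x2 max-pooling sketch in NumPy (assumes the array height and width are even):

```python
import numpy as np

def max_pool_2x2(a):
    """Keep the largest value in every non-overlapping 2x2 square."""
    h, w = a.shape
    return a[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 2],
              [0, 1, 3, 4]])
print(max_pool_2x2(a))   # [[4 2]
                         #  [5 4]]
```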
CNN – 5 STEP PIPELINE
So from start to finish, our whole five-step pipeline looks like this:
CNN – Layers/Components
▪ Convolution Layer
▪ Pooling Layer
▪ Fully Connected Layer
HOW CNN WORKS?
X or O
CNN
A two-dimensional
array of pixels
CNN
X
CNN
O
CNN
X
CNN
O
?
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 1 -1 -1
-1 -1 -1 1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
?
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 X -1 -1 -1 -1 X X -1
-1 X X -1 -1 X X -1 -1
-1 -1 X 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 X -1 -1
-1 -1 X X -1 -1 X X -1
-1 X X -1 -1 -1 -1 X -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 1 -1 -1
-1 -1 -1 1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
x
=
=
=
Convolution
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
Features match pieces of the image
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
math match
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 1 1
1 1 1
1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 1 1
-1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1 1 -1
1 1 1
-1 1 1
0.55
1 1 -1
1 1 1
-1 1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
=
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1 -1 -1
-1 1 -1
-1 -1 1
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
=
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
=
=
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
One image becomes a stack of filtered images
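The filtering step behind these maps (slide a 3x3 feature over the image, multiply element-wise and average) can be sketched as follows; values such as 0.77, 0.55 and 1.00 in the maps above come out of exactly this computation:

```python
import numpy as np

def filter_image(image, feature):
    """Slide `feature` over `image`; at each position multiply element-wise
    and take the mean, as in the X/O example (outputs lie in [-1, 1])."""
    fh, fw = feature.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]
            out[i, j] = np.mean(patch * feature)
    return out
```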
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
1 -1 -1
-1 1 -1
-1 -1 1
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-1 -1 1
-1 1 -1
1 -1 -1
1 -1 1
-1 1 -1
1 -1 1
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
Pooling
1.00
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
A stack of images becomes a stack of smaller images
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
Change everything negative to zero → ReLU
ReLU
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77
0.77 0
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0 1.00 0 0.33 0 0.11 0
0.11 0 1.00 0 0.11 0 0.55
0.33 0.33 0 0.55 0 0.33 0.33
0.55 0 0.11 0 1.00 0 0.11
0 0.11 0 0.33 0 1.00 0
0.33 0 0.55 0.33 0.11 0 0.77
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.77 0 0.11 0.33 0.55 0 0.33
0 1.00 0 0.33 0 0.11 0
0.11 0 1.00 0 0.11 0 0.55
0.33 0.33 0 0.55 0 0.33 0.33
0.55 0 0.11 0 1.00 0 0.11
0 0.11 0 0.33 0 1.00 0
0.33 0 0.55 0.33 0.11 0 0.77
0.33 0 0.11 0 0.11 0 0.33
0 0.55 0 0.33 0 0.55 0
0.11 0 0.55 0 0.55 0 0.11
0 0.33 0 1.00 0 0.33 0
0.11 0 0.55 0 0.55 0 0.11
0 0.55 0 0.33 0 0.55 0
0.33 0 0.11 0 0.11 0 0.33
0.33 0 0.55 0.33 0.11 0 0.77
0 0.11 0 0.33 0 1.00 0
0.55 0 0.11 0 1.00 0 0.11
0.33 0.33 0 0.55 0 0.33 0.33
0.11 0 1.00 0 0.11 0 0.55
0 1.00 0 0.33 0 0.11 0
0.77 0 0.11 0.33 0.55 0 0.33
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
0.77 -0.11 0.11 0.33 0.55 -0.11 0.33
-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11
0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55
0.33 0.33 -0.33 0.55 -0.33 0.33 0.33
0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11
-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11
0.33 -0.11 0.55 0.33 0.11 -0.11 0.77
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11
0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11
-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1.00 0.33 0.55 0.33
0.33 1.00 0.33 0.55
0.55 0.33 1.00 0.11
0.33 0.55 0.11 0.77
0.33 0.55 1.00 0.77
0.55 0.55 1.00 0.33
1.00 1.00 0.11 0.55
0.77 0.33 0.55 0.33
0.55 0.33 0.55 0.33
0.33 1.00 0.55 0.11
0.55 0.55 0.55 0.11
0.33 0.11 0.11 0.33
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
1.00 0.55
0.55 1.00
0.55 1.00
1.00 0.55
1.00 0.55
0.55 0.55
Fully Connected Layer
1.00 0.55
0.55 1.00
0.55 1.00
1.00 0.55
1.00 0.55
0.55 0.55
1.00
0.55
0.55
1.00
1.00
0.55
0.55
0.55
0.55
1.00
1.00
0.55
X
O
1.00
0.55
0.55
1.00
1.00
0.55
0.55
0.55
0.55
1.00
1.00
0.55
Feature Vector
X
O
0.55
1.00
1.00
0.55
0.55
0.55
0.55
0.55
1.00
0.55
0.55
1.00
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
X
O
0.9
0.65
0.45
0.87
0.96
0.73
0.23
0.63
0.44
0.89
0.94
0.53
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
X
O
TRANSFER LEARNING
Transfer Learning
Know how to ride a motorbike ⮫ Learn how to ride a car
Pre-trained models as feature extractors
Transfer Learning with CNNs
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
FC-1000
FC-4096
FC-4096
1. Train on Imagenet
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
2. Small Dataset (C classes)
FC-C
FC-4096
FC-4096
Freeze these
Reinitialize
this and train
Image
MaxPool
Conv-64
Conv-64
MaxPool
Conv-128
Conv-128
MaxPool
Conv-256
Conv-256
MaxPool
Conv-512
Conv-512
Conv-512
Conv-512
MaxPool
FC-C
FC-4096
FC-4096
Train these
With bigger
dataset, train
more layers
Freeze these
Lower learning
rate when
finetuning; 1/10 of
original LR is
good starting
point
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An
Astounding Baseline for Recognition”, CVPR Workshops
2014
3. Bigger dataset
CNN Architectures
CNN Architectures (Pre-Trained Models)
Examples: (Available in Keras)
▪ Xception
▪ VGG16
▪ VGG19
▪ ResNet, ResNetV2
▪ InceptionV3
▪ InceptionResNetV2
▪ MobileNet, MobileNetV2
▪ DenseNet
▪ NASNet
Find them @
https://keras.io/applications/
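A sketch of the freeze-and-retrain recipe from the previous slides using Keras (VGG16 as the frozen feature extractor; the new head and the number of classes C are placeholders, not values from the slides):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

C = 10  # number of classes in the new, smaller dataset (placeholder)

# 1. Load the convolutional layers pre-trained on ImageNet, without the FC head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained features

# 2. Reinitialise and train only a new classifier on top
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(C, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(...)  # with a bigger dataset, unfreeze more layers and lower the LR
```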
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
(N – F)/S + 1
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Output volume [55x55x96]
Q: What is the total number of parameters in this layer?
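For reference, the parameter count follows directly from the filter shape: each of the 96 filters spans 11x11x3 and has one bias.

```python
filters, fh, fw, depth = 96, 11, 11, 3
params = filters * (fh * fw * depth + 1)   # +1 bias per filter
print(params)                              # 34,944 parameters
```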
Case Study: AlexNet
[Krizhevsky et al. 2012]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
...
Case Study: AlexNet
[Krizhevsky et al. 2012]
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class
scores)
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
Case Study: AlexNet
[Krizhevsky et al. 2012]
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class
scores)
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with
permission.
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year: Lin et al / Sanchez & Perronnin (shallow), Krizhevsky et al (AlexNet, 8 layers), Zeiler & Fergus (8 layers), Simonyan & Zisserman (VGG, 19 layers), Szegedy et al (GoogLeNet, 22 layers), He et al (ResNet, 152 layers), Shao et al, Hu et al (SENet), Russakovsky et al]
First CNN-based winner: AlexNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
ZFNet: Improved
hyperparameters
over AlexNet
ZFNet [Zeiler and Fergus,
2013]
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
Deeper Networks
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Small filters, Deeper networks
8 layers (AlexNet)
-> 16 and 19 layers (VGG16, VGG19)
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
11.7% top 5 error in ILSVRC’13 (ZFNet)
-> 7.3% top 5 error in ILSVRC’14
AlexNet VGG16 VGG19
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note:
Most memory is
in early CONV
Most params
are in late FC
INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)
VGG16
Common names
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
winners
[Chart: ILSVRC winners by year, as above]
Deeper Networks
Case Study: GoogLeNet
[Szegedy et al., 2014]
Inception module
- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters!
12x less than AlexNet
- ILSVRC’14 classification winner
(6.7% top 5 error)
Deeper networks, with computational
efficiency
Case Study: GoogLeNet
[Szegedy et al., 2014]
“Inception module”: design a
good local network topology
(network within a network) and
then stack these modules on
top of each other
Inception module
Case Study: GoogLeNet
[Szegedy et al., 2014]
Naive Inception module
Apply parallel filter operations on
the input from previous layer:
- Multiple receptive field sizes
for convolution (1x1, 3x3,
5x5)
- Pooling operation (3x3)
Concatenate all filter outputs
together depth-wise
Case Study: GoogLeNet
[Szegedy et al., 2014]
Naive Inception module
Apply parallel filter operations on
the input from previous layer:
- Multiple receptive field sizes
for convolution (1x1, 3x3,
5x5)
- Pooling operation (3x3)
Concatenate all filter outputs
together depth-wise
Q: What is the problem with this?
[Hint: Computational complexity]
Case Study: GoogLeNet
[Szegedy et al., 2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q1: What is the output size of
the 1x1 conv, with 128 filters?
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q1: What is the output size of
the 1x1 conv, with 128 filters?
28x28x128
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q2: What are the output sizes
of all different filter operations?
28x28x128
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q2: What are the output sizes
of all different filter operations?
28x28x128
28x28x192
28x28x96
28x28x256
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Naive Inception module
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops
Case Study: GoogLeNet
[Szegedy et al.,
2014]
Naive Inception module
Q: What is the problem with this?
[Hint: Computational complexity]
Example:
Module
input:
28x28x256
Q3:What is output size
after filter concatenation?
28x28x128
28x28x192
28x28x96
28x28x256
28x28x(128+192+96+256) =
28x28x672
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops
Very expensive compute
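The 854M figure is the sum of (output size x filter size x input depth) for the three convolutions, e.g.:

```python
h = w = 28
in_depth = 256
ops = (h * w * 128 * 1 * 1 * in_depth      # 1x1 conv, 128 filters
       + h * w * 192 * 3 * 3 * in_depth    # 3x3 conv, 192 filters
       + h * w * 96 * 5 * 5 * in_depth)    # 5x5 conv, 96 filters
print(f"{ops / 1e6:.0f}M multiply-accumulate ops")   # ~854M
```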
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Chart: ILSVRC winners by year, as above]
“Revolution of Depth”
Case Study: ResNet
[He et al., 2015]
Very deep networks using residual
connections
- 152-layer model for ImageNet
- ILSVRC’15 classification winner
(3.57% top 5 error)
- Swept all classification and
detection competitions in
ILSVRC’15 and COCO’15!
[Residual block diagram: input X passes through two layers to give F(x); the identity skip adds X, so the block outputs F(x) + x, followed by relu]
Case Study: ResNet
[He et al., 2015]
[Diagrams: “plain” layers fit H(x) directly; the residual block fits F(x) and outputs F(x) + x via an identity skip connection]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a
desired underlying mapping
relu
X
Residual block
relu
Case Study: ResNet
[He et al., 2015]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit a
desired underlying mapping
X
identity
F(x) + x
F(x)
relu
relu
X
Residual block
X
“Plain” layers
H(x)
Use layers to
fit residual
F(x) = H(x) - x
instead of
H(x) directly
H(x) = F(x) + x
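As an illustration, a minimal Keras sketch of one residual block (an assumption on my side: F(x) is two 3x3 convolutions with a channel count matching the input; this is not the exact ResNet-152 block):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Return relu(F(x) + x), where F is two 3x3 convs (the learned residual)."""
    shortcut = x                                        # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])                     # F(x) + x
    return layers.ReLU()(y)

# Usage on a feature map whose channel count matches `filters`:
inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = residual_block(inputs, filters=64)
block = tf.keras.Model(inputs, outputs)
```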
Comparing complexity... (figures from "An Analysis of Deep Neural Network Models for Practical Applications", 2017; copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017, reproduced with permission)
- Inception-v4: ResNet + Inception!
- VGG: highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, still memory heavy, lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy
Improving ResNets... network ensembling
"Good Practices for Deep Feature Fusion" [Shao et al. 2016]
- Multi-scale ensembling of Inception, Inception-ResNet, ResNet, and Wide ResNet models
- ILSVRC'16 classification winner
Improving ResNets... adaptive feature map reweighting
Squeeze-and-Excitation Networks (SENet) [Hu et al. 2017]
- Add a "feature recalibration" module that learns to adaptively reweight feature maps
- Global information (global avg. pooling layer) + 2 FC layers used to determine feature map weights
- ILSVRC'17 classification winner (using ResNeXt-152 as a base architecture)
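A minimal sketch of such a squeeze-and-excitation block in Keras (the reduction ratio of 16 is a common choice I am assuming; it is not given on the slide):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Reweight each feature map using global information + 2 FC layers."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                      # "squeeze": global context
    s = layers.Dense(channels // ratio, activation="relu")(s)   # FC layer 1
    s = layers.Dense(channels, activation="sigmoid")(s)         # FC layer 2 -> per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                            # "excitation": recalibrate the maps

# Usage on a feature map:
inputs = tf.keras.Input(shape=(28, 28, 64))
recalibrated = se_block(inputs)
```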
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners – completion of the challenge:
the annual ImageNet competition is no longer held after 2017; it has moved to Kaggle.
Summary: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- SENet
- NiN (Network in Network)
- Wide ResNet
- ResNeXt
- DenseNet
- FractalNet
- MobileNets
- NASNet
DATA AUGMENTATION
Why data augmentation? How do we reduce overfitting?
Summary: Data Augmentation
- Augmentation helps to prevent overfitting
- It makes the network invariant to certain transformations: translations, flips, etc.
- Can be done on-the-fly (see the sketch below)
- Can be used to learn image representations when no labeled datasets are available.
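A minimal sketch of on-the-fly augmentation with Keras preprocessing layers (assuming a recent TensorFlow; the flip, shift, and rotation ranges are illustrative, not taken from the slides):

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),      # random horizontal flips
    layers.RandomTranslation(0.1, 0.1),   # small random shifts
    layers.RandomRotation(0.05),          # slight random rotations
])

# Applied inside the input pipeline, so every epoch sees new random variants:
# dataset = dataset.map(lambda x, y: (augment(x, training=True), y))
```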
Recurrent Neural Networks
RNN, LSTM, GRU and BiLSTM
04
WHAT IS THE NEXT SCENE?
SHORTCOMINGS OF NORMAL NEURAL NETWORKS
For example, imagine you want to classify what kind of event is happening
at every point in a movie.
It’s unclear how a traditional neural network could use its reasoning
about previous events in the film to inform later ones.
RNN
RNNs address this issue: they have loops in them, which allow information to persist.
A chunk of neural network, A, looks at some input xt and outputs a value ht.
A loop allows information to be passed from one step of the network to the next.
RNN
▪ In traditional Networks, all inputs are independent of each other.
▪ RNN will make use of sequential information.
▪ If you want to predict the next word in a sentence you better know which
words came before it (A sequence of words → Predict the next word).
RNN…
▪ The most straightforward example is perhaps a time series of numbers,
where the task is to predict the next value given previous values.
▪ The input to the RNN at every time step is the current value as well as a state
vector which represents what the network has "seen" at time steps before.
▪ This state vector is the encoded memory of the RNN, initially set to zero (see the sketch below).
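A minimal Keras sketch of this setup (the window length, hidden size, and single-feature input are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW = 20  # number of past time steps fed to the network (assumed)

model = tf.keras.Sequential([
    # The recurrent layer's hidden state plays the role of the state vector,
    # initialized to zero and updated at every time step.
    layers.SimpleRNN(32, input_shape=(WINDOW, 1)),
    layers.Dense(1),  # predicted next value
])
model.compile(optimizer="adam", loss="mse")
```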
RNN…
▪ A recurrent neural network can be thought of as multiple copies of the same
network, each passing a message to a successor.
RNN – VARIOUS TYPES
(1) Processing without RNN, from fixed-sized input to fixed-sized output
Example: Image classification
RNN – VARIOUS TYPES
(2) Sequence output
Example: Image captioning
Takes an image and outputs a sentence of words.
Single Image → Sequence of words (Caption)
RNN – IMAGE CAPTIONING
RNN – VARIOUS TYPES
(3) Sequence input to a single output
Example: Sentiment Prediction
Takes a sequence of words (a sentence) and outputs a word (the sentiment).
Sentence (sequence of words) → Sentiment (word)
RNN – VARIOUS TYPES
(4) Sequence input and sequence output
Example: Machine Translation
An RNN reads a sentence in English and then outputs a sentence in French.
RNN – MACHINE TRANSLATION
Sequence of words (German) → Sequence of words (English)
The output only starts after we have seen the complete input, because the first word of
the translated sentence may require information captured from the complete
input sequence.
RNN – VARIOUS TYPES
(5) Synced sequence input and output
Example: Video Classification
Label each frame of the video
RNN – VARIOUS TYPES
RNN – notation: x = input, w = weights, a = activation, b = output
RNN – DISADVANTAGES
Inherently sequential
RNN – DISADVANTAGES
Can't persist long-term dependencies
Consider predicting the last word in "the clouds are in the sky":
we don't need any further context – it's pretty obvious the word is going to
be "sky". In such cases, where the gap between the relevant information and the place
that it's needed is small, RNNs can learn to use the past information.
But RNNs don't have memory cells:
the influence of x0 and x1 is lost by the time we reach x3.
RNN – DISADVANTAGES
Can't persist long-term dependencies
"I grew up in Bangalore… I speak fluent ? "
Ans: Kannada
Recent information ("I speak") suggests that the next word is probably the name of
a language.
But which language?
We need the context of Bangalore, from further back.
It's entirely possible for the gap between the relevant information and the point
where it is needed to become very large.
If the gap grows, RNNs cannot learn to connect the information.
LSTM – LONG SHORT-TERM MEMORY
▪ RNNs cannot capture long-term dependencies.
"I grew up in France… I speak fluent French."
We need the context of France… so we go for a new variant of RNN, i.e. the LSTM.
GATED RECURRENT UNITS (GRU)
BIDIRECTIONAL RNN (e.g. BiLSTM)
▪ The output at time t may not only depend on the previous elements in the
sequence, but also on future elements.
▪ Example: to predict a missing word in a sequence you want to look at both the left
and the right context.
▪ Two RNNs stacked on top of each other.
▪ Output computed based on the hidden state of both RNNs (see the sketch below).
▪ Huang Zhiheng, Xu Wei, Yu Kai. Bidirectional LSTM Conditional Random Field Models for Sequence Tagging.
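A minimal Keras sketch of a bidirectional LSTM sequence tagger in the spirit of the cited paper (vocabulary size, sequence length, embedding size, and tag count are illustrative assumptions, and the CRF layer from the paper is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_TAGS = 10_000, 50, 10   # hypothetical sizes

inputs = tf.keras.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, 64)(inputs)                          # word embeddings
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # left + right context
outputs = layers.Dense(NUM_TAGS, activation="softmax")(x)             # one tag per time step
tagger = tf.keras.Model(inputs, outputs)
tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```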
Autoencoders
Intro, Denoising AE, Convolutional AE
05
UNSUPERVISED LEARNING
Unsupervised Learning
Data: X (no labels!)
Goal: Learn the structure of the data
(learn correlations between features)
PCA – PRINCIPAL COMPONENT ANALYSIS
- Statistical approach for data
compression and visualization
- Invented by Karl Pearson in
1901
- Weakness: linear components
only.
TRADITIONAL AUTOENCODER
▪ Unlike PCA, we can now use activation functions to achieve non-linearity.
▪ It has been shown that an AE without activation functions achieves only the PCA capacity.
LATENT SPACE REPRESENTATION
(of a quality or state) existing but not yet developed or manifest;
hidden or concealed.
LATENT SPACE REPRESENTATION
Original → Latent Representation
In this fairly simple example, let's say our original data is a 5 x 5 x 1 image. We will
set the latent space size to 3 x 1, which means that the compressed data point is a vector
with 3 dimensions.
Now every 5x5x1 (25 numbers) data point is uniquely identified by just 3 numbers after compression.
LATENT SPACE REPRESENTATION
But you may be wondering, what are “similar” images and why does reducing the size of
the data make similar images “closer together” in space?
If we look at the three images above, two chairs of different color and a table, we will
easily say that the two images of the chair are more similar while the table is the most
different in shape.
LATENT SPACE REPRESENTATION
But what makes the other two chairs "alike"?
Of course we will consider related features such as angles, characteristics of shapes, etc.;
each such feature will be considered a data point represented in the latent space.
Irrelevant information, such as the colour of the seat, is ignored when mapping to the latent
space – in other words, we remove that information.
As a result, when we reduce the dimensionality, the images of the two chairs
become less different and more similar.
AE - SIMPLE IDEA
- Given data x (no labels) we would like to learn the functions f (encoder) and g (decoder) where:
f(x) = act(Wx + b) = z
g(z) = act(W'z + b') = x̂
such that h(x) = g(f(x)) = x̂,
where h is an approximation of the identity function.
(z is some latent representation or code, "act" is a non-linearity such as the sigmoid,
and x̂ is x's reconstruction.)
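A minimal dense-autoencoder sketch of this f/g formulation in Keras (the 784-dimensional input and 32-dimensional code are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

x_in = tf.keras.Input(shape=(784,))
z = layers.Dense(32, activation="sigmoid")(x_in)     # f(x) = act(Wx + b) = z
x_hat = layers.Dense(784, activation="sigmoid")(z)   # g(z) = act(W'z + b') = x_hat
dense_ae = tf.keras.Model(x_in, x_hat)               # h(x) = g(f(x)), x_hat is x's reconstruction
dense_ae.compile(optimizer="adam", loss="mse")
```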
CONVOLUTIONAL AE
Input: (28, 28, 1)
Encoder:
- Conv 1: 16 filters @ (3,3,1), same padding → C1 (28, 28, 16)
- MaxPool 1: (2,2), same → (14, 14, 16)
- Conv 2: 8 filters @ (3,3,16), same → C2 (14, 14, 8)
- MaxPool 2: (2,2), same → (7, 7, 8)
- Conv 3: 8 filters @ (3,3,8), same → C3 (7, 7, 8)
- MaxPool 3: (2,2), same → (4, 4, 8) = hidden code
Decoder:
- DConv 1: 8 filters @ (3,3,8), same → (4, 4, 8)
- UpSample 1: (2,2) → (8, 8, 8)
- DConv 2: 8 filters @ (3,3,8), same → (8, 8, 8)
- UpSample 2: (2,2) → (16, 16, 8)
- DConv 3: 16 filters @ (3,3,8), valid → (14, 14, 16)
- UpSample 3: (2,2) → (28, 28, 16)
- DConv 4: 1 filter @ (5,5), same → Output (28, 28, 1)
* Input values are normalized.
* All conv layers use relu activations except for the last conv, which uses sigmoid.
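A Keras sketch of this architecture (a minimal sketch assuming tf.keras; the kernel depth of the final 5x5 convolution is taken to match its 16-channel input):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(28, 28, 1))

# Encoder
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inp)    # (28,28,16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                       # (14,14,16)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (14,14,8)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                       # (7,7,8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (7,7,8)
code = layers.MaxPooling2D((2, 2), padding="same")(x)                    # (4,4,8) hidden code

# Decoder
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(code)    # (4,4,8)
x = layers.UpSampling2D((2, 2))(x)                                       # (8,8,8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)       # (8,8,8)
x = layers.UpSampling2D((2, 2))(x)                                       # (16,16,8)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="valid")(x)     # (14,14,16)
x = layers.UpSampling2D((2, 2))(x)                                       # (28,28,16)
out = layers.Conv2D(1, (5, 5), activation="sigmoid", padding="same")(x)  # (28,28,1)

autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```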
DENOISING AUTOENCODERS
Intuition:
- We still aim to encode the input and NOT to mimic the identity function.
- We try to undo the effect of a corruption process stochastically applied to the input.
Noisy input → Encoder → latent space representation → Decoder → denoised input.
A more robust model.
DENOISING AUTOENCODERS
Use Case:
- Extract a robust representation for a NN classifier:
noisy input → Encoder → latent space representation.
DENOISING AUTOENCODERS
Instead of trying to mimic the identity function by minimizing
L(x, g(f(x))),
where L is some loss function, a DAE instead minimizes
L(x, g(f(x̃))),
where x̃ is a copy of x that has been corrupted by some form of noise.
DENOISING AUTOENCODERS - PROCESS
1. Take some input x and apply noise to obtain x̃.
2. Feed x̃ through the DAE to encode and decode: g(f(x̃)) = x̂.
3. Compare the reconstruction x̂ with the original x.
DENOISING CONVOLUTIONAL AE – KERAS
- 50 epochs
- Noise factor 0.5
- 92% accuracy on the validation set (see the sketch below)
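A minimal sketch of the corresponding training setup (assumptions on my side: MNIST via tf.keras.datasets, Gaussian corruption with the stated noise factor of 0.5, and the `autoencoder` model sketched in the convolutional AE section above):

```python
import numpy as np
import tensorflow as tf

(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0   # normalize to [0,1], add channel dim
x_test = x_test.astype("float32")[..., None] / 255.0

noise_factor = 0.5
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0.0, 1.0)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0.0, 1.0)

# Inputs are the noisy images, targets are the clean originals.
autoencoder.fit(x_train_noisy, x_train, epochs=50, batch_size=128,
                validation_data=(x_test_noisy, x_test))
```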
DL Tools / Frameworks
Keras, Tensorflow
06
MACHINE LEARNING TOOLS/FRAMEWORKS
Caffe is a deep learning framework made with expression, speed, and modularity in
mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well
as community contributors and is popular for computer vision.
Caffe supports cuDNN v5 for GPU acceleration.
Supported interfaces: C, C++, Python.
Caffe2 is a deep learning framework enabling simple and flexible deep learning. Built
on the original Caffe, Caffe2 is designed with expression, speed, and modularity in
mind, allowing for a more flexible way to organize computation.
Caffe2 supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: C++, Python
MACHINE LEARNING TOOLS/FRAMEWORKS
The Microsoft Cognitive Toolkit – previously known as CNTK – is a unified deep-learning
toolkit from Microsoft Research that makes it easy to train and combine popular
model types across multiple GPUs and servers. The Microsoft Cognitive Toolkit
implements highly efficient CNN and RNN training for speech, image and text data.
The Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: Python, C++, C# and a command line interface
TensorFlow is a software library for numerical computation using data flow graphs,
developed by Google’s Machine Intelligence research organization.
TensorFlow supports cuDNN v5.1 for GPU acceleration.
Supported interfaces: C++, Python
MACHINE LEARNING TOOLS/FRAMEWORKS
Theano is a math expression compiler that efficiently defines, optimizes, and evaluates
mathematical expressions involving multi-dimensional arrays.
Theano supports cuDNN v5 for GPU acceleration.
Supported interfaces: Python
Keras is a minimalist, highly modular neural networks library, written in Python, and
capable of running on top of either TensorFlow or Theano. Keras was developed with
a focus on enabling fast experimentation.
cuDNN version depends on the version of TensorFlow and Theano installed with
Keras.
Supported Interfaces: Python
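For flavour, a tiny model written in the Keras style described above (a minimal sketch; the 784-dimensional input and layer sizes are arbitrary assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer
    layers.Dense(10, activation="softmax"),                    # 10-class output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```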
THANK YOU!
QUESTIONS?
DeepLearning.pdf

  • 1. Ramesh Naidu L Joint Director Big Data Analytics & Machine Learning C-DAC, Bangalore Deep Learning
  • 2. Outline 2 Introduction What is Neural Network? 01 Deep Neural Network DNN and its components 02 CNN Convolutional Neural Networks & Transfer Learning 03 Autoencoders Latent space networks 04 RNN Recurrent Neural Networks & Its variants 05 Applications of DL Applications and future of DL 06
  • 3. Introduction to Deep Learning What is Deep Learning? 01
  • 6. Deep learning isn’t magic. But, it is very good at finding patterns…to make predictions.
  • 7. Andrew Ng Input Input space Feature space Motorbikes “Non”-Motorbikes Feature Extraction Machine Learning algorithm pixel 1 pixel 2 “wheel” “handle” handle wheel What is Machine Learning?
  • 9. THE BRAIN AND DEEP LEARNING HISTORY
  • 10. Dendrites (InputVector) Soma (weighted Summation Function of positive and Negative signals from Dendrites) Axon (Output) (Transmits the output Signal to other neurons) axon soma dendrites Image Credits: https://en.wikipedia.org/wiki/Neuron NEURAL NETWORK (BIOLOGY)
  • 22. [Deep] Neural Networks Neuron to Deep Networks 02
  • 23. ALGORITHMS/NETWORKS ▪ Single Perceptron ▪ Multi Layer Perceptron (MLP) ▪ Convolution Neural Networks (CNN and its variants) ▪ Recurrent Neural Networks (RNN, LSTM, GRU) ▪ Auto encoders (For Dimension Reduction, Unsupervised))
  • 25. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? ▪ Will the weather be nice? ▪ What’s the music like? ▪ Do I have anyone to go with? ▪ Can I afford it? ▪ Do I need to write my exam? ▪ Will I like the food in the food stalls at stadium? Just take the first four for simplicity
  • 26. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? To make my decision, I would consider how important each factor was to me, weigh them all up, and see if the result was over a certain threshold. If so, I will go to the match! It’s a bit like the process we often use of weighing up pros and cons to make a decision.
  • 27. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Give some weights…and scores…
  • 28. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Scores: • weather is looking pretty good but not perfect - 3 out of 4. • music is ok but not my favourite - 2 out of 4. • Company - my best friend has said she’ll come with me so I know the company will be great, - 4 out of 4. • Money - little pricey but not completely unreasonable - 2 out of 4.
  • 29. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Now for the importance: • Weather - I really want to go if it’s sunny, very imp - I give it a full 4 out of 4. • Music - I’m happy to listen to most things, not super important - 2 out of 4. • Company - I wouldn’t mind too much going to the match on my own, so company can have a 2 out of 4 for importance. • Money - I’m not too worried about money, so I can give it just a 1 out of 4.
  • 30. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? Total Score = Sum of (Score x Weight) = 26 Threshold = 25 Score (26) >Threshold (25) → Going to Cricket Match
  • 31. WHAT HAPPENS IN A NEURON? Problem:Whether to go to a cricket match or not? This number is what is known as the ‘bias’
  • 32. WHAT HAPPENS IN A NEURON? A general form of a Neuron/Single Perceptron
  • 33. A PERCEPTRON – GENERAL FORM • The basic unit of computation in a neural network is the neuron, often called a node or unit. • receives input from some other nodes, or from an external source and computes an output. • Each input has an associated weight (w), which is assigned on the basis of its relative importance • The node applies a function f (defined below) to the weighted sum of its inputs
  • 34. A PERCEPTRON…HOW IT WORKS? 1. All the inputs x are multiplied with their weights w. Let’s call it k. Fig: Multiplying inputs with weights for 5 inputs
  • 35. A PERCEPTRON…HOW IT WORKS? 2. Add all the multiplied values and call them Weighted Sum. Fig:Adding with Summation
  • 36. WHY DO WE NEED WEIGHTS AND BIAS? • Weights shows the strength of the particular node/input. • A bias value allows you to shift the activation function to the left or right.
  • 37. WHY DO WE NEED WEIGHTS AND BIAS? • Weights shows the strength of the particular node/input. • A bias value allows you to shift the activation function to the left or right. without the bias the ReLU-network is not able to deviate from zero at (0,0).
  • 38. Why do we need an activation function??
  • 39. WHY DO WE NEED “ACTIVATION FUNCTION”? For example, if we simply want to estimate the cost of a meal, we could use a neuron based on a linear function f(z) = wx + b. Using (z) = z and weights equal to the price of each item, the linear neuron would take in the number of burgers, fries, and sodas and output the price of the meal. The above kind of linear neuron is very easy for computation. However, by just having an interconnected set of linear neurons, we won’t be able to learn complex relationships: taking multiple linear functions and combining them still gives us a linear function, and we could represent the whole network by a simple matrix transformation.
  • 40. WHY DO WE NEED “ACTIVATION FUNCTION”? Real world data is mostly non-linear, so this is where the Activation functions comes for help. Activation functions introduce non-linearity into the output of a neuron, and this ensures that we can learn more complex functions by approximating every non-linear function as a linear combination of a large number of non-linear functions. Also, activation functions help us to limit the output to a certain finite value.
  • 41. WHY DO WE NEED “ACTIVATION FUNCTION”? • Their main purpose is to convert a input signal of a node in a A-NN to an output signal. • In short, the activation functions are used to map the input between the required values like (0, 1) or (-1, 1). • It is also known as Transfer Function. • They introduce non-linear properties to our Network. • That output signal now is used as a input in the next layer in the stack. The Activation Functions can be basically divided into 2 types- • Linear Activation Function • Non-linear Activation Functions
  • 43. SIGMOID • Sigmoid is a special case of logistic L = 1, k = 1 and xo = 0 • Takes a real number and transforms it to → σ(x)∈(0,1). • Large Negative → 0 • Large Positive → 1 Sigmoidal → S-shaped Logistic Sigmoid Derivative of Sigmoid (dotted Line)
  • 44. SIGMOID – DISADVANTAGES • Sigmoids saturate and kill gradients 1) Sigmoids saturate and kill gradients • neuron’s activation saturates at either tail of 0 or 1 No gradient → No signal to neuron Too large weights → Neurons saturate → No learning 2) Output is not zero centred After 6 or -6, gradient of sigmoid is becoming Zero (Vanishing Gradient Problem)
  • 45. HYPERBOLICTANGENT FUNCTION - TANH • It is also sigmoidal • Mathematically, tanh(x) = 2σ(2x)−1 Transforms data to -1 to 1 tanh(x)∈(−1,1). • Output is zero centred • It also saturates and kills gradient
  • 46. RELU (RECTIFIED LINEAR UNIT) ACTIVATION FUNCTION R(x) = max(0,x) ➔ R(x) = 0 if x <= 0 R(x) = x if x > 0 • The ReLU is the most used activation function • Range: [ 0 to infinity) • More effective than sigmoidal functions • Disadvantage: any negative input given to the ReLU activation function turns the value into zero immediately in the graph, which in turns affects the resulting graph by not mapping the negative values appropriately. R(x) = max(0,x) ➔ R(x) = 0 if x <= 0 R(x) = x if x > 0
  • 47. LEAKY RELU fix the “dying ReLU” problem by introducing a Small negative slope (0.01 or so) (x) = 0.01x, if (x<0) f(x) = x, if (x≥0) - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - will not “die”. [Mass et al., 2013] [He et al., 2015]
  • 48. RANDOMIZED RELU • Fix the problem of dying neurons • It introduces a small slope to keep the updates alive. • The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so. • When a is not 0.01 then it is called Randomized ReLU.
  • 49. SOFTMAX • Transforms a real valued vector to [0, 1] • Output of softmax is the list of probabilities of k different classes/categories j = 1, 2, ..., K.
  • 50. SOFTMAX – CLASSIFICATION OF DIGITS • Transforms a real valued vector to [0, 1] • Output of softmax is the list of probabilities of 10 digits
  • 51. WHICH ACTIVATION FUNCTIONTO USE? ▪ use ReLu which should only be applied to the hidden layers. ▪ And if our Model suffers form dead neurons during training we should use leaky ReLu. ▪ It’s just that Sigmoid and Tanh should not be used nowadays due to the vanishing Gradient Problem which causes a lots of problems to train a Neural Network Model. ▪ Softmax is used in the last layer (Because it gives output probabilities).
  • 53. SINGLE NEURON TO A NETWORK - NEURAL NETWORK
  • 54. SINGLE NEURONTO A NETWORK • The outputs of the neurons in one layer flow through to become the inputs of the next layer, and so on.
  • 55. A NETWORKTO A DEEP NETWORK Add another hidden layer…. a neural network can be considered ‘deep’ if it contains more hidden layers...
  • 56. FEEDFORWARD NEURAL NETWORK Add another hidden layer…. a neural network can be considered ‘deep’ if it contains more hidden layers...
  • 57. TRAININGTHE NETWORK 1. Randomly initialise the network weights and biases 2. Training: • Get a ton of labelled training data (e.g. pictures of cats labelled ‘cats’ and pictures of other stuff also correctly labelled) • For every piece of training data, feed it into the network 3. Testing: • Check whether the network gets it right (given a picture labelled ‘cat’, is the result of the network also ‘cat’ or is it ‘dog’, or something else?) • If not, how wrong was it? Or, how right was it? (What probability did it assign to its guess?) 4. Tuning /Improving: Nudge the weights a little to increase the probability of the network more probability getting the answer right.
  • 58. EXAMPLE – CLASSIFYTHE IMAGES –TRUMP & HILLORY
  • 59. TRAININGTHE NETWORK Predict 72 % Trump 28 % Hillory 1. Two output neurons (for each class one neuron) 2. O/P: A probability of 72% that the image is “Trump” → 72 % is insufficient
  • 60. HOWTO INCREASETHE ACCURACY? Predict 72 % Trump 28 % Hillory 1. Two output neurons for each class 2. O/P: A probability of 72% that the image is “Trump” → 72 % is Not sufficient Go Backwards & Adjust weights
  • 61. HOWTO INCREASETHE ACCURACY? ▪ Measure the network’s output and the correct output using the ‘loss function’. ▪ Which Loss Function?? ▪ Depends on the data ▪ Example: Mean Square Error (MSE), Categorical Cross Entropy
  • 62. HOWTO INCREASETHE ACCURACY? ▪ Measure the network’s output and the correct output using the ‘loss function’. ▪ Which Loss Function?? ▪ Depends on the data ▪ Example: Mean Square Error The goal of training is to find the weights and biases that minimizes the loss function. Image source: firsttimeprogrammer.blogspot.co.u
  • 63. HOWTO INCREASETHE ACCURACY? ▪ We can plot the loss against the weights. ▪ To do this accurately, we would need to be able to visualise tons of dimensions, to account for the many weights and biases in the network. ▪ Because I find it difficult to visualise more than three dimensions, ▪ Let’s pretend we only need to find two weight values.We can then use the third dimension for the loss. ▪ Before training the network, the weights and biases are randomly initialised so the loss function is likely to be high as the network will get a lot of things wrong. ▪ Our aim is to find the lowest point of the loss function, and then see what weight values that corresponds to. (See Next Slide).
  • 64. HOWTO INCREASETHE ACCURACY? ▪ Here, we can easily see where the lowest point is and could happily read off the corresponding weights values. ▪ Unfortunately, it’s not that easy in reality. The network doesn’t have a nice overview of the loss function, it can only know what its current loss is and its current weights and biases are. Image source: firsttimeprogrammer.blogspot.co.u
  • 65. HOWTO INCREASETHE ACCURACY – GRADIENT DESCENT Image source: firsttimeprogrammer.blogspot.co.u
  • 67. BASIC UNITS OF NEURAL NETWORK Input Layer: Input variables, sometimes called the visible layer. Hidden Layers: Layers of nodes between the input and output layers. There may be one or more of these layers. Output Layer: A layer of nodes that produce the output variables. Things to describe the shape and capability of a neural network: ▪ Size: The number of nodes in the model/network. ▪ Width: The number of nodes in a specific layer. ▪ Depth: The number of layers in a neural network (We don’t count input as a layer). ▪ Capacity: The type or structure of functions that can be learned by a network configuration. Sometimes called “representational capacity“. ▪ Architecture: The specific arrangement of the layers and nodes in the network.
  • 71. Learning Rate (LR) • Deep learning neural networks are trained using the stochastic gradient descent algorithm. • An optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm • Learning Rate: The amount that the weights are updated during training is referred to as the step size or the “learning rate.” • Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
  • 72. Learning Rate • The learning rate controls how quickly the model is adapted to the problem. • Smaller learning rates → Require more training epochs given the smaller changes made to the weights each update • Larger learning rates → Result in rapid changes and require fewer training epochs. • We have to choose the learning rate carefully The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
  • 74. Learning Rate – How to find optimal LR? Gradually increase the learning rate after each mini batch, recording the loss at each increment. This gradual increase can be on either a linear or exponential scale. a) For learning rates which are too low, the loss may decrease, but at a very shallow rate. b) When entering the optimal learning rate zone, you'll observe a quick drop in the loss function. c) Increasing the learning rate further will cause an increase in the loss as the parameter updates cause the loss to "bounce around" and even diverge from the minima.
  • 75. Learning Rate – How to find optimal LR?
  • 76. Learning Rate – How to find optimal LR? Reference: https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep- learning-neural-networks/
  • 78. 1. Simple Gradient Descent When we initialize our weights, we are at point A in the loss landscape. The first thing we do is to check, out of all possible directions in the x-y plane, moving along which direction brings about the steepest decline in the value of the loss function. This is the direction we have to move in.
  • 79. Problems with SGD – Local Minima Gradient descent is driven by the gradient, which will be zero at the base of any minima. Local minimum are called so since the value of the loss function is minimum at that point in a local region. A global minima is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function.
  • 80. Optimization: Problems with SGD – Saddle Points • A saddle point gets it's name from the saddle of a horse with which it resembles. • While it's a minima in one direction (x), it's a local maxima in another direction, and if the contour is flatter towards the x direction, GD would keep oscillating to and fro in the y - direction, and give us the illusion that we have converged to a minima. https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
  • 81. Optimization: Problems with SGD – Local Minima A global minima is called so since the value of the loss function is minimum there, globally across the entire domain of the loss function. A Complicated Loss Landscape
  • 82. Optimization: Problems with SGD What if loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along shallow dimension, jitter along steep direction
  • 83. Optimization: Problems with SGD – All at one place Local Minima Saddle points Poor Conditioning
  • 84. 2. SGD + Momentum try to capture some information regarding the previous updates a weight has gone through before performing the current update. if a weight is constantly moving in a particular direction (increasing or decreasing), it will slowly accumulate some “momentum” in that direction. when it faces some resistance and actually has to go the opposite way, it will still continue going in the original direction for a while because of the accumulated momentum.
  • 85. 2. SGD + Momentum Example: • This is comparable to the actual momentum in physics. Let’s consider a small valley and a ball. • If we release the ball at one side of the valley, it will continuously roll down and in that process it is gaining momentum towards that direction. • Finally when the ball reaches the bottom, it doesn’t stop there. • Instead it starts rolling up on the opposite side for a while since it has gained some momentum even though gravity is telling it to come down.
  • 86. 3. Nesterov Momentum In this, what we try to do is instead of calculating the gradients at the current position we try to calculate it from an approximate future position. This is because we want to try to calculate our gradients in a smarter way. Just before reaching a minima, the momentum will start reducing before it reaches it because we are using gradients from a future point. This results in improved stability and lesser oscillations while converging, furthermore, in practice it performs better than pure momentum.
  • 87. 3. Nesterov Momentum Momentum full formula expanded Calculating approximate future weight value
  • 88. 4. AdaGrad (Adaptive Optimization) In these methods, the parameters like learning rate (alpha) and momentum coefficient (n) will not stay constant throughout the training process. Instead, these values will constantly adapt for each and every weight in the network and hence will also change along with the weights. The first algorithm we are going to look into is Adagrad (Adaptive Gradient). try to change the learning rate (alpha) for each update. The learning rate changes during each update in such a way that it will decrease if a weight is being updated too much in a short amount of time and it will increase if a weight is not being updated much.
  • 89. 4. AdaGrad (Adaptive Optimization) In these methods, the parameters like learning rate (alpha) and momentum coefficient (n) will not stay constant throughout the training process. Instead, these values will constantly adapt for each and every weight in the network and hence will also change along with the weights. The first algorithm we are going to look into is Adagrad (Adaptive Gradient). try to change the learning rate (alpha) for each update. The learning rate changes during each update in such a way that it will decrease if a weight is being updated too much in a short amount of time and it will increase if a weight is not being updated much.
  • 90. 4. AdaGrad (Adaptive Optimization) Each weight has its own cache value, which collects the squares of the gradients till the current point. The cache will continue to increase in value as the training progresses. Now the new update formula is as follows:
  • 91. 5. RMSProp In RMSProp the only difference lies in the cache updating strategy. In the new formula, we introduce a new parameter, the decay rate (gamma). gamma value is usually around 0.9 or 0.99. Hence for each update, the square of gradients get added at a very slow rate compared to adagrad. This ensures that the learning rate is changing constantly based on the way the weight is being updated, just like adagrad, but at the same time the learning rate does not decay too quickly, hence allowing training to continue for much longer.
  • 92. 6. Adam Adam is a little like combining RMSProp with Momentum. First we calculate our m value, which will represent the momentum at the current point. Adam Momentum Update Formula The only difference between this equation and the momentum equation is that instead of the learning rate we keep (1-Beta_1) to be multiplied with the current gradient. Next we will calculate the accumulated cache, which is exactly the same as it is in RMSProp: Now we can get the final update formula:
  • 93. 6. Adam • As we can observe here, we are performing accumulating the gradients by calculating momentum and also we are constantly changing the learning rate by using the cache. • Due to these two features, Adam usually performs better than any other optimizer out there and is usually preferred while training neural networks. • In the paper for Adam, the recommended parameters are 0.9 for beta_1, 0.99 for beta_2 and 1e-08 for epsilon.
  • 94. - Adam is a good default choice in many cases - SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning In practice:
  • 96. HOW A NEURON WORKS? 1. Each neuron computes a weighted sum of the outputs of the previous layer (which, for the inputs, we’ll call x), 2. applies an activation function, before forwarding it on to the next layer. 3. Each neuron in a layer has its own set of weights—so while each neuron in a layer is looking at the same inputs, their outputs will all be different.
  • 97. HOW A NETWORK WORKS? 1. we want to find a set of weights W that minimizes on the output of J(W). 2. Consider a simple double layer Network. 3. each neuron is a function of the previous one connected to it. In other words, if one were to change the value of w1, both “hidden 1” and “hidden 2” (and ultimately the output) neurons would change.
  • 98. HOW A NETWORK WORKS? we can mathematically formulate the output as an extensive composite function:
  • 99. INTRODUCE ERROR we can mathematically formulate the output as an extensive composite function: J is a function of → input, weights, activation function and output
  • 100. HOWTO MINIMIZETHE ERROR – BACK PROPAGATION The error derivative can be used in gradient descent (as discussed earlier) to iteratively improve upon the weight. We compute the error derivatives w.r.t. every other weight in the network and apply gradient descent in the same way.
  • 101. BACK PROPAGATION Find out the best “w1” by reducing the error from the last layer to first layer….
  • 102. BACK PROPAGATION How Backpropagation works? Refer the document – backprop.docx
  • 103. Convolutional Neural Networks CNN, Transfer Learning etc. 03
  • 104. Fast-forward to today: ConvNets are everywhere Classification Retrieval Figurescopyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012.Reproduced with permission.
  • 105. Fast-forwardTo Today: Convnets Are Everywhere Figurescopyright ShaoqingRen,Kaiming He,RossGirschick,JianSun,2015.Reproduced with permission. [Faster R-CNN: Ren, He, Girshick, Sun 2015] Detection Segmentation [Farabet et al., 2012] Figurescopyright ClementFarabet, 2012. Reproduced with permission.
  • 107. CNN – STEP 1: BREAKTHE IMAGE INTO OVERLAPPING IMAGETILES
  • 108. CNN - STEP 2: FEED EACH IMAGETILE INTO A SMALL NEURAL NETWORK
  • 109. CNN - STEP 3: SAVETHE RESULTS FROM EACHTILE INTO A NEW ARRAY
  • 110. CNN - STEP 4: DOWN SAMPLING The result of Step 3 was an array that maps out which parts of the original image are the most interesting. But that array is still pretty big.
  • 111. CNN - STEP 4: DOWN SAMPLING… To reduce the size of the array, we down sample it using an algorithm called max pooling. It sounds fancy, but it isn’t at all! We’ll just look at each 2x2 square of the array and keep the biggest number:
  • 112. CNN – 5 STEP PIPELINE So from start to finish, our whole five-step pipeline looks like this:
  • 113. CNN – Layers/Components ▪ Convolution Layer ▪ Pooling Layer ▪ Fully Connected Layer
  • 115. X or O CNN A two-dimensional array of pixels
  • 118. ?
  • 119. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ?
  • 120. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 X -1 -1 -1 -1 X X -1 -1 X X -1 -1 X X -1 -1 -1 -1 X 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 X -1 -1 -1 -1 X X -1 -1 X X -1 -1 X X -1 -1 -1 -1 X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  • 121. -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 x
  • 122. = = =
  • 124. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1 Features match pieces of the image
  • 125. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 126. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 127. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 128. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 129. 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 1 -1 1 -1 -1 1 -1 1 -1 1 -1 1 -1 1
  • 130.
  • 132–146. [Convolution walkthrough: a 3x3 feature is slid over every position of the 9x9 image; at each position the feature and the underlying image patch are multiplied element-wise and averaged, giving a match score between −1 and 1 (e.g. 1.00 for a perfect match, 0.55 for a partial match)]
  • 147–148. [Each feature produces a 7x7 map of match scores: a filtered image]
  • 149. One image becomes a stack of filtered images.
  • 150. [The stack of three 7x7 filtered images produced by the three features]
  • 153–158. [Max pooling walkthrough: a 2x2 window is stepped across the 7x7 filtered image and only the largest value in each window is kept]
  • 159. [Pooling applied to all three filtered images gives three smaller 4x4 maps]
  • 160. A stack of images becomes a stack of smaller images.
  • 161. Change everything negative to zero → ReLU
  • 162. ReLU
  • 163–166. [ReLU walkthrough: every negative value in a filtered image is replaced by 0 (e.g. −0.11 → 0), while positive values such as 0.77 pass through unchanged]
  • 167. [ReLU applied to all three filtered images]
  • 168. [Layers stack: the convolution, ReLU and pooling outputs of one stage become the input of the next]
  • 169. [After a second round of convolution, ReLU and pooling, the image is reduced to small 2x2 feature maps]
  • 171. [The final stack of 2x2 feature maps (values such as 1.00 and 0.55)]
  • 182. [Fully connected layer: the values from the final feature maps vote for each output class, X or O]
  • 184. Transfer Learning Know how to ride a motorbike ⮫ Learn how to ride a car
  • 186. Pre-trained models as feature extractors
  • 187. Transfer Learning with CNNs. 1. Train on ImageNet: a VGG-style stack (Conv-64 x2, MaxPool, Conv-128 x2, MaxPool, Conv-256 x2, MaxPool, Conv-512 x2, MaxPool, Conv-512 x2, MaxPool, FC-4096, FC-4096, FC-1000). 2. Small dataset (C classes): freeze the pre-trained layers, reinitialize only the final FC-C layer and train it. 3. Bigger dataset: freeze the lower layers but train more of the top layers (FC-C, FC-4096, FC-4096, and possibly the top conv block). Lower the learning rate when fine-tuning; 1/10 of the original LR is a good starting point. Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014; Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014.
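A hedged Keras sketch of this recipe (VGG16 as the frozen ImageNet base, a reinitialized head for C classes, and roughly 1/10 of the learning rate when fine-tuning; the dataset, C and the exact learning rates are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

C = 10  # number of classes in the small target dataset (assumption)

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                                   # "Freeze these"

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
outputs = layers.Dense(C, activation="softmax")(x)       # "Reinitialize this and train"
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

# With a bigger dataset: unfreeze the top conv block and fine-tune at ~1/10 of the LR.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```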
  • 189. CNN Architectures (Pre-Trained Models) Examples: (Available in Keras) ▪ Xception ▪ VGG16 ▪ VGG19 ▪ ResNet, ResNetV2 ▪ InceptionV3 ▪ InceptionResNetV2 ▪ MobileNet, MobileNetV2 ▪ DenseNet ▪ NASNet Find them @ https://keras.io/applications/
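For instance, loading one of these pre-trained models and using it as a fixed feature extractor might look like this (a sketch; "example.jpg" is a placeholder path):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")  # 512-d features

img = keras.preprocessing.image.load_img("example.jpg", target_size=(224, 224))
x = keras.preprocessing.image.img_to_array(img)
x = preprocess_input(np.expand_dims(x, axis=0))   # VGG-specific preprocessing
features = extractor.predict(x)                   # shape (1, 512): reusable image features
```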
  • 190. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Architecture : CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
  • 191. Case Study: AlexNet [Krizhevsky et al. 2012] Input: 227x227x3 images First layer (CONV1): 96 11x11 filters applied at stride 4 => Q: what is the output volume size? Hint: (227-11)/4+1 = 55 (N – F)/S + 1
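A quick worked check of the (N − F)/S + 1 formula, which also answers the parameter question asked on the next slide (a small calculation helper for illustration, not code from the slides):

```python
def conv_output_size(N, F, S, P=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=227, F=11, S=4))      # 55 -> output volume is 55x55x96

# CONV1 parameters: each of the 96 filters has 11*11*3 weights plus one bias
print(96 * (11 * 11 * 3 + 1))                  # 34,944 (~35K)
```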
  • 192. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Input: 227x227x3 images First layer (CONV1): 96 11x11 filters applied at stride 4 => Output volume [55x55x96] Q: What is the total number of parameters in this layer?
  • 193. Case Study: AlexNet [Krizhevsky et al. 2012] Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Input: 227x227x3 images After CONV1: 55x55x96 Second layer (POOL1): 3x3 filters applied at stride 2 Q: what is the output volume size? Hint: (55-3)/2+1 = 27
  • 194. Case Study: AlexNet [Krizhevsky et al. 2012] Input: 227x227x3 images After CONV1: 55x55x96 After POOL1: 27x27x96 ...
  • 195. Case Study: AlexNet [Krizhevsky et al. 2012] [4096] FC6: 4096 neurons [4096] FC7: 4096 neurons [1000] FC8: 1000 neurons (class scores) Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Full (simplified) AlexNet architecture: [227x227x3] INPUT [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 [27x27x96] MAX POOL1: 3x3 filters at stride 2 [27x27x96] NORM1: Normalization layer [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 [13x13x256] MAX POOL2: 3x3 filters at stride 2 [13x13x256] NORM2: Normalization layer [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 [6x6x256] MAX POOL3: 3x3 filters at stride 2
  • 196. Case Study: AlexNet [Krizhevsky et al. 2012] [4096] FC6: 4096 neurons [4096] FC7: 4096 neurons [1000] FC8: 1000 neurons (class scores) Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission. Full (simplified) AlexNet architecture: [227x227x3] INPUT [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 [27x27x96] MAX POOL1: 3x3 filters at stride 2 [27x27x96] NORM1: Normalization layer [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 [13x13x256] MAX POOL2: 3x3 filters at stride 2 [13x13x256] NORM2: Normalization layer [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 [6x6x256] MAX POOL3: 3x3 filters at stride 2 Details/Retrospectives: -first use of ReLU -used Norm layers (not common anymore) -heavy data augmentation -dropout 0.5 -batch size 128 -SGD Momentum 0.9 -Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
  • 197. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners (shallow → 8 → 8 → 19 → 22 → 152 → 152 → 152 layers): Lin et al., Sanchez & Perronnin, Krizhevsky et al. (AlexNet), Zeiler & Fergus, Simonyan & Zisserman (VGG), Szegedy et al. (GoogLeNet), He et al. (ResNet), Shao et al., Hu et al. (SENet); Russakovsky et al. AlexNet: first CNN-based winner.
  • 198. ILSVRC winners (same chart as above). ZFNet: improved hyperparameters over AlexNet.
  • 199. ZFNet [Zeiler and Fergus, 2013] AlexNet but: CONV1: change from (11x11 stride 4) to (7x7 stride 2) CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512 ImageNet top 5 error: 16.4% -> 11.7%
  • 200. ILSVRC winners (same chart as above). Deeper networks.
  • 201. Case Study: VGGNet [Simonyan and Zisserman, 2014] Small filters, Deeper networks 8 layers (AlexNet) -> 16 and 19 layers (VGG16, VGG19) Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2 11.7% top 5 error in ILSVRC’13 (ZFNet) -> 7.3% top 5 error in ILSVRC’14 AlexNet VGG16 VGG19
  • 202. VGG16 layer by layer:
    INPUT: [224x224x3] memory: 224*224*3=150K, params: 0 (not counting biases)
    CONV3-64: [224x224x64] memory: 224*224*64=3.2M, params: (3*3*3)*64 = 1,728
    CONV3-64: [224x224x64] memory: 224*224*64=3.2M, params: (3*3*64)*64 = 36,864
    POOL2: [112x112x64] memory: 112*112*64=800K, params: 0
    CONV3-128: [112x112x128] memory: 112*112*128=1.6M, params: (3*3*64)*128 = 73,728
    CONV3-128: [112x112x128] memory: 112*112*128=1.6M, params: (3*3*128)*128 = 147,456
    POOL2: [56x56x128] memory: 56*56*128=400K, params: 0
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*128)*256 = 294,912
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
    CONV3-256: [56x56x256] memory: 56*56*256=800K, params: (3*3*256)*256 = 589,824
    POOL2: [28x28x256] memory: 28*28*256=200K, params: 0
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*256)*512 = 1,179,648
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [28x28x512] memory: 28*28*512=400K, params: (3*3*512)*512 = 2,359,296
    POOL2: [14x14x512] memory: 14*14*512=100K, params: 0
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512] memory: 14*14*512=100K, params: (3*3*512)*512 = 2,359,296
    POOL2: [7x7x512] memory: 7*7*512=25K, params: 0
    FC: [1x1x4096] memory: 4096, params: 7*7*512*4096 = 102,760,448
    FC: [1x1x4096] memory: 4096, params: 4096*4096 = 16,777,216
    FC: [1x1x1000] memory: 1000, params: 4096*1000 = 4,096,000
    TOTAL memory: 24M * 4 bytes ~= 96MB / image (forward only; ~2x for backward)
    TOTAL params: 138M parameters
    Note: Most memory is in the early CONV layers; most params are in the late FC layers.
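The per-layer counts above follow from two one-line formulas: conv params = (k·k·C_in)·C_out and FC params = n_in·n_out (biases ignored, as in the slide). A small helper, for illustration only:

```python
def conv_params(k, c_in, c_out):
    return (k * k * c_in) * c_out        # weights of a k x k conv layer (no biases)

def fc_params(n_in, n_out):
    return n_in * n_out                  # weights of a fully connected layer (no biases)

print(conv_params(3, 3, 64))             # 1,728       (first CONV3-64)
print(conv_params(3, 512, 512))          # 2,359,296   (a late CONV3-512)
print(fc_params(7 * 7 * 512, 4096))      # 102,760,448 (the first FC layer dominates the total)
```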
  • 203. [Same VGG16 memory/parameter breakdown as the previous slide, annotated with the common names of the layer groups]
  • 204. ILSVRC winners (same chart as above). Deeper networks.
  • 205. Case Study: GoogLeNet [Szegedy et al., 2014] Inception module - 22 layers - Efficient “Inception” module - No FC layers - Only 5 million parameters! 12x less than AlexNet - ILSVRC’14 classification winner (6.7% top 5 error) Deeper networks, with computational efficiency
  • 206. Case Study: GoogLeNet [Szegedy et al., 2014] “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other Inception module
  • 207. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module Apply parallel filter operations on the input from previous layer: - Multiple receptive field sizes for convolution (1x1, 3x3, 5x5) - Pooling operation (3x3) Concatenate all filter outputs together depth-wise
  • 208. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module Apply parallel filter operations on the input from previous layer: - Multiple receptive field sizes for convolution (1x1, 3x3, 5x5) - Pooling operation (3x3) Concatenate all filter outputs together depth-wise Q: What is the problem with this? [Hint: Computational complexity]
  • 209. Case Study: GoogLeNet [Szegedy et al., 2014] Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256 Naive Inception module
  • 210. Case Study: GoogLeNet [Szegedy et al., 2014] Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256 Naive Inception module Q1: What is the output size of the 1x1 conv, with 128 filters?
  • 211. Case Study: GoogLeNet [Szegedy et al., 2014] Naive Inception module. Q: What is the problem with this? [Hint: Computational complexity] Example: Module input: 28x28x256. Q1: What is the output size of the 1x1 conv, with 128 filters? → 28x28x128
  • 212–213. Q2: What are the output sizes of all the different filter operations? → 28x28x128, 28x28x192, 28x28x96, 28x28x256
  • 214–215. Q3: What is the output size after filter concatenation? → 28x28x(128+192+96+256) = 28x28x672
  • 216. Conv ops: [1x1 conv, 128] 28x28x128x1x1x256; [3x3 conv, 192] 28x28x192x3x3x256; [5x5 conv, 96] 28x28x96x5x5x256. Total: 854M ops
  • 217. Very expensive compute.
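The 854M figure can be checked directly (multiply-accumulates per conv ≈ H × W × C_out × k × k × C_in); a small calculation for illustration:

```python
H, W, C_in = 28, 28, 256                     # naive Inception module input

ops_1x1 = H * W * 128 * 1 * 1 * C_in         # ~25.7M
ops_3x3 = H * W * 192 * 3 * 3 * C_in         # ~347M
ops_5x5 = H * W * 96 * 5 * 5 * C_in          # ~482M

total = ops_1x1 + ops_3x3 + ops_5x5
print(round(total / 1e6))                    # ~854M ops -> very expensive compute
```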
  • 218. ILSVRC winners (same chart as above). “Revolution of Depth”: ResNet jumps to 152 layers.
  • 219. Case Study: ResNet [He et al., 2015] Very deep networks using residual connections. - 152-layer model for ImageNet - ILSVRC’15 classification winner (3.57% top 5 error) - Swept all classification and detection competitions in ILSVRC’15 and COCO’15! [Residual block diagram: the input X goes through layers computing F(x); the identity X is added back, giving F(x) + x, followed by ReLU]
  • 220. Case Study: ResNet [He et al., 2015] [Left: “plain” layers that try to fit H(x) directly. Right: a residual block with an identity shortcut.] Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.
  • 221. Case Study: ResNet [He et al., 2015] Use the layers to fit the residual F(x) = H(x) − x instead of H(x) directly, so that H(x) = F(x) + x.
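A hedged Keras sketch of such a residual block (the filter count and the use of batch normalization are assumptions; it also assumes the input already has `filters` channels so the identity shortcut can be added directly):

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                             # identity path
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)    # layers that learn F(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                          # H(x) = F(x) + x
    return layers.Activation("relu")(y)
```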
  • 222. An Analysis of Deep Neural Network Models for Practical Applications, 2017. Comparing complexity... Inception-v4: Resnet + Inception! Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 223. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. VGG: Highest memory, most operations Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 224. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. GoogLeNet: most efficient Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 225. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. AlexNet: Smaller compute, still memory heavy, lower accuracy Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 226. Comparing complexity... An Analysis of Deep Neural Network Models for Practical Applications, 2017. ResNet: Moderate efficiency depending on model, highest accuracy Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
  • 227. ILSVRC winners (same chart as above). Network ensembling.
  • 228. Improving ResNets: “Good Practices for Deep Feature Fusion” [Shao et al. 2016] - Multi-scale ensembling of Inception, Inception-ResNet, ResNet and Wide ResNet models - ILSVRC’16 classification winner
  • 229. ILSVRC winners (same chart as above). Adaptive feature map reweighting.
  • 230. Improving ResNets: Squeeze-and-Excitation Networks (SENet) [Hu et al. 2017] - Add a “feature recalibration” module that learns to adaptively reweight feature maps - Global information (global avg. pooling layer) + 2 FC layers used to determine feature map weights - ILSVRC’17 classification winner (using ResNeXt-152 as a base architecture)
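A sketch of that “feature recalibration” module in Keras: global average pooling plus two FC layers produce one weight per feature map, which then rescales the maps. The reduction ratio r = 16 is an assumption based on the SENet paper, not stated on the slide:

```python
from tensorflow.keras import layers

def se_block(x, r=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                   # "squeeze": global information
    s = layers.Dense(channels // r, activation="relu")(s)    # FC 1
    s = layers.Dense(channels, activation="sigmoid")(s)      # FC 2: per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                         # adaptively reweight the feature maps
```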
  • 231. ILSVRC winners (same chart as above).
  • 232. ILSVRC winners (same chart as above). Completion of the challenge: the annual ImageNet competition was no longer held after 2017; it has moved to Kaggle.
  • 233. Summary: CNN Architectures Case Studies - AlexNet - VGG - GoogLeNet - ResNet Also.... - SENet - NiN (Network in Network) - Wide ResNet - ResNeXT - DenseNet - FractalNet - MobileNets - NASNet
  • 237–240. How to reduce overfitting? [series of illustrative slides]
  • 247. Summary: Data Augmentation - Augmentation helps to prevent overfitting - It makes the network invariant to certain transformations: translations, flips, etc. - Can be done on-the-fly - Can be used to learn image representations when no labeled datasets are available.
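A minimal sketch of on-the-fly augmentation in Keras (the particular transformations and their ranges are assumptions chosen for illustration):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,         # small random rotations
    width_shift_range=0.1,     # random horizontal translations
    height_shift_range=0.1,    # random vertical translations
    horizontal_flip=True,      # random left-right flips
)

# datagen.flow(x_train, y_train, batch_size=32) yields freshly augmented batches every
# epoch, so the augmentation happens on-the-fly and nothing extra is stored on disk.
```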
  • 248. Recurrent Neural Networks RNN, LSTM, GRU and BiLSTM 04
  • 249. WHAT IS THE NEXT SCENE?
  • 250. SHORTCOMINGS OF NORMAL NEURAL NETWORKS For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.
  • 251. RNN RNNs address this issue. They contain loops that allow information to persist. A chunk of neural network, A, looks at some input xt and outputs a value ht. A loop allows information to be passed from one step of the network to the next.
  • 252. RNN ▪ In traditional Networks, all inputs are independent of each other. ▪ RNN will make use of sequential information. ▪ If you want to predict the next word in a sentence you better know which words came before it (A sequence of words → Predict the next word).
  • 253. RNN… ▪ The most straightforward example is perhaps a time series of numbers, where the task is to predict the next value given previous values. ▪ The input to the RNN at every time step is the current value as well as a state vector which represents what the network has “seen” at previous time steps. ▪ This state vector is the encoded memory of the RNN, initially set to zero.
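A toy Keras sketch of exactly this setup: the RNN reads a short window of previous values (carrying its state across time steps) and predicts the next value. The sine-wave data, window length and layer sizes are assumptions made for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

series = np.sin(np.linspace(0, 20, 500))                      # a toy time series
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                                        # (samples, time steps, 1 feature)

model = keras.Sequential([
    layers.SimpleRNN(32, input_shape=(window, 1)),            # state vector carried across time steps
    layers.Dense(1),                                          # the predicted next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:1]))                                   # prediction for the first window
```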
  • 254. RNN… ▪ A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
  • 255. RNN – VARIOUS TYPES (1) Processing without RNN, from fixed-sized input to fixed-sized output Example: Image classification
  • 256. RNN – VARIOUS TYPES (2) Sequence output Example: Image captioning Takes an image and outputs a sentence of words. Single Image → Sequence of words (Caption)
  • 257–259. RNN – IMAGE CAPTIONING [illustrative examples]
  • 260. RNN – VARIOUS TYPES (3) Sequence input to a single output Example: Sentiment Prediction Takes a sequence of words (a sentence) and outputs a word (the sentiment). Sentence (sequence of words) → Sentiment (word)
  • 261. RNN – VARIOUS TYPES (4) Sequence input and sequence output Example: Machine Translation An RNN reads a sentence in English and then outputs a sentence in French.
  • 262. RNN – MACHINE TRANSLATION Sequence of words (German) → Sequence of words (English) The output only starts after we have seen the complete input, because the first word of the translated sentence may require information captured from the complete input sequence.
  • 263. RNN – VARIOUS TYPES (5) Synced sequence input and output Example: Video Classification Label each frame of the video
  • 265. RNN [diagram legend: x – input, w – weights, a – activation, b – output]
  • 267. RNN – DISADVANTAGES Can’t persist long-term dependencies. In “the clouds are in the sky,” we don’t need any further context to predict the last word – it’s pretty obvious it is going to be “sky”. In such cases, where the gap between the relevant information and the place where it’s needed is small, RNNs can learn to use the past information. But RNNs don’t have memory cells: x0 and x1 are already lost by the time we reach x3.
  • 268. RNN – DISADVANTAGES Can’t persist long-term dependencies. “I grew up in Bangalore… I speak fluent ?” Ans: Kannada. The recent information (“I speak”) suggests that the next word is probably the name of a language. But which language? For that we need the context of Bangalore, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large, and as the gap grows, RNNs become unable to learn to connect the information.
  • 269. LSTM – LONG SHORT-TERM MEMORY ▪ RNNs cannot capture long-term dependencies. “I grew up in France… I speak fluent French.” We need the context of France… so we go for a new variant of the RNN: the LSTM.
  • 270. GATED RECURRENT UNITS (GRU)
  • 271. BIDIRECTIONAL RNN (e.g. BiLSTM)  The output at time t may depend not only on the previous elements in the sequence, but also on future elements.  Example: to predict a missing word in a sequence you want to look at both the left and the right context.  Two RNNs stacked on top of each other, reading the sequence in opposite directions.  The output is computed based on the hidden states of both RNNs.  Huang Zhiheng, Xu Wei, Yu Kai. Bidirectional LSTM Conditional Random Field Models for Sequence Tagging.
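A hedged Keras sketch of a BiLSTM sequence tagger in this spirit (vocabulary size, tag count, sequence length and embedding size are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, num_tags, maxlen = 10000, 9, 50   # assumptions for illustration

model = keras.Sequential([
    layers.Embedding(vocab_size, 64, input_length=maxlen),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),          # forward + backward LSTM
    layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")),  # one tag per token
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```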
  • 272. Autoencoders Intro, Denoising AE, Convolutional AE 05
  • 273. UNSUPERVISED LEARNING Unsupervised Learning Data: X (no labels!) Goal: Learn the structure of the data (learn correlations between features)
  • 274. PCA – PRINCIPAL COMPONENT ANALYSIS - Statistical approach for data compression and visualization - Invented by Karl Pearson in 1901 - Weakness: linear components only.
  • 275. TRADITIONAL AUTOENCODER ▪ Unlike PCA, the autoencoder can use activation functions to achieve non-linearity. ▪ It has been shown that an AE without activation functions achieves the PCA capacity. [z denotes the latent code in the diagram]
  • 277. LATENT SPACE REPRESENTATION latent: (of a quality or state) existing but not yet developed or manifest; hidden or concealed.
  • 278. LATENT SPACE REPRESENTATION Original → Latent Representation In this fairly simple example, let’s say our original data set is a 5 x 5 x 1 image. We set the latent space size to 3 x 1, which means that each compressed data point is a vector with 3 dimensions. Now every 5x5x1 data point (25 numbers) is uniquely identified by just 3 numbers.
  • 279. LATENT SPACE REPRESENTATION But you may be wondering, what are “similar” images and why does reducing the size of the data make similar images “closer together” in space? If we look at the three images above, two chairs of different color and a table, we will easily say that the two images of the chair are more similar while the table is the most different in shape.
  • 280. LATENT SPACE REPRESENTATION But what makes the other two chairs “alike”? Of course we consider related features such as angles and characteristics of shape; each such feature is represented in the latent space. In the latent space, irrelevant information such as the colour of the seat is ignored, or in other words we remove that information. As a result, when we reduce the dimensionality, the images of the two chairs become less different and more similar.
  • 281. AE – SIMPLE IDEA - Given data x (no labels) we would like to learn the functions f (encoder) and g (decoder) where f(x) = act(Wx + b) = z and g(z) = act(W′z + b′) = x̂, such that h(x) = g(f(x)) = x̂, where h is an approximation of the identity function. (z is some latent representation or code, “act” is a non-linearity such as the sigmoid, and x̂ is the reconstruction of x.)
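A minimal Keras sketch of f and g as single Dense layers trained so that g(f(x)) reconstructs x (the 784-dimensional input and the 32-dimensional code z are assumptions, e.g. flattened 28x28 images):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
z = layers.Dense(32, activation="sigmoid")(inputs)        # f(x) = act(Wx + b) = z
x_hat = layers.Dense(784, activation="sigmoid")(z)        # g(z) = act(W'z + b') = x_hat

autoencoder = keras.Model(inputs, x_hat)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, ...)  # note: the input is also the target
```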
  • 282. CONVOLUTIONAL AE Input (28,28,1)
    Encoder: Conv1, 16 filters @ (3,3), same → (28,28,16); MaxPool1 (2,2), same → (14,14,16); Conv2, 8 filters @ (3,3), same → (14,14,8); MaxPool2 (2,2), same → (7,7,8); Conv3, 8 filters @ (3,3), same → (7,7,8); MaxPool3 (2,2), same → (4,4,8) = hidden code
    Decoder: DConv1, 8 filters @ (3,3), same → (4,4,8); UpSample1 (2,2) → (8,8,8); DConv2, 8 filters @ (3,3), same → (8,8,8); UpSample2 (2,2) → (16,16,8); DConv3, 16 filters @ (3,3), valid → (14,14,16); UpSample3 (2,2) → (28,28,16); DConv4, 1 filter @ (5,5), same → (28,28,1) = output
    * Input values are normalized * All of the conv layers use relu activations except for the last conv, which uses sigmoid
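A hedged Keras sketch of the convolutional autoencoder laid out above; the layer order follows the slide, though a couple of shapes (such as the channel count after the last up-sampling) are reconstructed and may differ slightly from the original figure:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inp)   # (28, 28, 16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                      # (14, 14, 16)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (14, 14, 8)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                      # (7, 7, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (7, 7, 8)
code = layers.MaxPooling2D((2, 2), padding="same")(x)                   # (4, 4, 8) hidden code

x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(code)   # (4, 4, 8)
x = layers.UpSampling2D((2, 2))(x)                                      # (8, 8, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)      # (8, 8, 8)
x = layers.UpSampling2D((2, 2))(x)                                      # (16, 16, 8)
x = layers.Conv2D(16, (3, 3), activation="relu")(x)                     # valid -> (14, 14, 16)
x = layers.UpSampling2D((2, 2))(x)                                      # (28, 28, 16)
out = layers.Conv2D(1, (5, 5), activation="sigmoid", padding="same")(x) # (28, 28, 1)

conv_ae = keras.Model(inp, out)
conv_ae.compile(optimizer="adam", loss="binary_crossentropy")
```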
  • 283. DENOISING AUTOENCODERS Intuition: - We still aim to encode the input and NOT to mimic the identity function. - We try to undo the effect of a corruption process stochastically applied to the input. [Diagram: Noisy Input → Encoder → Latent space representation → Decoder → Denoised Input] A more robust model.
  • 285. DENOISING AUTOENCODERS Use Case: - Extract robust representation for a NN classifier. Encoder Latent space representation Noisy Input
  • 286. DENOISING AUTOENCODERS Instead of trying to mimic the identity function by minimizing L(x, g(f(x))), where L is some loss function, a DAE minimizes L(x, g(f(x̃))), where x̃ is a copy of x that has been corrupted by some form of noise.
  • 287. DENOISING AUTOENCODERS – PROCESS Take some input x and apply noise to obtain x̃
  • 288. DENOISING AUTOENCODERS – PROCESS Encode and decode the noisy input: the DAE computes g(f(x̃))
  • 289. DENOISING AUTOENCODERS – PROCESS The DAE output is the reconstruction x̂ = g(f(x̃))
  • 290. DENOISING AUTOENCODERS – PROCESS Compare the reconstruction x̂ with the original (clean) x
  • 291. DENOISING CONVOLUTIONAL AE – KERAS - 50 epochs. - Noise factor 0.5 - 92% accuracy on validation set.
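The corruption step with the noise factor 0.5 mentioned above might look like this (a sketch on MNIST; the dataset choice and batch size are assumptions, and `conv_ae` refers to the convolutional autoencoder sketched earlier):

```python
import numpy as np
from tensorflow import keras

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0          # normalize to [0, 1]
x_train = x_train[..., np.newaxis]                   # (N, 28, 28, 1)

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)

# Train on (noisy input -> clean target) pairs for the 50 epochs mentioned above:
# conv_ae.fit(x_train_noisy, x_train, epochs=50, batch_size=128)
```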
  • 292. DL Tools / Frameworks Keras, Tensorflow 06
  • 293. MACHINE LEARNING TOOLS/FRAMEWORKS Caffe is a deep learning framework made with expression, speed, and modularity in mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors, and is popular for computer vision. Caffe supports cuDNN v5 for GPU acceleration. Supported interfaces: C, C++, Python. Caffe2 is a deep learning framework enabling simple and flexible deep learning. Built on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind, allowing for a more flexible way to organize computation. Caffe2 supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python
  • 294. MACHINE LEARNING TOOLS/FRAMEWORKS The Microsoft Cognitive Toolkit — previously known as CNTK — is a unified deep-learning toolkit from Microsoft Research that makes it easy to train and combine popular model types across multiple GPUs and servers. Microsoft Cognitive Toolkit implements highly efficient CNN and RNN training for speech, image and text data. Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration. Supported interfaces: Python, C++, C# and command line interface. TensorFlow is a software library for numerical computation using data flow graphs, developed by Google’s Machine Intelligence research organization. TensorFlow supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python
  • 295. MACHINE LEARNING TOOLS/FRAMEWORKS Theano is a math expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions involving multi-dimensional arrays. Theano supports cuDNN v5 for GPU acceleration. Supported interfaces: Python. Keras is a minimalist, highly modular neural networks library, written in Python, and capable of running on top of either TensorFlow or Theano. Keras was developed with a focus on enabling fast experimentation. The cuDNN version depends on the version of TensorFlow or Theano installed with Keras. Supported interfaces: Python