Deep Learning
Content
• Image enhancement
• Spatial domain enhancement
• Point operations, Spatial operations
• Digital Negative
• Contrast Stretching
• Thresholding
• Gray level slicing
• Bit plane slicing
• Dynamic range compression
• Histogram modeling
• Adaptive Histogram Equalization
• Frequency domain enhancement
• Fourier Transforms / Spectral (Frequency) Analysis
• Color Image
• Color Image Enhancement
AI Vs Machine Learning Vs Deep Learning
Artificial Neural Networks — Mapping the Human Brain
• The human brain consists of neurons or nerve cells which transmit and process
the information received from our senses.
• Many such nerve cells are arranged together in our brain to form a network of
nerves. These nerves pass electrical impulses, i.e. the excitation, from one
neuron to the other.
• The dendrites receive the impulse from the terminal button or synapse of an
adjoining neuron.
• Dendrites carry the impulse to the nucleus of the nerve cell (soma).
• Here, the electrical impulse is processed and then passed on to the axon.
• The axon is the longest branch among the dendrites; it carries the impulse from
the soma to the synapse.
• The synapse then passes the impulse to the dendrites of the next neuron.
• Thus, a complex network of neurons is created in the human brain.
Artificial Neuron Construction
• Like above, the neurons are created artificially on a computer. Connecting
many such artificial neurons creates an artificial neural network.
• An artificial neuron works similarly to a neuron present in the human brain.
• The data in the network flows through each neuron by a connection.
• Every connection has a specific weight by which the flow of data is
regulated.
Construction of an Artificial Neuron
• The same concept of the network of neurons is used in machine learning
algorithms.
• In this case, the neurons are created artificially on a computer. Connecting many
such artificial neurons creates an artificial neural network.
• The working of an artificial neuron is similar to that of a neuron present in our
brain.
• The data in the network flows through each neuron by a connection.
• Every connection has a specific weight by which the flow of data is regulated.
Neural networks Applications
✓Translate text
✓Identify faces
✓Recognize speech
✓Read handwritten text
✓Control robots
Artificial Neuron Working Principle
• x1 and x2 are the two inputs. They could be an integer, float etc. Assume them as 1
and 0 respectively.
• When these inputs pass through the connections ( through W1 and W2 ), they are
adjusted depending on the connection weights.
• Let’s assume W1 = 0.5, W2 = 0.6 and W3 = 0.2. The weighted sum is then
z = x1 * W1 + x2 * W2 = 1 * 0.5 + 0 * 0.6 = 0.5
• Hence, 0.5 is the value obtained after weighting the inputs by the connections.
• The connections are assumed to be the dendrites in an artificial neuron.
• To process the information, we need an activation function (the soma).
• Here, use the simple sigmoid function: f(z) = 1/(1 + e^(-z))
• f(0.5) = 0.6224
• The above value is the output of the neuron (the axon).
• This value is then multiplied by W3:
0.6224 * W3 = 0.6224 * 0.2 = 0.12448
• Now, again apply the activation function to this value:
f(0.12448) = 0.5310
• Hence, y (the final prediction) = 0.5310
• In this example the weights were randomly generated.
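As a sanity check, here is a minimal Python sketch of this worked example; the inputs and the weights W1, W2, W3 are the assumed values from the slide above.

```python
import math

def sigmoid(z):
    # Sigmoid activation: f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# Inputs and connection weights from the worked example above
x1, x2 = 1.0, 0.0
W1, W2, W3 = 0.5, 0.6, 0.2

z = x1 * W1 + x2 * W2            # weighted sum at the neuron: 0.5
hidden_out = sigmoid(z)          # activation (the "soma"): ~0.6224
y = sigmoid(hidden_out * W3)     # weight the output by W3 and activate again: ~0.5311

print(round(hidden_out, 4), round(y, 4))
```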
Working of Neural Network
➢ A neural network contains different layers.
➢ Input layer - The first layer; it picks up the input signals and passes
them to the next layer.
➢ Hidden layer - The next layer does all kinds of calculations and feature
extractions.
➢ There is often more than one hidden layer.
➢ Output layer - Provides the final result.
Classification of Neural Networks
1. Shallow neural network: The Shallow neural network has only one
hidden layer between the input and output.
2. Deep neural network: Deep neural networks have more than one hidden
layer. For instance, the GoogLeNet model for image recognition has 22
layers.
Working of Neural Network
The complete training process of a neural network involves two steps.
1. Forward Propagation - Receive input data, process the information, and generate output
• Images are fed into the input layer in the form of numbers. These numerical values denote the intensity of
pixels in the image.
• The neurons in the hidden layers apply a few mathematical operations on these values to learn the features.
• To perform these mathematical operations, there are certain parameter values that are randomly initialized.
• After these mathematical operations at the hidden layers, the result is sent to the output layer, which generates
the final prediction.
2. Backward Propagation - Calculate error and update the parameters of the network
• Once the output is generated, the next step is to compare the output with the actual (Ground truth/label) value
y.
• Based on the final output, and how close or far (difference) this is from the actual value (error), the values of
the parameters are updated.
• The forward propagation process is repeated using the updated parameter values and new outputs are
generated.
Activation function
• An activation function defines the output of a node for a given input or set of inputs.
• It activates or deactivates neurons to obtain the desired output.
• It also performs a nonlinear transformation on the input to get better results in a complex neural network.
• The activation function also helps to normalize the output of any input into a range such as -1 to 1.
• The activation function must be efficient and should reduce the computation time, because a neural network
is sometimes trained on millions of data points.
• In any neural network, the activation function decides whether the given input or received information is relevant or
irrelevant.
Types of Activation Functions
1. Binary step
2. Linear activation function
Non Linear Activation Functions:
3. ReLU
4. LeakyReLU
5. Sigmoid
6. Tanh
7. Softmax
1. Binary (Unit) Step Activation Function
• The binary step activation function is a threshold-based classifier: if the input value is above the
threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it is not activated.
• f(x) = 1 if x ≥ 0
          0 if x < 0
• Here, the threshold value is 0.
• It is very simple and useful for
binary classification.
• It does not allow multi-value outputs
i.e. cannot support classifying the
inputs into one of several categories.
2. Linear Activation Function
• It is a simple straight-line activation function where the output is directly proportional to the weighted sum of the
inputs. Linear activation functions are better at giving a wide range of activations, and a line of positive slope may
increase the firing rate as the input rate increases. Range is –infinity to +infinity.
f(x) = X
• Not possible to use backpropagation (gradient descent) to train the model—the derivative of the function with respect
to X is a constant, and has no relation to the input, X. So it’s not possible to go back and understand which weights in the
input neurons can provide a better prediction.
• All layers of the neural network collapse into single layer with linear activation functions, no matter how many layers
in the neural network, the last layer will be a linear function of the first layer (because a linear combination of linear
functions is still a linear function). So a linear activation function turns the neural network into just one layer.
• A neural network with a linear activation function is simply a linear regression model. It has limited power and limited
ability to handle complex, varying input data.
3. ReLU( Rectified Linear unit) Activation function
• The rectified linear unit (ReLU), f(x) = max(0, x), is the most widely used activation function. Range is from 0 to infinity.
• It converges quickly, so computation is fast.
• Disadvantage:
• Dying ReLU’ problem: when inputs approach zero, or are negative, the gradient of the function becomes zero, the
network cannot perform backpropagation and cannot learn.
• To avoid this, use the Leaky ReLU function instead of ReLU; in Leaky ReLU the range is expanded, which
enhances performance.
4. Leaky ReLU Activation Function
• The Leaky ReLU activation function solves the ‘Dying ReLU’ problem; it is a variation of ReLU.
• It has a small positive slope in the negative area, so it does enable backpropagation, even for negative
input values.
• Disadvantage: Leaky ReLU does not provide consistent predictions for negative input values.
5. Sigmoid Activation Function
• The sigmoid activation function is a probabilistic approach towards decision making; its output range is
between 0 and 1, i.e. it normalizes the output of each neuron.
• Since the outputs lie in this bounded range, the prediction (decision) is easier to interpret.
• The equation for the sigmoid function is f(x) = 1/(1 + e^(-x))
• Disadvantage: The vanishing gradient problem occurs because it squashes large inputs into the range 0 to 1,
so the derivatives become very small and do not give satisfactory updates.
• To solve this problem, ReLU is used, which does not have this small-derivative problem.
6. Hyperbolic Tangent Activation Function(Tanh)
• This activation function works like the sigmoid function, and usually performs slightly better.
• Zero centered— Making it easier to model inputs that have strongly negative, neutral, and strongly positive
values.
• Range is in between -1 to 1.
7. Softmax Activation Function
• Softmax is used at the last layer i.e. output to classify inputs into multiple classes. Range is 0 to 1.
• Softmax scales the input values relative to each other so that they sum to one, i.e. the sum of all the
output probabilities equals one.
• For a multi-class classification problem, use softmax together with the cross-entropy loss.
• Negative inputs are converted into nonnegative values using the exponential function.
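A minimal NumPy sketch of the activation functions described above; the Leaky ReLU slope alpha = 0.01 is an assumed, commonly used default.

```python
import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)            # threshold at 0

def linear(x):
    return x                                  # f(x) = x

def relu(x):
    return np.maximum(0, x)                   # 0 for negatives, x otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)      # small positive slope for negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                         # zero-centred, range (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract max for numerical stability
    return e / e.sum()                        # outputs sum to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x))
```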
Parameter
Learning Rate
• The learning rate (step size) is a hyper-parameter, typically in the range 0.0 to 1.0.
• It controls the amount by which the weights are updated during training.
• The learning rate controls how quickly the model is adapted to the problem.
• Smaller learning rates require more training epochs given the smaller changes made to the weights each update,
whereas larger learning rates result in rapid changes and require fewer training epochs.
Bias
• The bias is a node that is always 'on'. i.e. its value is set to 1.
• Bias is like the intercept in a linear equation. It is an additional parameter in the Neural Network which is used
to adjust the output along with the weighted sum of the inputs to the neuron.
Output = sum (W * X) + bias
• Therefore Bias is a constant which helps the model in a way that it can fit best for the given data.
• Weights decide how fast the activation function will trigger, whereas the bias
is used to delay (shift) the triggering of the activation function.
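A tiny sketch of the weighted-sum-plus-bias computation; the weights, inputs, and bias here are made-up illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([0.4, -0.2, 0.7])   # example weights (assumed)
X = np.array([1.0, 2.0, 0.5])    # example inputs (assumed)
bias = -0.5                      # shifts the weighted sum before activation

z = np.dot(W, X) + bias          # Output = sum(W * X) + bias
print(sigmoid(z))                # the bias shifts where the activation "triggers"
```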
Bias, Variance, underfit, overfit
➢ Bias: Error of the training data
➢ Variance: error of the test data
➢ Overfitting: low bias and high variance
➢ Under-fitting: high bias and high variance
➢ Goal always should be low bias and low variance.
➢ Example: Regression problem (take a linear regression model to create a best-fit line).
L2 = weight decay
• Regularization term that penalizes big weights, added to the objective function
• Weight decay value determines how dominant regularization is during gradient computation
• Big weight decay coefficient → big penalty for big weights
Regularization
Dropout
• Randomly drop units (neurons along with their connections) during training
• Each unit retained with fixed probability p, independent of other units
• Hyper-parameter p to be chosen (tuned)
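A minimal sketch of dropout applied to a layer's activations, assuming the common "inverted dropout" formulation (p_keep is the retention probability p above; kept units are rescaled so the expected activation is unchanged).

```python
import numpy as np

def dropout(activations, p_keep=0.8):
    # Each unit is retained with probability p_keep, independently of the others.
    mask = np.random.rand(*activations.shape) < p_keep
    # Scale the survivors by 1/p_keep ("inverted dropout"); applied only during training.
    return activations * mask / p_keep

a = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(a, p_keep=0.8))
```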
J_reg(θ) = J(θ) + λ Σ_k θ_k²   (the objective with the L2 weight-decay penalty added)
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs
• n is called patience
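A minimal sketch of the early-stopping logic; the validate function here is a stand-in that simulates a validation error which eventually stops improving.

```python
import random

def validate(epoch):
    # Stand-in for a real validation pass: error decreases, then plateaus with noise.
    return max(0.2, 1.0 - 0.1 * epoch) + random.uniform(0.0, 0.05)

patience = 5                 # n: stop after n epochs without improvement
best_val_error = float("inf")
wait = 0

for epoch in range(100):
    val_error = validate(epoch)
    if val_error < best_val_error:
        best_val_error, wait = val_error, 0      # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:                     # no improvement for n epochs
            print(f"Stopping at epoch {epoch}, best validation error {best_val_error:.3f}")
            break
```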
Tuning hyper-parameters
• “Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x).
• With grid search, the nine trials test g(x) in only three distinct places.
• With random search, all nine trials explore distinct values of g.”
• Both grid and random search try configurations blindly.
• The next trial is independent of all the trials done before.
• Bayesian optimization for hyper-parameter tuning:
• Make smarter choice for the next trial, minimize the number of trials
• Collect the performance at several configurations
• Make inference and decide what configuration to try next.
Figure: f(x, y) = g(x) + h(y) ≈ g(x); g(x) is shown in green and h(y) in yellow.
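A minimal sketch contrasting grid and random search of nine trials on a function of the form f(x, y) = g(x) + h(y) ≈ g(x); the functions g and h here are made up for illustration.

```python
import itertools
import random

def g(x):               # the important dimension
    return -(x - 0.3) ** 2

def h(y):               # the nearly irrelevant dimension
    return 0.01 * y

def f(x, y):            # f(x, y) = g(x) + h(y) ≈ g(x)
    return g(x) + h(y)

# Grid search: nine trials, but only three distinct values of x are ever tested.
grid_trials = list(itertools.product([0.0, 0.5, 1.0], repeat=2))
best_grid = max(grid_trials, key=lambda cfg: f(*cfg))

# Random search: nine trials, each with a distinct value of x.
random_trials = [(random.random(), random.random()) for _ in range(9)]
best_random = max(random_trials, key=lambda cfg: f(*cfg))

print("grid best:", best_grid, "random best:", best_random)
```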
• A neural network is structured like the human brain and consists of artificial neurons, also known as nodes. These
nodes are stacked next to each other in three layers:
❖ input layer
❖ hidden layer(s)
❖ output layer
• Data provides each node with information in the form of inputs. The node multiplies the inputs by (initially
random) weights, sums the results, and adds a bias. Finally, nonlinear functions, also known as activation functions, are
applied to determine which neuron to fire.
Working of Neural Network (Recall……….)
Deep Learning
• Deep learning has an inbuilt automatic multi stage feature learning process that learns rich hierarchical
representations (i.e. features).
• Deep Neural Network – Neural Network that contains more than one hidden layer between input and output.
• The series of hidden layers between input & output perform feature identification and processing in a series of
stages like our brain.
• In deep learning, many algorithms learn the weights of networks with many hidden layers.
Figure: Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Hidden Layer 4 → Output Layer.
Why is DL needed?
Figure: Performance vs. sample size – deep learning compared with traditional ML algorithms (x-axis: size of data, y-axis: performance).
o Manually designed features in ML are often over-specified, incomplete and take a long time to design and validate.
o Learned Features are easy to adapt, fast to learn.
o Deep learning provides a very flexible, universal, learnable framework for representing world, visual and linguistic
information.
o Can learn both unsupervised and supervised
o Effective end-to-end joint system learning
o Utilize large amounts of training data
Multi-Layer Neural Network training
Train this layer first, then this layer, then this layer, then this layer, and finally this layer.
EACH of the (non-output) layers is trained to learn good features that describe what
comes from the previous layer.
Deep learning Process
• A deep neural network learns automatically, without predefined knowledge explicitly coded by the
programmers.
• Each layer represents a deeper level of knowledge, i.e., the hierarchy of knowledge.
• The learning occurs in two phases.
• The first phase (Forward Propagation) consists of applying a nonlinear transformation to the
input and creating a statistical model as output.
• The second phase (Backpropagation) aims at improving (optimizing) the model with a mathematical
method known as the derivative.
• The neural network repeats these two phases hundreds to thousands of times until it reaches a
tolerable (converged) level of accuracy. Each repetition of these two phases is called an iteration.
Deep learning Process
• Deep learning has an inbuilt automatic multi stage feature learning process that learns
rich hierarchical representations (i.e. features).
Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output (e.g. outdoor, indoor)
E.g.
• Image: Pixel → Edge → Texture → Motif → Part → Object
• Text: Character → Word → Word-group → Clause → Sentence → Story
Deep Learning Frameworks
1. TensorFlow
2. Keras
3. PyTorch
4. Theano
5. DL4J
6. Caffe
7. Chainer
8. Microsoft CNTK
Deep Learning Applications
• Social Network Filtering,
• Fraud Detection,
• Image and Speech Recognition,
• Audio Recognition,
• Computer Vision,
• Medical Image Processing,
• Bioinformatics,
• Customer Relationship Management,
• Driverless car
• Mobile phone
• Google Search Engine
• TV, etc.
Types of Deep Learning Algorithms
1. Convolutional Neural Networks (CNNs)
2. Recurrent Neural Networks (RNNs)
3. Long Short Term Memory Networks (LSTMs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines( RBMs)
10. Autoencoders
Convolutional Neural Networks (CNNs)
• CNNs (ConvNets) consist of multiple layers.
• Yann LeCun developed the first CNN, called LeNet, in 1988. It was used for recognizing characters like ZIP
codes and digits.
• CNNs are widely used for image processing tasks such as identifying satellite images, processing medical images, and
forecasting time series, and for object detection tasks such as detecting anomalies.
• Output: Binary, Multinomial, Continuous, Count
• Input: fixed size, can use padding to make all images same size.
• Architecture: User’s Choice (depends on application) which requires experimentation.
• Optimization: Backward propagation
• Hyper-parameters for a very deep model can be estimated properly only if you have billions of images.
• Otherwise, use an architecture and trained hyper-parameters already experimented with by researchers (ImageNet
models, Microsoft/Google APIs, etc.).
Convolutional Neural Networks (CNNs)
• Build a Convolutional Neural Network
• Initialize the weights and bias randomly.
• Fix the input and output.
• Forward Propagation: Receive input data, process the information, and generate output
• Backward Propagation: Calculate error, gradient and update the parameters of the network
Architecture:
• Build a Convolutional Neural Network with hidden layers.
• Each hidden layer may have a separate activation function like ReLU or sigmoid.
• Final layer will have Softmax.
• Error is calculated using cross-entropy.
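A minimal Keras sketch of such an architecture; the input size, layer widths, filter counts, and the 10-class output are assumed for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Weights, biases and filters are initialized randomly by Keras; input/output sizes are fixed.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # fixed-size input (e.g. 28x28 grayscale)
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution layer + ReLU
    layers.MaxPooling2D((2, 2)),                   # pooling (down-sampling)
    layers.Flatten(),                              # flatten to a 1D vector
    layers.Dense(64, activation="relu"),           # fully connected hidden layer
    layers.Dense(10, activation="softmax"),        # final layer: softmax over the classes
])

# Error is calculated using cross-entropy; backward propagation updates the parameters.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```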
CNN's have multiple layers that process the input and extract features from data:
1. Convolution Layer
• CNN has a convolution layer that has several filters to perform the convolution operation.
2. Rectified Linear Unit (ReLU)
• CNNs have a ReLU layer that performs an element-wise operation. The output is a rectified feature map.
3. Pooling Layer
• The rectified feature map next feeds into a pooling layer. Pooling is a down-sampling operation that reduces the
dimensions of the feature map.
• The pooling layer then converts the resulting two-dimensional arrays from the pooled feature map into a single, long,
continuous, linear vector by flattening it.
4. Fully Connected Layer
• A fully connected layer forms when the flattened matrix from the pooling layer is fed as an input, which classifies and
identifies the images.
Convolutional Neural Networks (CNNs) - Working Principle
Parameters Computation
• Figure: a network with inputs x1…x4, one hidden layer a1(1)…a4(1), and a softmax output unit a1(2) producing Y.
• Number of parameters: 4*4 + 4 + 1 = 21.
Parameters Computation for Image
• Figure: a 400 x 400 x 3 image gives 480,000 inputs (x1…x480000); a fully connected hidden layer of the same size a1(1)…a480000(1) feeds an output unit a1(2).
• Number of parameters: 480000*480000 + 480000 + 1 ≈ 230 billion!
• This is why convolutional layers, which share a small set of filter weights, are used instead of fully connected layers for images.
Convolutional Layer – Feature Extraction
• Filter: a 3 x 3 (Laplacian) kernel
 0  1  0
 1 -4  1
 0  1  0
Figure: the input image (a 20 x 20 grid of normalized pixel intensities) and the convoluted image produced by applying this filter.
▪ Inspired by the neurophysiological experiments conducted by Hubel and Wiesel (1962).
• What is Convolution?
Input Image (4 x 4):      Filter (2 x 2):      Convolved Image (Feature Map):
a b c d                   w1 w2                h1 h2 …
e f g h                   w3 w4
i j k l
m n o p

h1 = f(a*w1 + b*w2 + e*w3 + f*w4)
h2 = f(b*w1 + c*w2 + f*w3 + g*w4)
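A minimal NumPy sketch of this sliding-window ("valid") convolution; the image and kernel values are made up, and the activation f is left as the identity for simplicity.

```python
import numpy as np

def convolve2d(image, kernel, f=lambda z: z):
    # Slide the filter over the image, take weighted sums, then apply the activation f.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = f(np.sum(patch * kernel))
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # stands in for a..p
kernel = np.array([[0.1, 0.2],
                   [0.3, 0.4]])                    # stands in for w1..w4
print(convolve2d(image, kernel))                   # 3 x 3 feature map
```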
Convolutional Layer Process
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Figure: an input matrix being convolved with a 3x3 convolutional filter.
• Example:
▪ In convolutional neural networks, hidden units are only connected to a local receptive field.
Figure: the input image is convolved with Filter 1 (weights w1–w4) to produce the Layer 1 feature map, which is in turn convolved with Filter 2 (weights w5–w8) to produce the Layer 2 feature map.
Pooling
• Max pooling: Provides the maximum output within a rectangular neighborhood.
• Average pooling: Provides the average output of a rectangular neighborhood.
• Stride – the number of steps (columns when moving left to right, rows when moving top to bottom) the pooling
window moves each time.
Example: MaxPool with a 2 x 2 filter and a stride of 2
Input Matrix:        Output Matrix:
1 3 5 3              4 5
4 2 3 1              3 4
3 1 1 3
0 1 0 4
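A minimal NumPy sketch of 2 x 2 max pooling with stride 2, reproducing the example above.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Take the maximum within each size x size window, moving by `stride` each step.
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]])
print(max_pool(x))   # [[4. 5.], [3. 4.]]
```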
Fully Connected Layer
• So far, the convolution layer has extracted some valuable features from the data. These features are sent to the fully
connected layer that generates the final results.
• The fully connected layer in a CNN is just the traditional neural network.
• The output from the convolution layer is a 2D matrix. Ideally, we would want each row to represent a single input
image.
• Actually, the fully connected layer can only work with 1D data. Hence, the values generated from the previous
operation are first converted into a 1D format.
• Once the data is converted into a 1D array, it is sent to the fully connected layer. All of these individual values are
treated as separate features that represent the image.
• The fully connected layer performs two operations on the incoming data – a linear transformation (a weighted sum
plus bias) and a non-linear transformation (the activation function, which captures complex relationships).
CNN – Backward Propagation
• During the forward propagation process, we randomly initialized the weights, biases and filters.
• These values are treated as parameters from the convolutional neural network algorithm.
• In the backward propagation process, the model tries to update the parameters such that the overall predictions are more
accurate.
• For updating these parameters, use the gradient descent technique.
Gradient Descent to minimize (optimize) the loss (error):
• Consider the following curve for a loss function that depends on a single parameter a:
• During the random initialization of the parameter, we get the value of a as a2.
• It is clear from the picture that the minimum value of loss is at a1 and not a2.
• The gradient descent technique tries to find this value of parameter (a) at which the loss is minimum.
• From the figure, We understand that we need to update the value a2 and bring it closer to a1.
• To decide the direction of movement, i.e. whether to increase or decrease the value of the parameter, we calculate the
gradient or slope at the current point.
• Based on the value of the gradient, we can determine the updated parameter values.
• When the slope is negative, the value of the parameter will be increased.
• When the slope is positive, the value of the parameter should be decreased by a small amount.
Generic equation for updating the parameter values:
• new_parameter = old_parameter - (learning_rate * gradient_of_parameter)
• The learning rate is a constant that controls the amount of change being made to the old value. The slope or the gradient determines the direction of the new
values, that is, whether the values should be increased or decreased. So, we need to find the gradients, that is, the change in error with respect to the parameters, in order to update them.
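A minimal sketch of this update rule on a one-dimensional loss curve; the loss function, its minimum at a1 = 1.0, and the starting point a2 = 3.0 are made up for illustration.

```python
def loss(a):
    return (a - 1.0) ** 2            # minimum of the loss is at a1 = 1.0

def gradient(a):
    return 2.0 * (a - 1.0)           # slope of the loss at the current point

a = 3.0                              # random initialization (a2)
learning_rate = 0.1

for _ in range(50):
    # new_parameter = old_parameter - (learning_rate * gradient_of_parameter)
    a = a - learning_rate * gradient(a)

print(round(a, 4))                   # close to 1.0, where the loss is minimum
```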
CNN – Backward Propagation
• CNN contains three parameters – weights, biases and filters. Let calculate the gradients for these parameters one by one.
Backward Propagation at Fully Connected Layer
• Fully connected layer has two parameters – weight matrix and bias matrix.
• Let us start by calculating the change in error with respect to weights – ∂E/∂W.
• Since the error is not directly dependent on the weight matrix, we will use the concept of chain rule to find this value.
• The computation graph shown below will help us define ∂E/∂W:
• ∂E/∂W = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂W
Find these derivatives separately:
1. Change in error with respect to output
• Suppose the actual values for the data are denoted as y’ and
• the predicted output is represented as O.
• Then the error is: E = (y' - O)^2 / 2
• If we differentiate the error with respect to the output: ∂E/∂O = -(y'-O)
2. Change in output with respect to Z2 (linear transformation output)
• To find the derivative of the output O with respect to Z2, we must first define O in terms of Z2. The output is the sigmoid of Z2.
Thus, ∂O/∂Z2 is effectively the derivative of the sigmoid function f(x) = 1/(1 + e^(-x)).
• The derivative of this function: f'(x) = (1 + e^(-x))^(-1) [1 - (1 + e^(-x))^(-1)]
f'(x) = sigmoid(x)(1 - sigmoid(x))
∂O/∂Z2 = (O)(1 - O)
CNN – Backward Propagation
3. Change in Z2 with respect to W (Weights)
• The value Z2 is the result of the linear transformation process. Here is the equation of Z2 in terms of weights:
Z2 = W^T · A1 + b
• On differentiating Z2 with respect to W, we will get the value A1 itself:
∂Z2/∂W = A1
• Now that we have the individual derivations, we can use the chain rule to find the change in error with respect to weights:
∂E/∂W = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂W
∂E/∂W = -(y' - O) · sigmoid'(Z2) · A1
• The shape of ∂E/∂W will be the same as the weight matrix W.
• We can update the values in the weight matrix:
W_new = W_old - lr*∂E/∂W
4. Change in Z2 with respect to b (bias)
• Differentiating Z2 = W^T · A1 + b with respect to b gives 1, so ∂E/∂b = ∂E/∂O · ∂O/∂Z2.
• We can then update the bias values:
b_new = b_old - lr*∂E/∂b
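A minimal NumPy sketch of these fully connected layer updates for a single training example, using the squared-error loss E = (y' - O)^2 / 2 defined above; the layer sizes and input activations A1 are assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A1 = np.array([[0.2], [0.7], [0.1]])   # activations from the previous layer (assumed)
W = np.random.randn(3, 1) * 0.1        # weight matrix, randomly initialized
b = np.zeros((1, 1))                   # bias
y_true = np.array([[1.0]])             # actual value y'
lr = 0.1

# Forward pass: Z2 = W^T . A1 + b, O = sigmoid(Z2)
Z2 = W.T @ A1 + b
O = sigmoid(Z2)

# Backward pass: chain rule  dE/dW = dE/dO . dO/dZ2 . dZ2/dW
dE_dO = -(y_true - O)                  # from E = (y' - O)^2 / 2
dO_dZ2 = O * (1 - O)                   # derivative of the sigmoid
dE_dZ2 = dE_dO * dO_dZ2
dE_dW = A1 @ dE_dZ2                    # same shape as W
dE_db = dE_dZ2                         # dZ2/db = 1

W = W - lr * dE_dW                     # W_new = W_old - lr * dE/dW
b = b - lr * dE_db                     # b_new = b_old - lr * dE/db
```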
CNN – Backward Propagation
Backward Propagation at Convolution Layer
• For the convolution layer, we had the filter matrix as a parameter. During the forward propagation process, we randomly
initialized the filter matrix.
• Update these values using:
new_parameter = old_parameter - (learning_rate * gradient_of_parameter)
• To update the filter matrix, find the gradient of the parameter ∂E/∂f.
• Figure shows the computation for backward propagation.
• The derivative ∂E/∂f is:
∂E/∂f = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂A1 · ∂A1/∂Z1 · ∂Z1/∂f
• We have already determined the values for ∂E/∂O and ∂O/∂Z2.
• Let us find the values for the remaining derivatives.
1. Change in Z2 with respect to A1
• To find the value for ∂Z2/∂A1 , we need to have the equation for Z2 in terms of A1:
Z2 = W^T · A1 + b
• On differentiating the above equation with respect to A1, we get W^T as the result: ∂Z2/∂A1 = W^T
2. Change in A1 with respect to Z1
• The next value that we need to determine is ∂A1/∂Z1. Have a look at the equation of A1. A1 = sigmoid(Z1)
• This is simply the Sigmoid function. The derivative of Sigmoid would be: ∂A1/∂Z1 = (A1)(1-A1)
CNN – Backward Propagation
Backward Propagation at Convolution Layer
3. Change in Z1 with respect to filter f
• Finally, we need the value for ∂Z1/∂f. Here’s the equation for Z1
Z1 = X * f
• Differentiating Z1 with respect to f simply gives us X:
∂Z1/∂f = X
• Now that we have all the required values,
• let’s find the overall change in error with respect to the filter:
∂E/∂f = ∂E/∂O · ∂O/∂Z2 · ∂Z2/∂A1 · ∂A1/∂Z1 * ∂Z1/∂f
• Notice that in the equation above, the value (∂E/∂O · ∂O/∂Z2 · ∂Z2/∂A1 · ∂A1/∂Z1) is convolved with ∂Z1/∂f instead of using a
simple dot product.
• Why? The main reason is that during forward propagation, we perform a convolution operation for the images and filters.
• This is repeated in the backward propagation process.
• Once we have the value for ∂E/∂f, we will use this value to update the original filter value:
f = f - lr*(∂E/∂f)
• This completes the backpropagation for convolutional neural networks.
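A minimal NumPy sketch of this filter-gradient computation for a single 2-D channel; the input X, filter size, and the upstream gradient delta = ∂E/∂Z1 (assumed to be already computed) are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "Valid" 2-D convolution as used in the forward pass: Z1 = X * f.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

X = np.random.rand(5, 5)           # input patch
f = np.random.rand(3, 3)           # filter, randomly initialized
lr = 0.01

Z1 = conv2d_valid(X, f)            # forward pass: Z1 = X * f
delta = np.ones_like(Z1)           # assumed upstream gradient dE/dZ1

# dE/df is itself a convolution: slide delta over the input X.
dE_df = conv2d_valid(X, delta)     # same shape (3 x 3) as the filter
f = f - lr * dE_df                 # f = f - lr * (dE/df)
```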
Recall - Deep learning Process
• Deep learning has an inbuilt automatic multi stage feature learning process that learns
rich hierarchical representations (i.e. features).
Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output (e.g. outdoor, indoor)
E.g.
• Image: Pixel → Edge → Texture → Motif → Part → Object
• Text: Character → Word → Word-group → Clause → Sentence → Story
Top 7 Applications of Convolutional Neural
Networks
1.Decoding Facial Recognition
Facial recognition is broken down by a convolutional neural network into the
following major components -
• Identifying every face in the picture
• Focusing on each face despite external factors, such as light, angle, pose,
etc.
• Identifying unique features
• Comparing all the collected data with already existing data in the database
to match a face with a name.
A similar process is followed for scene labeling as well.
2.Analyzing Documents
Convolutional neural networks can also be used for document analysis. This is not just useful for
handwriting analysis, but also has a major stake in recognizers. For a machine to be able to scan an
individual's writing, and then compare that to the wide database it has, it must execute almost a
million commands a minute. It is said that with the use of CNNs and newer models and algorithms, the
error rate has been brought down to a minimum of 0.4% at the character level, though its complete
testing is yet to be widely seen.
3.Historic and Environmental Collections
CNNs are also used for more complex purposes such as natural history collections. These
collections act as key players in documenting major parts of history such as biodiversity, evolution,
habitat loss, biological invasion, and climate change.
4.Understanding Climate
CNNs can be used to play a major role in the fight against climate change, especially in
understanding the reasons why we see such drastic changes and how we could experiment in
curbing the effect. It is said that the data in such natural history collections can also provide greater
social and scientific insights, but this would require skilled human resources such as researchers
who can physically visit these types of repositories. There is a need for more manpower to carry
out deeper experiments in this field.
5.Grey Areas
Introducing the grey area into CNNs is poised to provide a much more realistic picture of the
real world. Currently, CNNs largely function exactly like a machine, seeing a true and false
value for every question. However, as humans, we understand that the real world plays out in a
thousand shades of grey. Allowing the machine to understand and process fuzzier logic will help
it understand the grey area we humans live in and strive to work against. This will help CNNs get
a more holistic view of what a human sees.
6.Advertising
CNNs have already brought in a world of difference to advertising with the introduction of
programmatic buying and data-driven personalized advertising.
7. Other Interesting Fields
CNNs are poised to be the future with their introduction into driverless cars, robots that can mimic human
behavior, aides to human genome mapping projects, predicting earthquakes and natural disasters, and maybe
even self-diagnoses of medical problems.
So, you wouldn't even have to drive down to a clinic or schedule an appointment with a doctor to ensure your
sneezing attack or high fever is just the simple flu and not symptoms of some rare disease.
One problem that researchers are working on with CNNs is brain cancer detection.
The earlier detection of brain cancer can prove to be a big step in saving more lives affected by this illness.

More Related Content

What's hot

Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Simplilearn
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentMuhammad Rasel
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural networkSopheaktra YONG
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)EdutechLearners
 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Muhammad Haroon
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesMohammed Bennamoun
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
 
Neuro-fuzzy systems
Neuro-fuzzy systemsNeuro-fuzzy systems
Neuro-fuzzy systemsSagar Ahire
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural NetworksAniket Maurya
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief NetworksHasan H Topcu
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural NetworksDatabricks
 

What's hot (20)

Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Mc culloch pitts neuron
Mc culloch pitts neuronMc culloch pitts neuron
Mc culloch pitts neuron
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)
 
Cnn
CnnCnn
Cnn
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
 
Machine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
 
Neuro-fuzzy systems
Neuro-fuzzy systemsNeuro-fuzzy systems
Neuro-fuzzy systems
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural Networks
 
Recurrent neural network
Recurrent neural networkRecurrent neural network
Recurrent neural network
 
Activation function
Activation functionActivation function
Activation function
 
Deep Belief Networks
Deep Belief NetworksDeep Belief Networks
Deep Belief Networks
 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural Networks
 

Similar to Deep learning

Neural Network_basic_Reza_Lecture_3.pptx
Neural Network_basic_Reza_Lecture_3.pptxNeural Network_basic_Reza_Lecture_3.pptx
Neural Network_basic_Reza_Lecture_3.pptxshamimreza94
 
Activation function
Activation functionActivation function
Activation functionAstha Jain
 
Introduction to Perceptron and Neural Network.pptx
Introduction to Perceptron and Neural Network.pptxIntroduction to Perceptron and Neural Network.pptx
Introduction to Perceptron and Neural Network.pptxPoonam60376
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Simplilearn
 
Multilayer Perceptron Neural Network MLP
Multilayer Perceptron Neural Network MLPMultilayer Perceptron Neural Network MLP
Multilayer Perceptron Neural Network MLPAbdullah al Mamun
 
Neural-Networks.ppt
Neural-Networks.pptNeural-Networks.ppt
Neural-Networks.pptRINUSATHYAN
 
20200428135045cfbc718e2c.pdf
20200428135045cfbc718e2c.pdf20200428135045cfbc718e2c.pdf
20200428135045cfbc718e2c.pdfTitleTube
 
2011 0480.neural-networks
2011 0480.neural-networks2011 0480.neural-networks
2011 0480.neural-networksParneet Kaur
 
neuralnetwork.pptx
neuralnetwork.pptxneuralnetwork.pptx
neuralnetwork.pptxSherinRappai
 
Activation functions and Training Algorithms for Deep Neural network
Activation functions and Training Algorithms for Deep Neural networkActivation functions and Training Algorithms for Deep Neural network
Activation functions and Training Algorithms for Deep Neural networkGayatri Khanvilkar
 
33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdf33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdfgnans Kgnanshek
 

Similar to Deep learning (20)

UNIT 5-ANN.ppt
UNIT 5-ANN.pptUNIT 5-ANN.ppt
UNIT 5-ANN.ppt
 
Neural Network_basic_Reza_Lecture_3.pptx
Neural Network_basic_Reza_Lecture_3.pptxNeural Network_basic_Reza_Lecture_3.pptx
Neural Network_basic_Reza_Lecture_3.pptx
 
14_cnn complete.pptx
14_cnn complete.pptx14_cnn complete.pptx
14_cnn complete.pptx
 
Activation function
Activation functionActivation function
Activation function
 
Introduction to Perceptron and Neural Network.pptx
Introduction to Perceptron and Neural Network.pptxIntroduction to Perceptron and Neural Network.pptx
Introduction to Perceptron and Neural Network.pptx
 
Unit 2 ml.pptx
Unit 2 ml.pptxUnit 2 ml.pptx
Unit 2 ml.pptx
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
 
Multilayer Perceptron Neural Network MLP
Multilayer Perceptron Neural Network MLPMultilayer Perceptron Neural Network MLP
Multilayer Perceptron Neural Network MLP
 
UNIT-3 .PPTX
UNIT-3 .PPTXUNIT-3 .PPTX
UNIT-3 .PPTX
 
Neural-Networks.ppt
Neural-Networks.pptNeural-Networks.ppt
Neural-Networks.ppt
 
ANN - UNIT 2.pptx
ANN - UNIT 2.pptxANN - UNIT 2.pptx
ANN - UNIT 2.pptx
 
20200428135045cfbc718e2c.pdf
20200428135045cfbc718e2c.pdf20200428135045cfbc718e2c.pdf
20200428135045cfbc718e2c.pdf
 
2011 0480.neural-networks
2011 0480.neural-networks2011 0480.neural-networks
2011 0480.neural-networks
 
neuralnetwork.pptx
neuralnetwork.pptxneuralnetwork.pptx
neuralnetwork.pptx
 
neuralnetwork.pptx
neuralnetwork.pptxneuralnetwork.pptx
neuralnetwork.pptx
 
Activation functions and Training Algorithms for Deep Neural network
Activation functions and Training Algorithms for Deep Neural networkActivation functions and Training Algorithms for Deep Neural network
Activation functions and Training Algorithms for Deep Neural network
 
33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdf33.-Multi-Layer-Perceptron.pdf
33.-Multi-Layer-Perceptron.pdf
 
Perceptron
Perceptron Perceptron
Perceptron
 
Lec 6-bp
Lec 6-bpLec 6-bp
Lec 6-bp
 
Unit 6: Application of AI
Unit 6: Application of AIUnit 6: Application of AI
Unit 6: Application of AI
 

More from Kuppusamy P

Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnnKuppusamy P
 
Image segmentation
Image segmentationImage segmentation
Image segmentationKuppusamy P
 
Image enhancement
Image enhancementImage enhancement
Image enhancementKuppusamy P
 
Feature detection and matching
Feature detection and matchingFeature detection and matching
Feature detection and matchingKuppusamy P
 
Image processing, Noise, Noise Removal filters
Image processing, Noise, Noise Removal filtersImage processing, Noise, Noise Removal filters
Image processing, Noise, Noise Removal filtersKuppusamy P
 
Flowchart design for algorithms
Flowchart design for algorithmsFlowchart design for algorithms
Flowchart design for algorithmsKuppusamy P
 
Algorithm basics
Algorithm basicsAlgorithm basics
Algorithm basicsKuppusamy P
 
Problem solving using Programming
Problem solving using ProgrammingProblem solving using Programming
Problem solving using ProgrammingKuppusamy P
 
Parts of Computer, Hardware and Software
Parts of Computer, Hardware and Software Parts of Computer, Hardware and Software
Parts of Computer, Hardware and Software Kuppusamy P
 
Java methods or Subroutines or Functions
Java methods or Subroutines or FunctionsJava methods or Subroutines or Functions
Java methods or Subroutines or FunctionsKuppusamy P
 
Java iterative statements
Java iterative statementsJava iterative statements
Java iterative statementsKuppusamy P
 
Java conditional statements
Java conditional statementsJava conditional statements
Java conditional statementsKuppusamy P
 
Java introduction
Java introductionJava introduction
Java introductionKuppusamy P
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine LearningKuppusamy P
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
 
Machine Learning Performance metrics for classification
Machine Learning Performance metrics for classificationMachine Learning Performance metrics for classification
Machine Learning Performance metrics for classificationKuppusamy P
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionKuppusamy P
 

More from Kuppusamy P (20)

Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
 
Image segmentation
Image segmentationImage segmentation
Image segmentation
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 
Feature detection and matching
Feature detection and matchingFeature detection and matching
Feature detection and matching
 
Image processing, Noise, Noise Removal filters
Image processing, Noise, Noise Removal filtersImage processing, Noise, Noise Removal filters
Image processing, Noise, Noise Removal filters
 
Flowchart design for algorithms
Flowchart design for algorithmsFlowchart design for algorithms
Flowchart design for algorithms
 
Algorithm basics
Algorithm basicsAlgorithm basics
Algorithm basics
 
Problem solving using Programming
Problem solving using ProgrammingProblem solving using Programming
Problem solving using Programming
 
Parts of Computer, Hardware and Software
Parts of Computer, Hardware and Software Parts of Computer, Hardware and Software
Parts of Computer, Hardware and Software
 
Strings in java
Strings in javaStrings in java
Strings in java
 
Java methods or Subroutines or Functions
Java methods or Subroutines or FunctionsJava methods or Subroutines or Functions
Java methods or Subroutines or Functions
 
Java arrays
Java arraysJava arrays
Java arrays
 
Java iterative statements
Java iterative statementsJava iterative statements
Java iterative statements
 
Java conditional statements
Java conditional statementsJava conditional statements
Java conditional statements
 
Java data types
Java data typesJava data types
Java data types
 
Java introduction
Java introductionJava introduction
Java introduction
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
Machine Learning Performance metrics for classification
Machine Learning Performance metrics for classificationMachine Learning Performance metrics for classification
Machine Learning Performance metrics for classification
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 

Recently uploaded

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 

Recently uploaded (20)

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 

Deep learning

Working of a Neural Network
➢ A neural network contains different layers.
➢ Input layer - the first layer; it picks up the input signals and passes them to the next layer.
➢ Hidden layer(s) - the next layer(s) perform the calculations and feature extraction.
➢ There may be more than one hidden layer.
➢ Output layer - provides the final result.
Classification of Neural Networks
1. Shallow neural network: has only one hidden layer between the input and output.
2. Deep neural network: has more than one hidden layer. For instance, Google's GoogLeNet model for image recognition has 22 layers.
Working of a Neural Network
The complete training process of a neural network involves two steps.
1. Forward propagation - receive input data, process the information, and generate output.
• Images are fed into the input layer in the form of numbers. These numerical values denote the intensity of the pixels in the image.
• The neurons in the hidden layers apply mathematical operations on these values to learn the features.
• To perform these operations, certain parameter values are randomly initialized.
• After these operations in the hidden layers, the result is sent to the output layer, which generates the final prediction.
2. Backward propagation - calculate the error and update the parameters of the network.
• Once the output is generated, it is compared with the actual (ground truth/label) value y.
• Based on how close or far the output is from the actual value (the error), the values of the parameters are updated.
• Forward propagation is then repeated using the updated parameter values, and new outputs are generated.
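As a concrete illustration of these two phases, here is a minimal NumPy sketch of repeated forward and backward passes for a tiny one-hidden-layer network; the data, layer sizes, and learning rate are made-up values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples with 3 features each, and binary labels (illustrative values)
X = rng.random((4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Randomly initialised parameters, as described above
W1, b1 = rng.standard_normal((3, 5)), np.zeros((1, 5))
W2, b2 = rng.standard_normal((5, 1)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for epoch in range(1000):
    # Forward propagation: input -> hidden -> output
    Z1 = X @ W1 + b1
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2 + b2
    O = sigmoid(Z2)                       # prediction

    # Backward propagation: compare with the ground truth and push the error back
    E = 0.5 * np.sum((y - O) ** 2)        # E = (y' - O)^2 / 2, summed over samples
    dZ2 = -(y - O) * O * (1 - O)          # dE/dZ2
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)    # chain rule through the hidden layer
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)

    # Update the parameters and repeat with the new values
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```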
Activation Function
• An activation function defines the output of a node for a given input or set of inputs.
• It activates or deactivates neurons to obtain the desired output.
• It also performs a nonlinear transformation on the input, which gives better results on complex problems.
• An activation function can also normalize the output of a neuron, for example into the range -1 to 1.
• It must be computationally efficient, because a neural network is sometimes trained on millions of data points.
• In effect, the activation function decides whether the information a neuron receives is relevant or should be ignored.
Types of Activation Functions
1. Binary step
2. Linear activation function
Non-linear activation functions:
3. ReLU
4. Leaky ReLU
5. Sigmoid
6. Tanh
7. Softmax
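The functions listed above can each be written in a few lines; the following NumPy sketch is one possible set of definitions (the leaky-ReLU slope of 0.01 is a commonly used default, assumed here, not a value from the slides).

```python
import numpy as np

def binary_step(x):               # 1 if x > 0, else 0
    return (x > 0).astype(float)

def linear(x):                    # f(x) = x
    return x

def relu(x):                      # max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):    # small positive slope alpha for x < 0
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):                   # 1 / (1 + e^-x), range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                      # range (-1, 1), zero-centred
    return np.tanh(x)

def softmax(x):                   # class probabilities that sum to 1
    e = np.exp(x - np.max(x))     # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), sigmoid(z), softmax(z))
```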
1. Binary (Unit) Step Activation Function
• The binary step activation function is a threshold-based classifier: if the input value is above a certain threshold, the neuron is activated and sends the signal to the next layer; otherwise it is not.
• f(x) = 1 if x > 0, and f(x) = 0 if x ≤ 0
• Here the threshold value is 0.
• It is very simple and useful for binary classification problems.
• It does not allow multi-valued outputs, i.e. it cannot classify inputs into one of several categories.
2. Linear Activation Function
• A simple straight-line activation function where the output is directly proportional to the weighted sum of the inputs. A line with a positive slope increases the firing rate as the input increases. Range: -infinity to +infinity. f(x) = x
• Backpropagation (gradient descent) cannot be used effectively to train the model: the derivative of the function with respect to x is a constant and has no relation to the input x, so it is not possible to work out which input weights would give a better prediction.
• All layers of the network collapse into one: no matter how many layers there are, the last layer is a linear function of the first (a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into a single-layer model.
• A neural network with only linear activation functions is simply a linear regression model, with limited power to handle complex, varying input data.
3. ReLU (Rectified Linear Unit) Activation Function
• The rectified linear unit, or ReLU, is the most widely used activation function. Range: 0 to infinity.
• It converges quickly and is fast to compute.
• Disadvantage - the 'dying ReLU' problem: when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.
• To avoid this, the Leaky ReLU function is used instead of ReLU; its extended range improves performance for negative inputs.
4. Leaky ReLU Activation Function
• Leaky ReLU is a variation of ReLU designed to solve the 'dying ReLU' problem.
• It has a small positive slope in the negative region, so it enables backpropagation even for negative input values.
• Disadvantage: Leaky ReLU does not provide consistent predictions for negative input values.
5. Sigmoid Activation Function
• The sigmoid activation function takes a probabilistic approach to decision making: the output range is 0 to 1, i.e. each neuron's output is normalized.
• Because outputs are squashed into this small range, they can be read as probabilities, which makes decisions easier to interpret.
• The equation for the sigmoid function is f(x) = 1 / (1 + e^(-x))
• Disadvantage: the vanishing gradient problem. Large inputs are compressed into the range 0 to 1, so the derivatives become very small and training no longer gives satisfactory results.
• To avoid this problem, ReLU is used, which does not suffer from this small-derivative problem.
6. Hyperbolic Tangent Activation Function (Tanh)
• Tanh is similar in shape to the sigmoid function, but usually works slightly better.
• It is zero-centred, making it easier to model inputs that are strongly negative, neutral, or strongly positive.
• Range: -1 to 1.
7. Softmax Activation Function
• Softmax is used at the last (output) layer to classify inputs into multiple classes. Range: 0 to 1.
• Softmax assigns a value to each input according to its weight; the outputs are proportional to each other and sum to one, i.e. the probabilities over all classes add up to 1.
• For multi-class classification problems, softmax is used together with the cross-entropy loss.
• Negative inputs are converted into non-negative values by the exponential function.
Parameters
Learning rate
• The learning rate (step size) is a hyper-parameter, typically in the range 0.0 to 1.0.
• It is the amount by which the weights are updated during training.
• The learning rate controls how quickly the model adapts to the problem.
• Smaller learning rates require more training epochs, since each update makes smaller changes to the weights; larger learning rates produce rapid changes and require fewer training epochs.
Bias
• The bias is a node that is always 'on', i.e. its value is set to 1.
• Bias is like the intercept in a linear equation. It is an additional parameter in the neural network that adjusts the output along with the weighted sum of the inputs: Output = sum(W * X) + bias
• Bias is therefore a constant that helps the model fit the given data better.
• Weights decide how strongly the activation function is triggered, whereas the bias is used to shift (delay) the triggering of the activation function.
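A small sketch of both ideas, with illustrative numbers: the neuron output is the weighted sum of the inputs plus the bias, and the learning rate scales how far the weights move in a single update (the gradient used here is an assumed placeholder).

```python
import numpy as np

# Output of one neuron: weighted sum of the inputs plus the bias (illustrative values)
x = np.array([1.0, 0.0])
W = np.array([0.5, 0.6])
bias = 1.0
output = np.dot(W, x) + bias           # sum(W * X) + bias = 1.5

# One gradient-descent step: the learning rate scales how far the weights move
gradient = np.array([0.2, -0.1])       # assumed gradient of the error w.r.t. W
for lr in (0.01, 0.1, 1.0):
    print(lr, W - lr * gradient)       # a larger learning rate makes a larger change
```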
Bias, Variance, Underfitting, Overfitting
➢ Bias: error on the training data.
➢ Variance: error on the test data.
➢ Overfitting: low bias and high variance.
➢ Underfitting: high bias and high variance.
➢ The goal should always be low bias and low variance.
➢ Example: a regression problem, e.g. fitting a linear regression model to create a best-fit line.
Regularization
L2 = weight decay
• A regularization term that penalizes big weights is added to the objective function:
  J_reg(θ) = J(θ) + λ Σ_k θ_k²
• The weight-decay value determines how dominant regularization is during gradient computation.
• A big weight-decay coefficient → a big penalty for big weights.
Dropout
• Randomly drop units (neurons along with their connections) during training.
• Each unit is retained with a fixed probability p, independent of the other units.
• The hyper-parameter p has to be chosen (tuned).
Early stopping
• Use the validation error to decide when to stop training.
• Stop when the monitored quantity has not improved after n subsequent epochs.
• n is called the patience.
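The sketch below shows, with assumed toy values, how the L2 penalty λ Σ_k θ_k² enters the objective and its gradient, and one common ("inverted dropout") way of retaining each unit with probability p.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))          # a weight matrix theta (toy values)

# L2 / weight decay: lambda * sum(theta_k^2) is added to the objective J(theta),
# contributing 2 * lambda * W to its gradient
lam = 1e-3                               # weight-decay coefficient (assumed)
data_loss = 0.42                         # stand-in for J(theta)
total_loss = data_loss + lam * np.sum(W ** 2)
reg_grad = 2 * lam * W                   # added to the data gradient during updates

# Dropout: keep each unit with probability p during training, drop the rest
p = 0.8
activations = rng.random((4, 5))         # outputs of some hidden layer (toy values)
mask = (rng.random(activations.shape) < p) / p   # 'inverted' dropout keeps the expected scale
dropped = activations * mask
print(total_loss, dropped.shape)
```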
Tuning Hyper-parameters
• Grid search and random search, nine trials each, for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x) (g(x) shown in green, h(y) in yellow):
• With grid search, the nine trials test g(x) at only three distinct values.
• With random search, all nine trials explore distinct values of g.
• Both grid and random search try configurations randomly and blindly: the next trial is independent of all the trials done before.
• Bayesian optimization for hyper-parameter tuning:
• Makes a smarter choice for the next trial, to minimize the number of trials.
• Collects the performance at several configurations.
• Makes an inference and decides which configuration to try next.
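A toy sketch of random search: each of nine trials draws an independent configuration, and the one with the lowest validation error is kept. The validation_error function here is only a stand-in; in practice it would train the network with the sampled settings and measure its error.

```python
import random

random.seed(0)

def validation_error(lr, hidden_units):
    # Stand-in for "train the network with these settings and measure validation error".
    return (lr - 0.01) ** 2 + 0.001 * abs(hidden_units - 64)

# Random search: each trial draws an independent configuration, blindly
trials = [(10 ** random.uniform(-4, -1), random.choice([16, 32, 64, 128]))
          for _ in range(9)]
best = min(trials, key=lambda cfg: validation_error(*cfg))
print("best (learning rate, hidden units):", best)
```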
Working of a Neural Network (Recall)
• A neural network is structured like the human brain and consists of artificial neurons, also known as nodes. These nodes are stacked next to each other in three kinds of layers:
❖ input layer
❖ hidden layer(s)
❖ output layer
• Data provides each node with information in the form of inputs. The node multiplies the inputs with (initially random) weights, sums them, and adds a bias. Finally, nonlinear functions, known as activation functions, are applied to determine which neurons fire.
Deep Learning
• Deep learning has a built-in, automatic, multi-stage feature-learning process that learns rich hierarchical representations (i.e. features).
• A deep neural network is a neural network that contains more than one hidden layer between input and output.
• The series of hidden layers between input and output performs feature identification and processing in a series of stages, much like our brain.
• Many algorithms exist for learning the weights of networks with several hidden layers.
Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Hidden Layer 4 → Output Layer
Why is Deep Learning Needed?
(Figure: performance vs. size of data - deep learning keeps improving as the amount of data grows, while traditional ML algorithms plateau.)
o Manually designed features in ML are often over-specified or incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, universal, learnable framework for representing world, visual, and linguistic information.
o It can learn both unsupervised and supervised.
o It enables effective end-to-end joint system learning.
o It can utilize large amounts of training data.
Multi-Layer Neural Network Training (layer-wise)
• Train the first layer, then the next layer, then the next, and so on, finishing with the final layer.
• Each of the (non-output) layers is trained to learn good features that describe what comes from the previous layer.
Deep Learning Process
• A deep neural network learns automatically, without predefined knowledge explicitly coded by the programmers.
• Each layer represents a deeper level of knowledge, i.e. a hierarchy of knowledge.
• Learning occurs in two phases.
• The first phase (forward propagation) applies a nonlinear transformation to the input and creates a statistical model as output.
• The second phase (backpropagation) improves (optimizes) the model using a mathematical method based on derivatives.
• The neural network repeats these two phases hundreds to thousands of times until it reaches (converges to) a tolerable level of accuracy. Each repetition of the two phases is called an iteration.
Deep Learning Process
• Deep learning has a built-in, automatic, multi-stage feature-learning process that learns rich hierarchical representations (i.e. features):
Low-level features → mid-level features → high-level features → trainable classifier → output (e.g. outdoor, indoor)
E.g.
• Image: pixel → edge → texture → motif → part → object
• Text: character → word → word group → clause → sentence → story
Deep Learning Frameworks
1. TensorFlow
2. Keras
3. PyTorch
4. Theano
5. DL4J
6. Caffe
7. Chainer
8. Microsoft CNTK
Deep Learning Applications
• Social network filtering
• Fraud detection
• Image and speech recognition
• Audio recognition
• Computer vision
• Medical image processing
• Bioinformatics
• Customer relationship management
• Driverless cars
• Mobile phones
• Google search engine
• TV, etc.
Types of Deep Learning Algorithms
1. Convolutional Neural Networks (CNNs)
2. Recurrent Neural Networks (RNNs)
3. Long Short-Term Memory Networks (LSTMs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self-Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines (RBMs)
10. Autoencoders
Convolutional Neural Networks (CNNs)
• CNNs (ConvNets) consist of multiple layers.
• Yann LeCun developed the first CNN, called LeNet, in 1988. It was used for recognizing characters such as ZIP codes and digits.
• CNNs are widely used for image processing, such as identifying satellite images, processing medical images, forecasting time series, and object detection tasks such as detecting anomalies.
• Output: binary, multinomial, continuous, or count.
• Input: fixed size; padding can be used to make all images the same size.
• Architecture: the user's choice (depends on the application), which requires experimentation.
• Optimization: backward propagation.
• The hyper-parameters of a very deep model can be estimated properly only with very large amounts of images.
• In practice, architectures and trained parameters already established by researchers (ImageNet models, Microsoft/Google APIs, etc.) are often reused.
Convolutional Neural Networks (CNNs)
Building a convolutional neural network:
• Initialize the weights and biases randomly.
• Fix the input and output.
• Forward propagation: receive input data, process the information, and generate output.
• Backward propagation: calculate the error and gradients, and update the parameters of the network.
Architecture:
• Build a convolutional neural network with hidden layers.
• Each hidden layer may have its own activation function, such as ReLU or sigmoid.
• The final layer uses softmax.
• The error is calculated using cross-entropy.
Convolutional Neural Networks (CNNs) - Working Principle
CNNs have multiple layers that process the input and extract features from the data:
1. Convolution layer
• The convolution layer has several filters that perform the convolution operation.
2. Rectified Linear Unit (ReLU)
• A ReLU layer performs an element-wise operation on the convolution output. The result is a rectified feature map.
3. Pooling layer
• The rectified feature map is then fed into a pooling layer. Pooling is a down-sampling operation that reduces the dimensions of the feature map.
• The resulting two-dimensional arrays from the pooled feature map are then flattened into a single long, continuous, linear vector.
4. Fully connected layer
• The flattened matrix from the pooling layer is fed as input to a fully connected layer, which classifies and identifies the images.
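One possible way to stack these four kinds of layers is sketched below using Keras (one of the frameworks listed earlier); the input size, filter counts, and number of classes are illustrative assumptions, not values from the slides.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution -> ReLU -> pooling -> flatten -> fully connected -> softmax
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                       # e.g. a 28x28 grayscale image (assumed)
    layers.Conv2D(16, kernel_size=3, activation="relu"),   # convolution layer + ReLU
    layers.MaxPooling2D(pool_size=2),                      # down-sampling
    layers.Flatten(),                                      # 2D feature maps -> one long 1D vector
    layers.Dense(64, activation="relu"),                   # fully connected layer
    layers.Dense(10, activation="softmax"),                # class probabilities (10 classes assumed)
])
model.compile(optimizer="adam", loss="categorical_crossentropy")  # cross-entropy, as above
model.summary()
```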
Parameter Computation
(Figure: a fully connected network with 4 inputs x1..x4, one hidden layer a1(1)..a4(1), and a softmax output a1(2) producing Y.)
• Number of parameters: 4*4 + 4 + 1 = 21.
Parameter Computation for an Image
(Figure: a 400 x 400 x 3 image flattened into 480,000 inputs x1..x480000, a hidden layer of 480,000 units a1(1)..a480000(1), and a softmax output Y.)
• Number of parameters: 480000*480000 + 480000 + 1 ≈ 230 billion!
• This is why fully connected layers alone do not scale to images, and convolutional layers are used instead.
Convolutional Layer - Feature Extraction
(Figure: an input image, given as a matrix of normalized pixel intensities, is convolved with the 3x3 filter
 0  1  0
 1 -4  1
 0  1  0
to produce the convolved image.)
▪ Inspired by the neurophysiological experiments conducted by Hubel and Wiesel (1962).
Convolutional Layer Process
• What is convolution? A filter slides over the input image to produce a convolved image (feature map).
• Example: a 4x4 input matrix with entries a..p and a 2x2 filter with weights w1..w4 give feature-map entries
  h1 = f(a*w1 + b*w2 + e*w3 + f*w4)
  h2 = f(b*w1 + c*w2 + f*w3 + g*w4)
  where f is the activation function.
(Figure: an input matrix convolved with a 3x3 filter.)
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
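A minimal NumPy sketch of this sliding-window operation (valid convolution, stride 1, before the activation f is applied); the image and filter values are arbitrary examples, not the a..p and w1..w4 of the slide.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the filter over the image (no padding, stride 1) and take the
    weighted sum at each position, as in h1 and h2 above (activation not included)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)    # stands in for the 4x4 matrix a..p
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])                     # a 2x2 filter w1..w4 (assumed values)
print(convolve2d_valid(image, kernel))              # 3x3 feature map
```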
Convolutional Layer Process
▪ In convolutional neural networks, hidden units are only connected to a local receptive field.
(Figure: filter 1 with weights w1..w4 maps the input image to the layer-1 feature map; filter 2 with weights w5..w8 maps the layer-1 feature map to the layer-2 feature map.)
Pooling
• Max pooling: outputs the maximum value within a rectangular neighbourhood.
• Average pooling: outputs the average value of a rectangular neighbourhood.
• Stride: the number of steps (columns when moving left to right, rows when moving top to bottom) the pooling window moves each time.
• Example: max pooling with a 2x2 filter and a stride of 2.
  Input matrix:        Output matrix:
  1 3 5 3              4 5
  4 2 3 1              3 4
  3 1 1 3
  0 1 0 4
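A short NumPy sketch of max pooling with a 2x2 window and a stride of 2, reproducing the example matrices above.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Take the maximum of each size x size neighbourhood, moving `stride` steps at a time."""
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]], dtype=float)
print(max_pool(x))   # [[4. 5.]
                     #  [3. 4.]]
```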
Fully Connected Layer
• So far, the convolution layers have extracted valuable features from the data. These features are sent to the fully connected layer, which generates the final result.
• The fully connected layer in a CNN is just a traditional neural network.
• The output from the convolution layer is a 2D matrix. Ideally, each row would represent a single input image.
• However, the fully connected layer can only work with 1D data, so the values from the previous operation are first converted into a 1D format (flattened).
• Once the data is a 1D array, it is sent to the fully connected layer. All of these individual values are treated as separate features that represent the image.
• The fully connected layer performs two operations on the incoming data: a linear transformation (the weighted sum plus bias) and a non-linear transformation (the activation function), which lets it capture complex relationships.
CNN - Backward Propagation
• During forward propagation, we randomly initialized the weights, biases, and filters.
• These values are treated as the parameters of the convolutional neural network.
• In the backward propagation process, the model tries to update the parameters so that the overall predictions become more accurate.
• To update these parameters, we use the gradient descent technique.
Gradient descent to minimize (optimize) the loss (error):
• Consider the curve of a loss function over a parameter a.
• Suppose random initialization gives the parameter the value a2, while the minimum of the loss is at a1, not a2.
• Gradient descent tries to find the value of the parameter a at which the loss is minimum, so we need to update a2 and bring it closer to a1.
• To decide the direction of movement, i.e. whether to increase or decrease the value of the parameter, we calculate the gradient (slope) at the current point.
• Based on the value of the gradient, we determine the updated parameter values: when the slope is negative, the value of the parameter is increased; when the slope is positive, the value of the parameter is decreased by a small amount.
Generic equation for updating the parameter values:
• new_parameter = old_parameter - (learning_rate * gradient_of_parameter)
• The learning rate is a constant that controls the amount of change made to the old value. The slope (gradient) determines the direction of the new value, i.e. whether it should be increased or decreased. So we need to find the gradients, i.e. the change in error with respect to the parameters, in order to update them.
CNN - Backward Propagation
• A CNN contains three kinds of parameters: weights, biases, and filters. Let us calculate the gradients for these parameters one by one.
Backward propagation at the fully connected layer
• The fully connected layer has two parameters: the weight matrix and the bias matrix.
• Start by calculating the change in error with respect to the weights, ∂E/∂W.
• Since the error does not depend directly on the weight matrix, we use the chain rule. The computation graph gives:
  ∂E/∂W = ∂E/∂O . ∂O/∂Z2 . ∂Z2/∂W
Find these derivatives separately:
1. Change in error with respect to the output
• Suppose the actual value for the data is denoted y' and the predicted output is O.
• Then the error is: E = (y' - O)² / 2
• Differentiating the error with respect to the output: ∂E/∂O = -(y' - O)
2. Change in output with respect to Z2 (the linear-transformation output)
• To find the derivative of the output O with respect to Z2, we first express O in terms of Z2. The output is the sigmoid of Z2, so ∂O/∂Z2 is the derivative of the sigmoid function f(x) = 1 / (1 + e^(-x)):
  f'(x) = (1 + e^(-x))^(-1) [1 - (1 + e^(-x))^(-1)] = sigmoid(x) (1 - sigmoid(x))
  ∂O/∂Z2 = O (1 - O)
CNN - Backward Propagation
3. Change in Z2 with respect to W (weights)
• Z2 is the result of the linear transformation. In terms of the weights: Z2 = W^T.A1 + b
• Differentiating Z2 with respect to W gives A1 itself: ∂Z2/∂W = A1
• Now that we have the individual derivatives, we use the chain rule to find the change in error with respect to the weights:
  ∂E/∂W = ∂E/∂O . ∂O/∂Z2 . ∂Z2/∂W = -(y' - O) . O(1 - O) . A1
• The shape of ∂E/∂W is the same as that of the weight matrix W.
• We can then update the values in the weight matrix: W_new = W_old - lr * ∂E/∂W
4. Change in error with respect to b (bias)
• The bias is updated analogously: b_new = b_old - lr * ∂E/∂b
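Putting the fully connected layer's derivatives together, here is a minimal NumPy sketch of one backward step; the shapes, input values, and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A1 = rng.random((4, 1))            # output of the previous layer (illustrative shape)
W = rng.standard_normal((4, 1))    # weights of the fully connected layer
b = 0.0
y_true = 1.0                       # ground-truth value y'

# Forward: Z2 = W^T . A1 + b, O = sigmoid(Z2)
Z2 = W.T @ A1 + b
O = sigmoid(Z2)

# Backward, following the chain rule derived above
dE_dO = -(y_true - O)              # ∂E/∂O
dO_dZ2 = O * (1 - O)               # ∂O/∂Z2 (sigmoid derivative)
dE_dZ2 = dE_dO * dO_dZ2
dE_dW = A1 @ dE_dZ2                # ∂E/∂W, same shape as W
dE_db = dE_dZ2                     # ∂E/∂b

lr = 0.01
W_new = W - lr * dE_dW             # W_new = W_old - lr * ∂E/∂W
b_new = b - lr * dE_db.item()      # b_new = b_old - lr * ∂E/∂b
```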
CNN - Backward Propagation
Backward propagation at the convolution layer
• For the convolution layer, the filter matrix is the parameter. During forward propagation we randomly initialized the filter matrix.
• We update these values using: new_parameter = old_parameter - (learning_rate * gradient_of_parameter)
• To update the filter matrix, we need the gradient of the error with respect to the filter, ∂E/∂f.
• From the computation graph for backward propagation:
  ∂E/∂f = ∂E/∂O . ∂O/∂Z2 . ∂Z2/∂A1 . ∂A1/∂Z1 . ∂Z1/∂f
• We have already determined the values of ∂E/∂O and ∂O/∂Z2; the remaining derivatives are found below.
1. Change in Z2 with respect to A1
• To find ∂Z2/∂A1, write Z2 in terms of A1: Z2 = W^T.A1 + b
• Differentiating this equation with respect to A1 gives W^T: ∂Z2/∂A1 = W^T
2. Change in A1 with respect to Z1
• The next value we need is ∂A1/∂Z1, where A1 = sigmoid(Z1).
• This is simply the sigmoid function again, so its derivative is: ∂A1/∂Z1 = (A1)(1 - A1)
CNN - Backward Propagation
Backward propagation at the convolution layer (continued)
3. Change in Z1 with respect to the filter f
• Finally, we need ∂Z1/∂f. The equation for Z1 is: Z1 = X * f
• Differentiating Z1 with respect to f simply gives us X: ∂Z1/∂f = X
• Now that we have all the required values, the overall change in error with respect to the filter is:
  ∂E/∂f = ∂E/∂O . ∂O/∂Z2 . ∂Z2/∂A1 . ∂A1/∂Z1 * ∂Z1/∂f
• Notice that in the equation above, the term (∂E/∂O . ∂O/∂Z2 . ∂Z2/∂A1 . ∂A1/∂Z1) is convolved with ∂Z1/∂f instead of using a simple dot product.
• Why? Because during forward propagation we perform a convolution operation between the images and the filters, and this operation is mirrored in the backward propagation process.
• Once we have the value of ∂E/∂f, we use it to update the original filter value: f = f - lr * (∂E/∂f)
• This completes backpropagation for convolutional neural networks.
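A minimal NumPy sketch of this filter update, reusing the same valid-convolution helper as in the earlier convolution sketch; the input, filter, and upstream gradient ∂E/∂Z1 are assumed toy values rather than quantities from the slides.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
X = rng.random((4, 4))                    # input image patch (illustrative)
f = rng.standard_normal((2, 2))           # filter to be learned

Z1 = convolve2d_valid(X, f)               # forward: Z1 = X * f
dE_dZ1 = rng.standard_normal(Z1.shape)    # upstream gradient, assumed already computed

# ∂E/∂f: the input is convolved with the upstream gradient (same sliding-window operation)
dE_df = convolve2d_valid(X, dE_dZ1)       # same shape as the filter f

lr = 0.01
f = f - lr * dE_df                        # f = f - lr * (∂E/∂f)
```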
Recall - Deep Learning Process
• Deep learning has a built-in, automatic, multi-stage feature-learning process that learns rich hierarchical representations (i.e. features):
Low-level features → mid-level features → high-level features → trainable classifier → output (e.g. outdoor, indoor)
E.g.
• Image: pixel → edge → texture → motif → part → object
• Text: character → word → word group → clause → sentence → story
Top 7 Applications of Convolutional Neural Networks
1. Decoding facial recognition
Facial recognition is broken down by a convolutional neural network into the following major components:
• identifying every face in the picture;
• focusing on each face despite external factors such as light, angle, and pose;
• identifying unique features;
• comparing all the collected data with existing data in the database to match a face with a name.
A similar process is followed for scene labeling as well.
2. Analyzing documents
Convolutional neural networks can also be used for document analysis. This is not just useful for handwriting analysis; it also plays a major role in recognizers. For a machine to scan an individual's writing and compare it to its wide database, it must execute almost a million commands a minute. With CNNs and newer models and algorithms, the error rate has reportedly been brought down to as low as 0.4% at the character level, though complete testing of this is yet to be widely seen.
3. Historic and environmental collections
CNNs are also used for more complex purposes such as natural history collections. These collections act as key players in documenting major parts of history such as biodiversity, evolution, habitat loss, biological invasion, and climate change.
4. Understanding climate
CNNs can play a major role in the fight against climate change, especially in understanding why we see such drastic changes and how we could experiment with curbing their effects. The data in such natural history collections can also provide greater social and scientific insights, but this requires skilled human resources, such as researchers who can physically visit these repositories; more manpower is needed to carry out deeper experiments in this field.
5. Grey areas
Introducing grey areas into CNNs is expected to provide a much more realistic picture of the real world. Currently, CNNs largely function like a machine, seeing a true or false value for every question. However, as humans, we understand that the real world plays out in a thousand shades of grey. Allowing the machine to understand and process fuzzier logic will help it understand the grey area we humans live in, giving CNNs a more holistic view of what humans see.
6. Advertising
CNNs have already made a world of difference to advertising with the introduction of programmatic buying and data-driven personalized advertising.
7. Other interesting fields
CNNs are poised to be the future with their introduction into driverless cars, robots that can mimic human behavior, aides to human genome mapping projects, predicting earthquakes and natural disasters, and maybe even self-diagnosis of medical problems. You might not even have to drive to a clinic or schedule an appointment with a doctor to confirm that your sneezing attack or high fever is just the simple flu and not a symptom of some rare disease. One problem researchers are working on with CNNs is brain cancer detection; earlier detection of brain cancer could prove to be a big step in saving more lives affected by this illness.
References
• Richard Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010
• Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016