INTRODUCTION TO
CONVOLUTIONAL NEURAL
NETWORK
Dr. Anindya Halder
Associate Professor, Cotton University
Introduction
• In the previous slides we learned the basics of deep neural networks, their types, and their use cases.
• In this section we will study one such architecture, the Convolutional Neural Network (CNN).
What is CNN ?
• A Convolutional Neural Network, also known as a CNN or ConvNet, is a type of feed-forward neural network that specializes in processing data with a grid-like topology, such as an image.
• A digital image is a representation of visual data: a series of pixels arranged in a grid-like fashion, each holding a pixel value.
• Because of this grid-like representation, CNNs are widely used for image classification.
• The architecture of CNN is designed to take
advantage of the 2D structure of an input
image.
• A basic CNN is composed of one or more convolution layers (often each followed by a pooling step), followed by one or more fully connected layers, as in a standard multilayer neural network.
Motivation behind CNN ?
• Consider an image of size 200x200x3 (200 wide, 200 high, 3 color channels)
A single fully-connected neuron in the first hidden layer of a regular neural network would have 200x200x3 = 120,000 weights.
Due to the presence of several such neurons, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
• In a CNN, however, the neurons in a layer are connected only to a small region of the layer before it (discussed later) instead of to all the neurons in a fully connected manner.
The final output layer would have dimensions 1x1xN, because by the end of the CNN
architecture we will reduce the full image into a single vector of class scores (for N classes),
arranged along the depth dimension.
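As a rough back-of-the-envelope comparison (the 5x5 filter size below is chosen purely for illustration):

```python
# One dense neuron vs. one small convolution filter for a 200x200x3 input.
fully_connected_weights = 200 * 200 * 3   # a single fully-connected neuron: 120,000 weights
conv_filter_weights = 5 * 5 * 3           # a single 5x5 filter spanning 3 channels: 75 weights
print(fully_connected_weights, conv_filter_weights)   # 120000 75
```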
MLP vs CNN ?
Multi-layered perceptron: all layers are fully
connected
Convolutional Neural Network: its convolution layers are only partially (locally) connected
MLP vs CNN ?
Multi-layered perceptron: a regular 3-layer
neural network
Convolutional Neural Network: arranges its neurons in 3 dimensions (width, height, depth), as visualized in the figure.
Because of this 3-D arrangement of neurons, a CNN is well adapted to the properties of images:
• Pixel position and neighborhood have semantic meanings
• Elements of interest can appear anywhere in the image
How CNN works – What computer sees
• For example, a CNN can take an image and classify it as an ‘X’ or an ‘O’.
• In a simple case, the ‘X’ would look like this:
• But what about a trickier case?
• Since the pattern does not match exactly pixel by pixel, a naive pixel-by-pixel comparison cannot classify this as an ‘X’. Using a CNN, we can overcome this issue by matching local patterns (features) rather than exact pixel positions.
CNN layers
• A CNN consists of four basic layer types:
• Convolutional layer (CONV) will compute the output of neurons that are connected to local
regions in the input, each computing a dot product between their weights and a small region
they are connected to in the input volume.
• ReLU (already discussed in the ANN section) layer will apply an elementwise activation function, such as the max(0, x) thresholding at zero. This leaves the size of the volume unchanged and introduces non-linearity into the network.
• Pooling (POOL) layer will perform a down-sampling operation along the spatial dimensions (width, height). A dropout layer (discussed later) is often added as well, for regularization rather than down-sampling.
• Fully-connected layer (FC) will compute the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to a class score for one of the N categories.
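As a minimal sketch (not the exact architecture from these slides), the four layer types can be stacked in PyTorch as follows, assuming a 32x32 RGB input, 16 filters and N = 10 classes, all chosen for illustration:

```python
import torch
import torch.nn as nn

# CONV -> RELU -> POOL -> FC, with illustrative sizes.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # CONV
    nn.ReLU(),                                                            # RELU
    nn.MaxPool2d(kernel_size=2),                                          # POOL: 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                                          # FC: N = 10 class scores
)

scores = model(torch.randn(1, 3, 32, 32))   # one dummy RGB image
print(scores.shape)                          # torch.Size([1, 10]) -> a 1x1xN volume of scores
```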
Convolutional Layer
• The convolution layer (CONV) uses filters that perform convolution operations as they scan the input I along its dimensions. Its hyperparameters include the filter size F and the stride S. The resulting output O is called a feature map or activation map.
• The convolution layer works to identify patterns (features) rather than individual pixels.
• The role of the ConvNet is to reduce the image into a form that is easier to process, without losing the features that are critical for a good prediction.
What is Convolution
operation?
• Mathematically, convolution is the summation
of the element-wise product of 2 matrices (input
image and filter).
• Let us consider an image ‘X’ & a filter ‘Y’ (more about filters will be covered later). Both X & Y are matrices (image X being expressed as a matrix of pixel values). When we convolve the image ‘X’ with the filter ‘Y’, we produce an output matrix, say ‘Z’.
• Finally, we compute the sum of all the elements in ‘Z’ to get a scalar number.
Figure: convolution operation between image X and kernel Y.
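A minimal NumPy sketch of this operation at a single position, using an arbitrary 3x3 patch and filter:

```python
import numpy as np

X_patch = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1]])   # a 3x3 patch of image X (values chosen for illustration)
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])         # the 3x3 filter Y

Z = X_patch * Y                   # element-wise product
print(Z.sum())                    # sum of all elements of Z -> the scalar 5
```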
Convolutional Layer - Filters/Kernels
• A filter provides a measure for how close a patch or a region of the input resembles a feature. A
feature may be any prominent aspect – a vertical edge, a horizontal edge, an arch, a diagonal,
etc.
• A filter acts as a single template or pattern, which, when convolved across the input, finds
similarities between the stored template & different locations/regions in the input image.
• To perform convolution operation, slide the filter over the width and height of the input image
and perform summation of the element-wise product.
• If the input image size is ‘n x n’ & the filter size is ‘f x f’:
• Output size = (n – f + 1) x (n – f + 1)
• For a 5x5 image and a 3x3 filter: Output size = (5-3+1) x (5-3+1) = 3x3
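A short NumPy sketch of the full sliding operation for a square image and filter; it reproduces the (n – f + 1) output size above:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the filter over the image and sum the element-wise products.
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.random.rand(5, 5)    # n = 5
kernel = np.random.rand(3, 3)   # f = 3
print(convolve2d_valid(image, kernel).shape)   # (3, 3) = (n - f + 1) x (n - f + 1)
```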
Filter hyperparameters - Padding
• Sometimes it is convenient to pad the input volume with zeros around the border.
• Zero padding allows us to preserve the spatial size of the output volume.
• Why do we do padding?
• Every time we apply a convolution operator, the image shrinks, so repeated convolutions lose information at the borders; this is one of the downsides of convolution.
• To fix this, we can ‘pad’ the image.
Figure: one-pixel zero padding on a 5x5 image.
• Let p be the padding. In this example, p = 1 because we padded all around the input image with an extra border of 1 pixel.
• Output size = (n + 2p - f + 1) x (n + 2p - f + 1),
where n is the image dimension, p is the padding and f is the filter size.
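A quick check that padding preserves the output size, using NumPy's np.pad with the n = 5, f = 3, p = 1 values from this example:

```python
import numpy as np

n, f, p = 5, 3, 1
image = np.random.rand(n, n)
padded = np.pad(image, pad_width=p)   # p pixels of zeros on every side -> 7x7

out_size = n + 2 * p - f + 1          # (n + 2p - f + 1)
print(padded.shape, out_size)         # (7, 7) 5 -> the output stays 5x5
```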
Types of Padding
• There are two common choices for padding: Valid convolutions & the Same convolutions.
a) Valid convolutions - this means no padding. Convolving an (n x n) image with an (f x f) filter gives an (n - f + 1) x (n - f + 1) output.
b) Same convolutions - here the padding is chosen so that the output size equals the input image size. When we pad by ‘p’ pixels, the padded input is (n + 2p) x (n + 2p) and the output becomes (n + 2p - f + 1) x (n + 2p - f + 1).
The amount of padding should therefore be chosen so that the output image after convolution matches the size of the input image.
Let n x n = original input image size, p = padding
(n + 2p) x (n + 2p) = size of the padded input image
(n + 2p - f + 1) x (n + 2p - f + 1) = size of the output image after convolving the padded image
To avoid shrinkage of the original input image, we set n + 2p - f + 1 = n, which gives p = (f - 1)/2.
With this choice, the output size after convolving the padded image equals the original input image size.
How is the Filter Size Decided?
• By convention, the value of ‘f,’ i.e., filter size, is usually odd in computer vision. This might be
because of 2 reasons:
• If the value of ‘f’ is even, we may need asymmetric padding (according to the previous slide). Say the filter size ‘f’ is 6; the padding equation then gives a padding of 2.5 pixels, which does not make sense.
Let n x n = 10 x 10 = original input image size, p = padding and f = 6.
We want the output image to match the input: (10 + 2p - 6 + 1) x (10 + 2p - 6 + 1) = 10 x 10,
which gives p = 2.5, i.e. not a whole number of pixels.
• The second reason for choosing an odd-sized filter such as 3×3 or 5×5 is that it has a central position, which is convenient to use as a reference (distinguisher).
Filter hyperparameters - Stride
• For a convolutional or a pooling operation, the stride S denotes the number of pixels by
which the window moves after each operation.
• In simple words the stride indicates the pace by which the filter moves horizontally &
vertically over the pixels of the input image during convolution.
• Let n x n = original input image size, p = padding, f = filter size and s = stride.
Output image size = [{(n + 2p - f) / s} + 1] x [{(n + 2p - f) / s} + 1], where the division is rounded down when s does not divide evenly.
Figure: convolution operation with stride length S = 2.
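A small helper evaluating this formula (note the floor division when the stride does not divide evenly); the numbers are illustrative:

```python
def conv_output_size(n, f, p=0, s=1):
    # Spatial output size of a convolution: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(n=7, f=3, p=0, s=2))   # 3
print(conv_output_size(n=5, f=3, p=1, s=1))   # 5 ('same' convolution)
```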
Convolutions over RGB images
• Consider an RGB image of size 6×6. Since it is an RGB image, its dimension is 6x6x3, where the 3 corresponds to the three color channels: Red, Green & Blue. We can imagine this as a 3-D volume: a stack of three 6×6 slices.
• For 3-D images, we need 3D filters, i.e., the filter itself will also have three layers
corresponding to the red, green & blue channels, like that of the input RGB image.
Convolution over volume
• We first place the 3x3x3 filter in the upper-left-most position, just as in 2-D. This filter has 27 numbers (9 parameters in each of the 3 channels).
• We take each of these 27 numbers & multiply them with the corresponding numbers from the image’s red, green & blue channels.
• Then we add up all those products & this gives us the first number in the output image.
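A NumPy sketch of this first step, with random values standing in for the image and the filter:

```python
import numpy as np

image = np.random.rand(6, 6, 3)      # 6x6 RGB image (height, width, channels)
filt = np.random.rand(3, 3, 3)       # 3x3x3 filter: 27 weights in total

patch = image[0:3, 0:3, :]           # upper-left 3x3 region across all 3 channels
first_output = np.sum(patch * filt)  # multiply the 27 pairs of numbers and add them up
print(first_output)                  # the first number of the output feature map
```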
How Convolutions over RGB images work
Multiple Filters for Multiple Features
• We can use multiple filters to detect various features simultaneously.
• Let us consider the following example, in which we want to detect a vertical edge & a curve in the input RGB image.
• We will have to use two different filters for this task, and the output will thus have two feature maps.
Convolution using multiple filters
• Let us understand the dimensions mathematically: an (n x n x 3) input convolved with K filters of size (f x f x 3), with no padding and stride 1, gives an output of size (n - f + 1) x (n - f + 1) x K; in this example, 4 x 4 x 2.
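These dimensions can be checked with PyTorch, using a 6x6 RGB input and two 3x3 filters as in the example above:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)  # two filters, each 3x3x3
rgb = torch.randn(1, 3, 6, 6)                                   # one 6x6 RGB image
out = conv(rgb)
print(out.shape)   # torch.Size([1, 2, 4, 4]) -> two 4x4 feature maps (6 - 3 + 1 = 4)
```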
Some important concepts
• The filters are learned during training (i.e., during backpropagation). Hence, the individual values of the filters are often called the weights of the CNN.
• A neuron is a filter whose weights are learned during training. E.g., a (3,3,3) filter (or neuron) has 27 units. Each neuron looks at a particular region of the input (i.e., its ‘receptive field’).
• A feature map is a collection of multiple neurons, each looking at a different input region with the same weights.
• All neurons in a feature map extract the same feature (but from different input regions). It is called a ‘feature map’ because it maps where a particular feature is found in the image.
ReLU Layer
• ReLU is a piecewise linear function that will output the input
directly if it is positive, otherwise, it will output zero.
• The main catch here is that the ReLU function does not activate all
the neurons at the same time.
• Mathematically it can be represented as f(x) = max(0, x).
• The derivative of the function is 1 for x > 0 and 0 for x < 0 (at x = 0 it is conventionally taken as 0).
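A minimal NumPy version of the function and its derivative:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)           # f(x) = max(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)      # 1 where x > 0, 0 elsewhere

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))              # [0.  0.  0.  1.5 3. ]
print(relu_derivative(x))   # [0. 0. 0. 1. 1.]
```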
Pooling Layer
• A pooling layer is another essential building block of CNN. It tries to figure out whether a
particular region in the image has the feature we are interested in or not.
• The pooling layer (POOL) is a down-sampling operation, typically applied after a convolution layer, which provides some spatial invariance.
• The two most popular aggregate functions used in pooling are ‘max’ & ‘average’:
a) Max pooling – if any patch says something firmly about the presence of a particular feature, the pooling layer counts that feature as ‘detected’. It preserves detected features and is the most commonly used.
b) Average pooling – if one patch responds strongly but the others disagree, average pooling takes the mean of the patch. It smoothly down-samples the feature map and was used in LeNet.
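A minimal NumPy sketch of 2x2 max pooling with stride 2, applied to an arbitrary 4x4 feature map:

```python
import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2 (assumes even height and width).
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 3, 2],
               [2, 2, 1, 0]])
print(max_pool_2x2(fm))   # [[4 5]
                          #  [2 3]]
```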
Pooling Layer – Advantage and Disadvantage
• Advantages
• Pooling makes the representation more compact by reducing the spatial size of the feature maps, thereby reducing the amount of computation and the number of parameters to be learnt in subsequent layers.
• Pooling reduces only the height & width of the feature map, not the number of channels.
• Disadvantage
• Pooling also loses a lot of information, which is often considered a potential disadvantage
Dropout Layer
• Large neural nets trained on relatively small datasets can overfit the training data, which results in poor performance when the model is evaluated on new data.
• Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.
• During training, a fraction of the layer’s outputs are randomly ignored, or “dropped out,” in this layer.
• Dropout has the effect of making the training
process noisy, forcing nodes within a layer to
probabilistically take on more or less responsibility
for the inputs.
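A short PyTorch illustration of dropout's behaviour during training versus evaluation (p = 0.5 is chosen for illustration):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each unit is zeroed with probability 0.5 during training
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the values are zeroed; survivors are scaled by 1 / (1 - p) = 2
drop.eval()
print(drop(x))   # at evaluation time dropout does nothing: all ones
```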
Fully connected Layer
• Fully connected layers are the normal flat
feed-forward neural network layers.
• These layers may use a non-linear activation function, most commonly softmax at the output, in order to predict classes.
• To feed them, we first arrange all the output 2-D feature maps into a single 1-D array (flattening).
Fully connected Layer
• A weighted sum of the inputs at each output node determines the final prediction, exactly as in a standard feed-forward network.
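A minimal sketch of such a classification head in PyTorch, assuming 16 feature maps of size 4x4 and N = 10 classes (sizes chosen for illustration):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),                 # 16 maps of 4x4 -> a 1-D vector of 256 values
    nn.Linear(16 * 4 * 4, 10),    # weighted sums of inputs: one score per class
    nn.Softmax(dim=1),            # turn the scores into class probabilities
)

feature_maps = torch.randn(1, 16, 4, 4)
print(head(feature_maps).sum())   # the probabilities sum to 1
```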
Understanding the complexity of the CNN
• In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is counted as follows:
• A CONV layer with C_in input channels and C_out filters of size f x f has (f x f x C_in + 1) x C_out parameters (the +1 accounts for each filter’s bias).
• A POOL layer has no learnable parameters.
• An FC layer mapping N_in inputs to N_out outputs has (N_in + 1) x N_out parameters.
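These counts can be checked with two small helpers (the layer sizes below are illustrative):

```python
def conv_params(f, c_in, c_out):
    return (f * f * c_in + 1) * c_out   # +1 for each filter's bias

def fc_params(n_in, n_out):
    return (n_in + 1) * n_out           # +1 for each output unit's bias

print(conv_params(f=3, c_in=3, c_out=16))   # 448
print(fc_params(n_in=4096, n_out=10))       # 40970
```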
How image recognition works with CNN ?
• So far we have seen the different components of a CNN. Now let us see how these components work together to identify an image of a bird.
Different CNN Architectures
• Various CNN architectures have been key in building the algorithms that power, and will continue to power, AI in the foreseeable future. Some of them are listed below:
Summary
• In this section we learned:
• The basics of CNNs
• How a CNN differs from other ML algorithms
• The layers of a CNN
• How a CNN classifies/recognizes images
• Different CNN architectures
• In the next section we will learn about Recurrent Neural Networks.
Thank You