MACHINE LEARNING BASED OBJECT
RECOGNITION USING CIFAR-10
Made by:
• 2020UEC2528 MAYANK DEWAN
• 2020UEC2539 RUDRAKSH SHARMA
• 2020UEC2559 GAURAV RAJORA
Under the supervision of:
Dr. Satya Prakash Singh
CONTENTS
• Introduction
• Literature Survey
• Gaps Identified
• Objectives Of The Project
• Problem Statement
• Solution Methodology
• Results
• Future Work To Be Done
• Expected Outcomes Of The Work
Introduction
Preschool education, also known as early childhood education or kindergarten, is typically the first stage of formal education
that children receive. Its primary focus is on preparing young learners for a smooth transition into more structured and formal
education. It holds immense significance in a child's development, providing a strong foundation for lifelong learning and
success.
Key aspects of education in kindergarten include:
• Introduction to Academic Basics: Kindergarten introduces children to basic academic concepts such as letters, numbers,
shapes, and colours. It aims to familiarize them with the fundamental building blocks of reading, writing, and math.
• Language and Literacy Development: Kindergarteners are exposed to language development activities that include
vocabulary building, listening skills, and basic phonics. They may begin to recognize and write letters and words.
• Social and Emotional Skills: Kindergarten is a significant stage for the development of social and emotional skills. Children
learn how to interact with their peers, share, take turns, and express their feelings in appropriate ways.
Machine learning can be applied in various ways to enhance and support preschool learning, benefiting both educators and
young learners. One such application is image classification, which can help young learners recognise and learn about everyday objects such as a scale,
pencil, ball, newspaper, books, animals and pets. User-friendly interfaces and colourful visuals can aid learning
and make it interactive and easy.
Object Recognition is a computer vision task that involves identifying and classifying objects within an image or a video
stream. The goal of object recognition is to determine what objects are present in the scene and assign them to predefined
categories or classes. It outputs the names or labels of the objects present in the image or video.
Object Detection, on the other hand, is a related computer vision task that goes beyond object recognition. Object detection
not only identifies the objects within an image or video but also provides information about their location or spatial extent in
the image. It outputs both the object labels and the coordinates of bounding boxes around each detected object.
Image classification using the CIFAR-10 dataset is a common computer vision task. CIFAR-10 is a dataset containing 60,000
32x32 colour images in 10 different classes, with 6,000 images per class. The objective is to train a machine learning or deep
learning model to classify these images into their respective categories.
Convolutional Neural Networks (CNNs) are a powerful and widely used architecture for image classification tasks involving
the CIFAR-10 dataset. CNNs automatically learn to extract hierarchical features from images. They use convolutional layers to
detect edges, textures, and more complex patterns in the images, which is essential for recognizing objects in CIFAR-10. CNNs
are designed to capture spatial hierarchies in images. Lower layers detect basic features like edges, while higher layers combine
these features to recognize more complex objects or patterns. CNNs use weight sharing in convolutional layers, which
significantly reduces the number of parameters in the model. This makes it feasible to train deep networks without an
impractical number of parameters. CNNs exhibit translation invariance, meaning they can recognize patterns regardless of their
position in the image. This is crucial for classifying objects that can appear anywhere in a picture. Data augmentation
techniques, such as image rotation, scaling, and flipping, can be easily applied to image data, increasing the diversity of the
training dataset and improving model generalization.
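As a minimal sketch of the flipping augmentation mentioned above, the following Python snippet mirrors a tiny image stored as a list of pixel rows. The function names `horizontal_flip` and `random_flip` and the 2x3 toy image are illustrative assumptions, not part of the project code; a real pipeline would usually rely on a deep learning library's augmentation utilities.

```python
import random


def horizontal_flip(image):
    """Mirror an image left-to-right; the image is a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def random_flip(image, probability=0.5, rng=random):
    """Flip the image with the given probability, otherwise return a copy."""
    if rng.random() < probability:
        return horizontal_flip(image)
    return [list(row) for row in image]


# Hypothetical 2x3 single-channel image used only for illustration.
tiny_image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(horizontal_flip(tiny_image))  # [[3, 2, 1], [6, 5, 4]]
```

Each flipped copy keeps the original label, so the training set gains variety without any new labelling effort.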
Literature Survey
A large number of articles and research papers, including IEEE research papers, were studied to gather information on convolutional neural
network techniques. Different image recognition models and libraries were also
explored on Kaggle to decide on the model best suited to our work. Datasets such as CIFAR-10 and CIFAR-100 and models such as VGG16 were studied
thoroughly to compare their advantages and disadvantages. Different CNN models were considered, and
ResNet-50 was selected for use with CIFAR-10.
MODELS REFERRED
1. "Learning Multiple Layers of Features from Tiny Images"
• Author: Alex Krizhevsky
• Description: This technical report introduced the CIFAR-10 dataset and described training multi-layer generative models on
tiny colour images. It laid part of the foundation for using deep learning in image classification.
• Abstract: The report focuses on learning multiple layers of features from small images and introduces the CIFAR-10 dataset,
which has since become a benchmark for evaluating image classification models. The CIFAR-10 dataset comprises 60,000 32x32 colour images in ten different
classes, making it a suitable choice for testing deep learning architectures.
2. "Very Deep Convolutional Networks for Large-Scale Image Recognition"
• Authors: Karen Simonyan, Andrew Zisserman
• Description: This paper introduced the VGGNet architecture, which significantly deepened the networks used for image
classification. It has had a major influence on the design of deep CNNs for CIFAR-10 and other datasets.
• Abstract: The paper focuses on the development of deep convolutional neural network (CNN) architectures and their
application to large-scale image recognition. The authors present a network called VGGNet (named after the Visual Geometry Group),
characterized by its depth and uniformity, which makes it effective for various computer vision tasks, including image
classification.
3. "Deep Residual Learning for Image Recognition"
• Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al.
• Description: This paper introduced ResNet, a deep neural network architecture with residual connections. ResNet
revolutionized image classification by enabling the training of very deep networks, leading to improved performance on
CIFAR-10 and other datasets.
• Abstract: The paper addresses the problem of training very deep neural networks, which are prone to vanishing or exploding
gradients during training. To mitigate these issues, the authors propose the use of residual connections, which allow for the
training of networks with hundreds or even thousands of layers. The resulting architecture is called a Residual Network, or
ResNet.
Gaps Identified
Preschool education plays a critical role in a child's early development and lays the foundation for their future academic and
social success. However, like any educational system, preschooling has gaps or shortcomings that can be identified and
analysed. Here are some common gaps and shortcomings in preschool education:
1. Classroom Size: The student-to-teacher ratio is a critical factor in the quality of early education. Statistics can show if
classrooms are overcrowded, which can negatively impact individualized attention and learning outcomes.
2. Quality Of Teaching: Many preschools lack trained teachers and age-appropriate curriculum, leading to variations in the
educational experience provided to young children.
3. Access Disparities: Many children, especially those from disadvantaged backgrounds, lack access to quality preschool
programs. Limited availability of preschools and financial barriers can result in unequal access to early education.
4. Curriculum Alignment: The curriculum in some preschool programs may not be developmentally appropriate or aligned
with best practices in early childhood education, impacting children's readiness for kindergarten.
There is also a lack of a proper ML model that focuses on improving elementary education by using object classification
on datasets.
Objectives Of The Project
To make an ML model which can classify/recognise objects using a pre-trained model.
To make our model more efficient and applicable in real life.
To increase the accuracy of the model by using the ResNet50 CNN model.
Problem Statement
• Preschool education plays a crucial role in the cognitive, social, and emotional development of young children. However, it often faces
challenges related to accessibility, quality, and individualized learning experiences. An adequate number of teachers and innovative
teaching methodologies have also become difficult to find. According to various reports, the gross enrolment ratio (GER) for preschool
education in India is low, with significant disparities between urban and rural areas.
• The quality of preschool education is a major concern. Many preschools lack trained teachers and age-appropriate curriculum, leading to
variations in the educational experience provided to young children. Insufficient infrastructure, including the lack of proper facilities,
materials, and play equipment in preschools, affects the quality of education provided.
• Teachers are not able to focus on every child due to a lack of time and the increased number of children in a class.
• Therefore, there is a need for innovation. To address these issues, there is a growing interest in leveraging technology, specifically
machine learning, to enhance and personalize preschool education.
Methodology
Understanding the CNN
A CNN (convolutional neural network) is a type of neural network. Neural networks are a subset of machine learning, and they are at the heart of deep learning algorithms.
They are composed of node layers, containing an input layer, one or more hidden layers, and an output layer. Each node connects to another
and has an associated weight and threshold.
a). Convolution Layer: The convolutional layer is the core building block of a CNN, and it is where most of the computation occurs. It requires
a few components, which are input data, a filter, and a feature map. Let’s assume that the given input is a colour image, which is made up of
a matrix of pixels in 3D. This means the input has 3 dimensions: height, width, and depth, where the depth corresponds to the RGB channels of the image. We
also have a feature detector, also known as a kernel or a filter, which moves across the receptive fields of the image [2], checking if the
feature is present. This process is known as a convolution.
• The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or a convolved
feature. The weights in the feature detector remain fixed as it moves across the image, which is also known as parameter sharing.
• A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a
patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single
2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias.
Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some
specific centred structure, for which we expect completely different features to be learned at different spatial locations. One practical
example is when the inputs are faces that have been centred in the image: we might expect different eye-specific or hair-specific features to
be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a
"locally connected layer".
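The effect of parameter sharing on model size can be illustrated with a short parameter count. This is a sketch under assumed values (a 3x3 kernel, 3 input channels, 64 filters, and a 32x32 output grid); these numbers and the function names are chosen for illustration and are not taken from the project.

```python
def shared_conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Parameters of a convolutional layer: one weight set plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * num_filters


def locally_connected_params(kernel_h, kernel_w, in_channels, num_filters, out_h, out_w):
    """Without sharing, every output position gets its own weights and bias."""
    per_position = (kernel_h * kernel_w * in_channels + 1) * num_filters
    return per_position * out_h * out_w


# Assumed example layer: 3x3 kernel, RGB input, 64 filters, 32x32 output positions.
shared = shared_conv_params(3, 3, 3, 64)
unshared = locally_connected_params(3, 3, 3, 64, 32, 32)

print(shared)    # 1792
print(unshared)  # 1835008
```

With these assumed sizes, the locally connected layer needs 1,024 times as many parameters as the shared one, one full copy of the weights per output position.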
Three hyperparameters affect the size of the output and need to be set before training. These include:
1. Number of filters, which affects the depth of the output. For example, three distinct filters yield three different feature maps.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values greater than two are rare, a larger
stride yields a smaller output.
3. Zero padding is used when the filter does not fit the input image. This sets all elements that fall outside of the input matrix to zero, producing a larger or
equally sized output. Common types of padding include:
Valid padding: This is also known as no padding. In this case, the last convolution is dropped if dimensions do not align.
Same padding: This padding ensures that the output layer has the same size as the input layer.
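The interaction of kernel size, stride and padding can be made concrete with the standard output-size relation, output = floor((W - F + 2P) / S) + 1, where W is the input size, F the kernel size, P the padding per side and S the stride. The helper names below are assumptions made for this sketch.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Padding per side that preserves the size for stride 1 and an odd kernel."""
    return (kernel_size - 1) // 2


# 32x32 CIFAR-10 image with a 3x3 kernel.
print(conv_output_size(32, 3))                           # valid padding: 30
print(conv_output_size(32, 3, padding=same_padding(3)))  # same padding: 32
print(conv_output_size(32, 3, stride=2, padding=1))      # larger stride: 16
```

The last line shows why a larger stride yields a smaller output: the kernel visits only every second position.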
• After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing
nonlinearity to the model.
Figure: Convolution of an image with a filter, illustrating how a convolutional layer operates.
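The sliding dot product performed by a convolutional layer can be sketched in plain Python. The 3x4 toy image and the vertical-edge kernel are assumptions chosen for illustration. As in most CNN libraries, the kernel is applied without flipping (strictly speaking a cross-correlation), with valid padding and a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and return the feature map (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch under it.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu_map(feature_map):
    """Apply ReLU element-wise to a feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

print(relu_map(conv2d_valid(image, vertical_edge)))  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the edge lies, and the same kernel weights are reused at every position.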
The ReLU activation function makes the network converge much faster as it does not saturate (when x > 0) and is computationally
efficient. It is defined as f(x) = max(0, x), but it suffers from a drawback: when x < 0 the output and the gradient are both zero, so the neuron remains inactive in the forward pass
and its weights are not updated during backpropagation. As a result, such neurons stop learning. Hence, we use leaky ReLU [8], which is
defined as f(x) = max(αx, x), where α is a small positive constant.
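The difference between ReLU and leaky ReLU can be sketched in a few lines of Python. The negative slope of 0.01 is an assumed, commonly used default and is not a value taken from this project.

```python
def relu(x):
    """Standard ReLU: zero for negative inputs."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: zero for negative inputs, so no weight update flows back."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: a small slope keeps negative inputs from being zeroed out."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient of leaky ReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else negative_slope


for value in (-2.0, 3.0):
    print(value, relu(value), relu_grad(value), leaky_relu(value), leaky_relu_grad(value))
```

For the negative input, ReLU yields a zero gradient while leaky ReLU still passes a small gradient, which is what lets the neuron keep learning.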
b). Pooling Layer: The pooling layer reduces the size of the feature map while keeping the most important information. Thus, it helps to
reduce the computation in the model.
Pooling is done in two ways: Max Pooling or Average Pooling.
1. Max Pooling: Of all the features in a region, only the most prominent one is kept. Max Pooling is generally used [5]. It considers only the maximum value
within each kernel window, moving according to the stride, and emphasises strong responses such as vertical and horizontal edges. It reduces the image size and
makes the computation faster.
However, max pooling may be avoided for large datasets, as it reduces the size of the image significantly and thus discards
image detail.
2. Average Pooling: It is a technique for reducing the size of the feature maps produced by the convolutional layers in a convolutional neural network
(CNN). It works by dividing the feature map into non-overlapping regions of a fixed size, usually 2x2 or 3x3, and taking the average value of the pixels
within each region as the output. This way, average pooling summarizes the average presence of a feature in a region of the feature map and discards the
less relevant information. Moreover, average pooling provides some degree of translation invariance to the CNN, meaning that the output does not change
much if the input image is slightly shifted or rotated.
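Both pooling variants can be sketched with one small helper that divides a feature map into non-overlapping windows. The function name `pool2d`, its `mode` argument and the 4x4 toy feature map are assumptions made for this illustration.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows using either the max or the average."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]

print(pool2d(feature_map, mode="max"))      # [[7, 2], [2, 8]]
print(pool2d(feature_map, mode="average"))  # [[4.0, 1.0], [1.0, 5.0]]
```

In both cases the 4x4 map shrinks to 2x2, which is the reduction in computation described above.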
c). Fully Connected Layer: A fully connected layer in a convolutional neural network
(CNN) is a layer where each neuron is connected to every neuron in the previous
layer. This means that the input of a fully connected layer is a vector that contains
all the values from the output of the previous layer. A fully connected layer
performs a linear transformation on the input vector, followed by a non-linear
activation function, such as sigmoid, tanh, or ReLU.
To illustrate how a fully connected layer works, let’s consider an example of a CNN
that classifies images into 10 categories. Suppose that the input image has a size of
32 x 32 x 3 (width x height x depth), and it passes through several convolutional
and pooling layers that
produce an output of size 4 x 4 x 16. This output is then flattened into a vector of
size 256 (4 x 4 x 16), which is the input of the first fully connected layer. The first
fully connected layer has 64 neurons, so it has a weight matrix of size 256 x 64 and
a bias vector of size 64. The output of this layer is another vector of size 64, which
is obtained by multiplying the input vector by the weight matrix, adding the bias
vector, and applying an activation function. The output vector of the first fully
connected layer is then fed into the second fully connected layer, which has 10
neurons, corresponding to the 10 classes. The second fully connected layer has a
weight matrix of size 64 x 10 and a bias vector of size 10. The output of this layer is
another vector of size 10, which represents the class scores or probabilities for each
class. The final output of the network is obtained by applying a SoftMax function to
this vector, which normalizes it to sum up to one.
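The sizes in this example can be checked with a short sketch. The helper names are assumptions, the layer sizes (256, 64, 10) come from the example above, and the three scores passed to `softmax` are made-up values.

```python
import math


def dense_param_count(n_in, n_out):
    """Weights (n_in x n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out


def dense(inputs, weights, biases):
    """Linear part of a fully connected layer; weights[i][j] links input i to neuron j."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(scores):
    """Turn class scores into probabilities that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


flattened = 4 * 4 * 16
print(flattened)                         # 256
print(dense_param_count(flattened, 64))  # 16448
print(dense_param_count(64, 10))         # 650

probabilities = softmax([2.0, 1.0, 0.1])
print(sum(probabilities))
```

The first fully connected layer therefore holds 16,448 parameters and the second 650, and the softmax output can be read as one probability per class.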
In each layer, the neuron applies a linear transformation to the input vector through a weights
matrix. A non-linear transformation is then applied to the product through a
non-linear activation function f.
This illustrates how a CNN recognises an object by detecting its different features; here the object is a koala.
A feature map is produced by the convolution layer using a grid of filters.
The result is then passed to more sophisticated filters that detect more complex features of the object.
Afterwards, the 3D output of the convolutions is flattened into a 1D array to feed a
fully connected dense neural network.
The dense neural network helps to classify a variety of objects
with similar features in a more generic way.
ReLU is used to introduce non-linearity, which lets the network learn more
general features. This is how the CNN model functions.
Training Model
• Convolutional neural networks (CNNs) are a type of deep
learning model that uses convolution operations rather than
general matrix multiplication. They have been frequently employed in
classification problems recently, particularly in image
recognition, and can automatically extract discriminative
features through the training process.
• ResNet-50 is based on a deep residual learning framework
that allows for the training of very deep networks with
hundreds of layers.
• ResNet-50 consists of 50 layers that are divided into 5
stages: an initial convolution stage followed by four stages of residual
blocks. The residual
blocks allow for the preservation of information from earlier
layers, which helps the network to learn better
representations of the input data.
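The idea of a residual block, where the input is added back to the output of the block's layers, can be sketched as follows. The function names and the toy transforms are assumptions for illustration; an actual ResNet-50 block uses stacked convolutions and batch normalisation as the transform.

```python
def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Compute relu(F(x) + x), where F is the block's learned transform."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input size for the skip connection")
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]

# If the transform contributes nothing, the input still passes through the skip connection.
print(residual_block(x, lambda v: [0.0 for _ in v]))  # [1.0, 0.0, 3.0]

# A toy transform that halves each value, added on top of the input.
print(residual_block(x, lambda v: [0.5 * e for e in v]))  # [1.5, 0.0, 4.5]
```

The first call shows how information from earlier layers is preserved even when the block itself contributes nothing.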
Architecture of Model
Architecture Of ResNet50
• The 50-layer ResNet architecture includes the following elements, as
shown in the table below:
• A 7×7 convolution with 64 kernels and a stride of 2.
• A max pooling layer with a stride of 2.
• 9 more layers: a 1×1 convolution with 64 kernels, a 3×3 convolution with 64
kernels, and a 1×1 convolution with 256 kernels. These 3 layers are
repeated 3 times.
• 12 more layers with 1×1,128 kernels, 3×3,128 kernels, and 1×1,512
kernels, iterated 4 times.
• 18 more layers with 1×1,256 kernels, 3×3,256 kernels, and
1×1,1024 kernels, iterated 6 times.
• 9 more layers with 1×1,512 kernels, 3×3,512 kernels, and 1×1,2048
kernels, iterated 3 times.
• (Up to this point the network has 49 convolutional layers; the fully connected layer below is the 50th.)
• Average pooling, followed by a fully connected layer with 1000
nodes, using the softmax activation function.
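The layer count of the architecture listed above can be verified with simple arithmetic. The stage names (`conv2_x` to `conv5_x`) follow the naming convention of the ResNet paper and are used here only as labels; pooling layers are not counted because they have no weights.

```python
# One 7x7 convolution at the start of the network.
stem_conv_layers = 1

# (stage name, convolutions per residual block, number of repeated blocks)
stages = [
    ("conv2_x", 3, 3),
    ("conv3_x", 3, 4),
    ("conv4_x", 3, 6),
    ("conv5_x", 3, 3),
]

stage_layers = {name: convs * repeats for name, convs, repeats in stages}
conv_layers = stem_conv_layers + sum(stage_layers.values())
fully_connected_layers = 1
total_layers = conv_layers + fully_connected_layers

print(stage_layers)  # {'conv2_x': 9, 'conv3_x': 12, 'conv4_x': 18, 'conv5_x': 9}
print(conv_layers)   # 49
print(total_layers)  # 50
```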
Results
• The aim of the proposed method is to recognise
objects accurately, with the lowest loss and
highest accuracy. In light of this, we measured
accuracy and loss. Accuracy is the ratio of
correctly predicted observations to all
observations, expressed as a percentage.
• Collecting the losses over 10 epochs, we obtain
the final loss and the accuracy.
• The observation is that the pre-trained model
when trained with a plain neural network gave an
accuracy of around 31.6%, and when the
pre-trained model is trained with ResNet50 the
accuracy increases to 71.7%.
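The accuracy measure used above is the share of correctly predicted observations among all observations. A minimal sketch is given below; the label lists are made-up examples and are not the project's predictions or results.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical predictions for four CIFAR-10 images.
predicted_labels = ["cat", "dog", "ship", "frog"]
actual_labels = ["cat", "dog", "truck", "frog"]

print(f"{accuracy(predicted_labels, actual_labels):.1%}")  # 75.0%
```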
Fig-1 Accuracy after 10 epochs
Fig-2 Loss after 10 epochs
• Fig 2 shows the loss during the training and validation process. The loss function denotes the degree of error in the predictions made
during the training and testing process.
• Fig 1 shows the accuracy during the training and validation process. Figs 1 and 2 show the model performance in the form of accuracy
and loss with respect to epochs. The output demonstrates the model’s accuracy after 10 epochs.
• After each epoch, the training performance progressively improved and then stayed stable.
• On the validation set for the classification of objects (CIFAR-10 dataset), an average accuracy of 71% was attained after 10
epochs. With each epoch the loss likewise decreased, and the entire model took 41 seconds to predict the labels.
Future Work To Be Done
Increasing the accuracy of our model
from 71.3% to somewhere around
90%.
Adding some more features like
displaying some short, easy, interactive
text along with images to aid learning.
Adding more relevant classes to the
existing dataset (CIFAR-10) to make our
model more usable in real life.
THANK YOU

machine learning based object recognition using cifar-10

  • 1.
    MACHINE LEARNING BASEDOBJECT RECOGNITION USING CIFAR-10 Made by: • 2020UEC2528 MAYANK DEWAN • 2020UEC2539 RUDRAKSH SHARMA • 2020UEC2559 GAURAV RAJORA Under the supervision of: Dr. Satya Prakash Singh
  • 2.
    CONTENTS • Introduction • LiteratureSurvey • Gaps Identified • Objectives Of The Project • Problem Statement • Solution Methodology • Results • Future Work To Be Done • Expected Outcomes Of The Work
  • 3.
    Introduction Preschool education, alsoknown as early childhood education or Kindergarten is typically the first stage of formal education that children receives. Its primary focus is on preparing young learners for a smooth transition into more structured and formal education. It holds immense significance in a child's development, providing a strong foundation for lifelong learning and success. Key aspects of education in kindergarten include: • Introduction to Academic Basics: Kindergarten introduces children to basic academic concepts such as letters, numbers, shapes, and colours. It aims to familiarize them with the fundamental building blocks of reading, writing, and math. • Language and Literacy Development: Kindergarteners are exposed to language development activities that include vocabulary building, listening skills, and basic phonics. They may begin to recognize and write letters and words. • Social and Emotional Skills: Kindergarten is a significant stage for the development of social and emotional skills. Children learn how to interact with their peers, share, take turns, and express their feelings in appropriate ways. Machine learning can be applied in various ways to enhance and support preschool learning, benefiting both educators and young learners. One such application is image classification which can help young learners to learn about objects like scale, pencil, ball, newspaper, books, animals, pets and learn about them. User-friendly interfaces, colourful visuals can aide learning and make it interactive and easy.
  • 4.
    Object Recognition isa computer vision task that involves identifying and classifying objects within an image or a video stream. The goal of object recognition is to determine what objects are present in the scene and assign them to predefined categories or classes. It outputs the names or labels of the objects present in the image or video. Object Detection, on the other hand, is a related computer vision task that goes beyond object recognition. Object detection not only identifies the objects within an image or video but also provides information about their location or spatial extent in the image. It outputs both the object labels and the coordinates of bounding boxes around each detected object. Image classification using the CIFAR-10 dataset is a common computer vision task. CIFAR-10 is a dataset containing 60,000 32x32 colour images in 10 different classes, with 6,000 images per class. The objective is to train a machine learning or deep learning model to classify these images into their respective categories. Convolutional Neural Networks (CNNs) are a powerful and widely used architecture for image classification tasks involving the CIFAR-10 dataset. CNNs automatically learn to extract hierarchical features from images. They use convolutional layers to detect edges, textures, and more complex patterns in the images, which is essential for recognizing objects in CIFAR-10. CNNs are designed to capture spatial hierarchies in images. Lower layers detect basic features like edges, while higher layers combine these features to recognize more complex objects or patterns. CNNs use weight sharing in convolutional layers, which significantly reduces the number of parameters in the model. This makes it feasible to train deep networks without an impractical number of parameters. CNNs exhibit translation invariance, meaning they can recognize patterns regardless of their position in the image. 
This is crucial for classifying objects that can appear anywhere in a picture. Data augmentation techniques, such as image rotation, scaling, and flipping, can be easily applied to image data, increasing the diversity of the training dataset and improving model generalization.
  • 5.
    Literature Survey A largenumber of articles and research papers have been studied to gather the information regarding convolutional neural network techniques including IEEE research papers. Different image recognition models and different libraries have also been searched on Kaggle to decide the best model suitable for us. Datasets like CIFAR 10, CIFAR 100 and VGG16 were studied thoroughly to look for their advantages and disadvantages over one another. Different CNN models were considered, and ResNet-50 is utilised along with CIFAR 10. MODELS REFFERED 1. "Learning Multiple Layers of Features from Tiny Images" • Author: Alex Krizhevsky • Description: This paper introduced the CIFAR-10 dataset and presented a deep convolutional neural network (CNN) that achieved state-of-the-art performance at the time. It laid the foundation for using deep learning in image classification. Abstract: The paper primarily focuses on using deep CNNs for image classification and introduces the CIFAR-10 dataset as a benchmark for evaluating these models. The CIFAR-10 dataset comprises 60,000 32x32 color images in ten different classes, making it a suitable choice for testing deep learning architectures.
  • 6.
    2. "Very DeepConvolutional Networks for Large-Scale Image Recognition" • Authors: Karen Simonyan, Andrew Zisserman • Description: This paper introduced the VGGNet architecture, which significantly deepened the networks used for image classification. It has had a major influence on the design of deep CNNs for CIFAR-10 and other datasets. • Abstract: The paper focuses on the development of deep convolutional neural network (CNN) architectures and their application to large-scale image recognition. The authors present a network called VGG (Visual Geometry Group)Net, characterized by its depth and uniformity, which makes it effective for various computer vision tasks, including image classification. 3."Residual Networks" Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Description: This paper introduced ResNet, a deep neural network architecture with residual connections. ResNet revolutionized image classification by enabling the training of very deep networks, leading to improved performance on CIFAR-10 and other datasets. Abstract: The paper addresses the problem of training very deep neural networks, which are prone to vanishing or exploding gradients during training. To mitigate these issues, the authors propose the use of residual connections, which allow for the training of networks with hundreds or even thousands of layers. The resulting architecture is called a Residual Network, or ResNet.
  • 7.
    Gaps Identified Preschool educationplays a critical role in a child's early development and lays the foundation for their future academic and social success. However, like any educational system, pre schooling have gaps or shortcomings that can be identified and analysed . Here are some common gaps and shortcomings in preschool education - 1. Classroom Size: The student-to-teacher ratio is a critical factor in the quality of early education. Statistics can show if classrooms are overcrowded, which can negatively impact individualized attention and learning outcomes. 2. Quality Of Teaching: Many preschools lack trained teachers and age-appropriate curriculum, leading to variations in the educational experience provided to young children. 3. Access Disparities: Many children, especially those from disadvantaged backgrounds, lack access to quality preschool programs. Limited availability of preschools and financial barriers can result in unequal access to early education. 4. Curriculum Alignment: The curriculum in some preschool programs may not be developmentally appropriate or aligned with best practices in early childhood education, impacting children's readiness for kindergarten. There is also a lack of proper ML model that focuses on improvement of elemtary education that utilises object classfication using datasets.
  • 8.
    Objective Of TheProject To make a ML model which can classify/recognise objects using a pre trained datasets. To make our model more efficient and real life applicable. To increase the accuracy of the model by using ResNet50 CNN model.
  • 9.
    Problem Statement • Preschooleducation plays a crucial role in the cognitive, social, and emotional development of young children. However, it often faces challenges related to accessibility, quality, and individualized learning experiences. Presence of adequate number of teachers and innovative teaching methodology has also become difficult to find. According to various reports, the gross enrolment ratio (GER) for preschool education in India is low, with significant disparities between urban and rural areas. • The quality of preschool education is a major concern. Many preschools lack trained teachers and age-appropriate curriculum, leading to variations in the educational experience provided to young children. Insufficient infrastructure, including the lack of proper facilities, materials, and play equipment in preschools, affects the quality of education provided. • Teachers are not able to focus on every child due to lack of time and increased number of children in a class. • Therefore, there is a need for innovation and to address these issues, there is a growing interest in leveraging technology, specifically machine learning, to enhance and personalize preschool education.
  • 10.
    Methodology Understanding the CNN CNN(convolution Neural network) is, Neural Network are a subset of machine learning, and they are at the heart of deep learning algorithms. They are comprised of node layers, containing an input layer, one or more hidden layers, and an output layer. Each node connects to another and has an associated weight and threshold. a). Convolution Layer: The convolutional layer is the core building block of a CNN, and it is where most of the computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s assume that the given input will be colour image, which is made up of a matrix of pixels in 3D. This means the input will have 3 dimensions— height, width, and depth—which correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter, which will move across the receptive fields of the image [2], checking if the feature is present. This process is known as a convolution. • The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or a convolved feature. the weights in the feature detector remain fixed as it moves across the image, which is also known as parameter sharing. • A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2- dimensional slice of depth as adepth slice, the neurons in each depth slice are constrained to use the same weights and bias.
  • 11.
    Sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have a specific centred structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centred in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme and simply call the layer a "locally connected layer".
    Three hyperparameters affect the size of the output of a convolutional layer:
    1. Number of filters: this affects the depth of the output. For example, three distinct filters produce three different feature maps.
    2. Stride: the distance, in pixels, that the kernel moves over the input matrix at each step. Although stride values of two or greater are less common, a larger stride yields a smaller output.
    3. Zero padding: used when the filter does not fit the input image. It sets all elements that fall outside the input matrix to zero, producing a larger or equally sized output. Padding types include:
    Valid padding: This is also known as no padding. In this case, the last convolution is dropped if the dimensions do not align.
    Same padding: This padding ensures that the output layer has the same size as the input layer.
    • After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity to the model. The accompanying image shows the convolution of an image with a filter.
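The effect of kernel size, stride, and zero padding on the output size follows the standard relation output = floor((input + 2 × padding − kernel) / stride) + 1 along each spatial axis. The helper below is a small sketch of that arithmetic; the function name `conv_output_size` and the example values are assumptions made for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# A 32-pixel-wide CIFAR-10 image with a 3x3 kernel:
valid = conv_output_size(32, 3)                        # no padding
same = conv_output_size(32, 3, padding=1)              # one ring of zeros
strided = conv_output_size(32, 3, stride=2, padding=1)

print(valid, same, strided)  # 30 32 16
```

With no padding the output shrinks from 32 to 30, one ring of zeros keeps it at 32, and a stride of 2 halves it to 16.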
  • 12.
    Illustration of how a convolutional layer operates.
    The ReLU activation function makes the network converge much faster because it does not saturate (when x > 0) and is computationally efficient. It is defined as f(x) = max(0, x). It has a drawback, however: when x < 0, the neuron stays inactive during the forward pass and its weights are not updated during backpropagation, so that neuron does not learn. Hence, we use leaky ReLU [8], which is defined as f(x) = x for x > 0 and f(x) = αx otherwise, where α is a small positive slope (commonly 0.01).
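The difference between the two activations can be seen on a few sample values. This is a minimal sketch; the slope `alpha=0.01` is a commonly used default assumed here, not a value taken from the slides.

```python
def relu(x):
    """Standard ReLU: negative inputs are clamped to zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: negative inputs keep a small slope (alpha is an assumed default)."""
    return x if x > 0 else alpha * x


for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), leaky_relu(value))
```

For a negative input, ReLU returns 0 and therefore has zero gradient, whereas leaky ReLU returns a small negative value whose gradient is `alpha`, so the weight can still be updated.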
  • 13.
    b). Pooling Layer: The pooling layer reduces the size of the feature map while keeping its most important values, which reduces the computation in the model. Pooling is done in two ways, average pooling or max pooling; max pooling is generally used [5].
    1. Max Pooling: The scope is narrowed so that, of all the features in a region, only the most important one is kept. Max pooling takes only the maximum value inside each window, where the windows are determined by the kernel size and the stride, and it tends to emphasise vertical and horizontal edges. It reduces the image size and makes computation faster. Max pooling does, however, shrink the image significantly and thereby changes the image parameters, so it may not be suitable in every case, for example with large datasets.
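Max pooling with a 2×2 window and a stride of 2 can be sketched as follows. The function name `max_pool2d` and the 4×4 input values are hypothetical and chosen only to make the result easy to check by hand.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

max_pooled = max_pool2d(feature_map)
print(max_pooled)  # [[6, 4], [8, 9]]
```

The 4×4 map is reduced to 2×2, so later layers process a quarter of the values.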
  • 14.
    2. Average Pooling: This technique reduces the size of the feature maps produced by the convolutional layers in a convolutional neural network (CNN). It works by dividing the feature map into non-overlapping regions of a fixed size, usually 2x2 or 3x3, and taking the average of the values within each region as the output. In this way, average pooling summarizes the average presence of a feature in a region of the feature map and discards the less relevant information. Moreover, average pooling gives the CNN some degree of translation invariance, meaning that the output does not change much if the input image is slightly shifted.
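Average pooling over non-overlapping 2×2 regions can be sketched in the same way. The function name `average_pool2d` and the 4×4 input values are hypothetical.

```python
def average_pool2d(feature_map, size=2):
    """Average each non-overlapping size x size region."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

avg_pooled = average_pool2d(feature_map)
print(avg_pooled)  # [[3.75, 2.25], [4.5, 4.0]]
```

Each output value reflects the whole region rather than its single strongest response, which is why average pooling gives a smoother summary than max pooling.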
  • 15.
    c). Fully connected: A fully connected layer in a convolutional neural network (CNN) is a layer in which each neuron is connected to every neuron in the previous layer. The input of a fully connected layer is therefore a vector that contains all the values from the output of the previous layer. A fully connected layer performs a linear transformation on the input vector, followed by a non-linear activation function such as sigmoid, tanh, or ReLU.
    To illustrate how a fully connected layer works, consider a CNN that classifies images into 10 categories. Suppose that the input image has a size of 32 x 32 x 3 (width x height x depth) and passes through several convolutional and pooling layers that produce an output of size 4 x 4 x 16. This output is flattened into a vector of size 256 (4 x 4 x 16), which is the input of the first fully connected layer. The first fully connected layer has 64 neurons, so it has a weight matrix of size 256 x 64 and a bias vector of size 64. Its output is a vector of size 64, obtained by multiplying the input vector by the weight matrix, adding the bias vector, and applying an activation function.
    The output vector of the first fully connected layer is then fed into the second fully connected layer, which has 10 neurons, corresponding to the 10 classes. The second fully connected layer has a weight matrix of size 64 x 10 and a bias vector of size 10. Its output is a vector of size 10, which represents the class scores for each class. The final output of the network is obtained by applying a softmax function to this vector, which normalizes it so that the values sum to one.
    In short, each neuron applies a linear transformation to the input vector through a weight matrix, and a non-linear activation function f is then applied to the result.
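The shapes in this example (256 → 64 → 10) can be checked with a small sketch in plain Python. The random weights, the seed, and the helper names `dense`, `softmax`, and `init_layer` are assumptions for illustration; a trained network would use learned weights.

```python
import math
import random


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: weighted sum plus bias, then optional activation."""
    outputs = []
    for j in range(len(biases)):
        z = sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        outputs.append(activation(z) if activation else z)
    return outputs


def relu(z):
    return max(0.0, z)


def softmax(scores):
    """Normalise scores so that they are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Hypothetical random initialisation: n_in x n_out weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(0)
flattened = [rng.random() for _ in range(4 * 4 * 16)]  # stand-in for the 4x4x16 output
w1, b1 = init_layer(256, 64, rng)
w2, b2 = init_layer(64, 10, rng)

hidden = dense(flattened, w1, b1, relu)          # vector of size 64
probabilities = softmax(dense(hidden, w2, b2))   # vector of size 10

# Weights plus biases of both layers.
param_count = 256 * 64 + 64 + 64 * 10 + 10
print(len(hidden), len(probabilities), param_count)
```

The two layers together hold 17,098 trainable parameters, most of them in the first 256 x 64 weight matrix.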
  • 16.
    This illustrates how a CNN recognizes an object by detecting its different features; here the object is a koala. The convolution layer applies a grid of filters to the image to build feature maps. These are passed to further, more sophisticated filters that detect more general features of the object. The resulting 3D feature maps are then flattened into a 1D array, which feeds a fully connected dense neural network. The dense network combines the detected features so that different objects with similar features can be classified in a more generic way. ReLU is applied to introduce non-linearity, which lets the network learn more general features. This is how the CNN model functions.
  • 17.
    Training Model
    • Convolutional neural networks (CNNs) are a type of deep learning model that uses convolution operations rather than general matrix multiplication in at least one layer. They have been widely used for classification problems in recent years, particularly image recognition, and can automatically extract discriminative features during training (CNN Attributes).
    • ResNet-50 is based on a deep residual learning framework that allows very deep networks, with hundreds of layers, to be trained.
    • ResNet-50 consists of 50 layers divided into 5 blocks, each containing a set of residual blocks. The residual blocks preserve information from earlier layers, which helps the network learn better representations of the input data.
    Architecture of Model
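The way a residual block preserves information from earlier layers can be sketched as output = ReLU(F(x) + x), where F stands for the block's convolutional layers. In the sketch below, `transform` is a hypothetical stand-in for F, and plain lists stand in for feature maps.

```python
def residual_block(x, transform):
    """Add the block's input back onto its transformed output, then apply ReLU."""
    fx = transform(x)
    return [max(0.0, f + xi) for f, xi in zip(fx, x)]


features = [1.0, 2.0, 3.0]

# If the inner layers contribute nothing, the input passes through unchanged.
passthrough = residual_block(features, lambda v: [0.0] * len(v))

# Otherwise the block's output is added on top of the input.
refined = residual_block(features, lambda v: [2 * a for a in v])

print(passthrough, refined)  # [1.0, 2.0, 3.0] [3.0, 6.0, 9.0]
```

Because the shortcut carries the input forward directly, a block only has to learn a correction to its input, which is what makes very deep networks easier to train.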
  • 18.
    Architecture Of ResNet50
    • The 50-layer ResNet architecture includes the following elements, as shown in the table below:
    • A 7×7 convolution with 64 kernels and a stride of 2.
    • A max pooling layer with a stride of 2.
    • 9 more layers: a 1×1 convolution with 64 kernels, a 3×3 convolution with 64 kernels, and a 1×1 convolution with 256 kernels. These 3 layers are repeated 3 times.
    • 12 more layers with 1×1, 128 kernels, 3×3, 128 kernels, and 1×1, 512 kernels, repeated 4 times.
    • 18 more layers with 1×1, 256 kernels, 3×3, 256 kernels, and 1×1, 1024 kernels, repeated 6 times.
    • 9 more layers with 1×1, 512 kernels, 3×3, 512 kernels, and 1×1, 2048 kernels, repeated 3 times.
    • (Up to this point the network has 49 convolutional layers; the fully connected layer below brings the total to 50.)
    • Average pooling, followed by a fully connected layer with 1000 nodes, using the softmax activation function.
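The layer count can be verified with a short calculation. Pooling layers have no weights and are not counted; the stage names below follow the labels commonly used for ResNet and are included only for readability.

```python
# (stage name, weighted layers per block, number of blocks)
stages = [
    ("conv1", 1, 1),    # 7x7 convolution, 64 kernels, stride 2
    ("conv2_x", 3, 3),  # 1x1,64 / 3x3,64 / 1x1,256
    ("conv3_x", 3, 4),  # 1x1,128 / 3x3,128 / 1x1,512
    ("conv4_x", 3, 6),  # 1x1,256 / 3x3,256 / 1x1,1024
    ("conv5_x", 3, 3),  # 1x1,512 / 3x3,512 / 1x1,2048
]

conv_layers = sum(layers * blocks for _, layers, blocks in stages)
total_layers = conv_layers + 1  # plus the final fully connected layer

print(conv_layers, total_layers)  # 49 50
```

The convolutional stages contribute 1 + 9 + 12 + 18 + 9 = 49 layers, and the fully connected classifier is the 50th.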
  • 19.
    Results
    • The aim of the proposed method is to recognise objects accurately, with the lowest loss and the highest accuracy. We therefore measured accuracy and loss. Accuracy, the performance indicator used here, is the ratio of correctly predicted observations to all observations.
    • Collecting the losses from all 10 epochs, we obtain the final loss and the accuracy.
    • The observation is that the model trained with a plain neural network gave an accuracy of around 31.6%, whereas the pre-trained ResNet50 model raised the accuracy to 71.7%.
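The accuracy measure described above can be written as a short function. The label lists below are hypothetical values invented for illustration and are not results from this project.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical class indices (0-9) for ten images.
predicted_labels = [3, 8, 8, 0, 6, 6, 1, 6, 3, 1]
true_labels = [3, 8, 8, 0, 6, 6, 1, 2, 5, 9]

score = accuracy(predicted_labels, true_labels)
print(f"{score:.1%}")  # 70.0%
```

Seven of the ten hypothetical predictions match, giving an accuracy of 70%.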
  • 20.
    Fig-1 Accuracy after 10 epochs
    Fig-2 Loss after 10 epochs
    • Fig 2 shows the loss during the training and validation process. The loss function measures the degree of error in the predictions made during training and testing.
    • Fig 1 shows the accuracy during the training and validation process. Together, Figs 1 and 2 show the model's performance as accuracy and loss with respect to iterations. The output demonstrates the model's accuracy after 10 iterations.
    • With each iteration, the training performance improved progressively and then remained stable.
    • On the validation set for the classification of objects (CIFAR-10 dataset), an average accuracy of 71% was attained after 10 iterations, or epochs. The loss likewise decreased with each iteration, and the entire model took 41 seconds to predict the labels.
  • 21.
    Future Work To Be Done
    • Increasing the accuracy of our model from 71.3% to around 90%.
    • Adding more features, such as displaying short, easy, interactive text alongside the images to aid learning.
    • Adding more relevant classes to the existing dataset (CIFAR-10) to make our model more realistic and usable.
  • 22.