Human Action Recognition with CNN is a thesis paper based on background reduction using Mask R-CNN. Using a 3D CNN, the results are evaluated on two base models: ResNet50 and VGG16.

1
Research Paper
Presentation Based On
Human Action Recognition

2
Advisor
Dr. Md. Abu Layek
Associate Professor
Department of Computer Science and Engineering
Jagannath University
Md Monirul Islam
ID: B170305034
Department of Computer Science
& Engineering
monirulshahinme2@gmail.com
Shazid Ahmed Rajib
ID: B170305049
Department of Computer Science &
Engineering
shazidahmed159@gmail.com
Human Action Recognition with Background Subtraction
and 3D CNN

3
Introduction
    Problem Statement
    Motivation
    Proposed Solution
    Background Study
Literature Review
CNN Architecture
    VGG16
    ResNet
Materials
Methodology
    Tools
    Proposed Methodology
Evaluations and Results
Conclusion & Possible Improvements
    Summary
    Limitations & Future Directions

4
Introduction Literature Review CNN Architecture Materials Methodology Evaluation Conclusion
As described by the author, the reason for the lower accuracy is that
some of the background elements in these classes are the same.
Hence, our goal is to eliminate the background elements using
preprocessing techniques.

5

6
How does deep learning help with human action recognition?
- Feature Extraction: It automates the extraction of relevant
features from raw data, which is crucial for recognizing human
actions.
- Neural Networks: Utilizes complex neural networks capable of
processing large volumes of video data to identify intricate action
patterns.
- Spatial-Temporal Analysis: Employs models like CNNs and RNNs
to capture spatial and temporal dependencies, thereby improving
recognition accuracy.

7
• Lower accuracy in a few classes (Biking, Swing, Walking with Dog)
• Caused by the same background elements appearing across classes
• Low input resolution
Solution
1. Clear the background noise as much as possible.
2. Develop an automatic background removal system to speed up
the process.

8
1. HAR remains a significant challenge for various reasons.
2. The use of cameras has expanded.
3. HAR can help identify crime or violence.

9
Data Preprocessing
Data Background Noise Reduction
Multiple CNN Architectures
Result

10
• Deep learning is a subfield of machine learning based on artificial neural networks (ANNs).
Neural networks are either shallow or deep.
Shallow neural network
It consists of
• an input layer
• one hidden layer
• an output layer
Deep neural network
It consists of
• an input layer
• more than one hidden layer
• an output layer

11
• In deep learning, the hidden units in the hidden layers act like biological neurons.
• Each hidden unit is called a neuron.
• The network takes inputs from the input layer, processes them in each hidden
unit to reach a decision, and passes the outputs from one hidden layer to the
next.
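The flow of inputs through hidden layers described above can be sketched with a small NumPy forward pass. This is only an illustration: the layer sizes, the random weights, and the helper name `forward` are assumptions, not values taken from the thesis.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(0, x)

def forward(x, layers):
    """Pass x through each (weights, bias) pair.

    Hidden layers apply ReLU; the last layer is left linear.
    """
    for weights, bias in layers[:-1]:
        x = relu(x @ weights + bias)
    weights, bias = layers[-1]
    return x @ weights + bias

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 inputs, two hidden layers of 8 units, 3 outputs.
# More than one hidden layer makes this a "deep" network.
sizes = [4, 8, 8, 3]
layers = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

batch = rng.normal(size=(2, 4))   # two example inputs
output = forward(batch, layers)   # one row of 3 outputs per input
```

Dropping one of the two hidden sizes from `sizes` would turn the same code into a shallow network with a single hidden layer.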

12
• In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of
deep neural networks most commonly applied to analyzing visual imagery.
• A CNN model consists of three types of layers:
• Convolutional layer
• Pooling layer
• Fully connected layer

13
• Convolutional layer:
• Convolutional layers convolve the input and pass the result to the next layer.
• This layer extracts features using various kernels (filters).
• The objective of the convolution operation is to extract features such as
edges from the input image.
• The first convolutional layer captures low-level features such as color and
gradient orientation. With added layers, the architecture adapts to high-level
features as well.
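To make the kernel idea above concrete, here is a minimal NumPy sketch of a single convolution pass. The toy image and the hand-picked vertical-edge kernel are assumptions chosen for illustration; in a trained CNN the kernel values are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and return the feature map.

    Like most deep learning frameworks, this computes cross-correlation,
    which is what CNN layers call "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark columns on the left (0), bright columns on the right (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Illustrative vertical-edge kernel: responds where brightness rises left to right.
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = conv2d_valid(image, kernel)
# Each row of feature_map is [3, 3, 0]: strong response at the edge, none in the flat region.
```

The output is smaller than the input (3x3 from 5x5) because no padding is used.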

14
• Convolutional layer:

15
• Pooling layer:
• The pooling layer reduces the spatial size of the convolved feature.
• This decreases the computational power required to process the data through
dimensionality reduction.
• There are two types of pooling:
1. Max pooling
2. Average pooling
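The two pooling types can be compared on a small feature map. The sketch below assumes non-overlapping 2x2 windows; the function name `pool2d` and the sample values are illustrative only.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Crop so the map divides evenly, then group pixels into size x size blocks.
    blocks = feature_map[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

fmap = np.array([[1, 3, 2, 0],
                 [5, 7, 1, 1],
                 [0, 2, 4, 8],
                 [2, 0, 6, 2]], dtype=float)

max_pooled = pool2d(fmap, mode="max")          # [[7, 2], [2, 8]]
avg_pooled = pool2d(fmap, mode="average")      # [[4, 1], [1, 5]]
```

Both variants shrink the 4x4 map to 2x2, which is the dimensionality reduction the slide refers to.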

17
Reference: Performance Comparison of ResNet50V2 and VGG16 Models for Feature Extraction in Deep Learning
Contribution: The study aimed to compare the performance of ResNet50V2 and VGG16 for feature extraction in image classification tasks.
Drawback: The paper suggests that while both models are effective, VGG16 may be less efficient due to slower convergence and lower accuracy in certain tasks.
Key Contribution: ResNet50V2 outperformed VGG16, exhibiting faster convergence and achieving higher accuracy in the context of masked face recognition.

Reference: Human Action Recognition from Various Data Modalities
Contribution: The paper reviews the use of various data modalities in HAR, including the application of ResNet and VGG16.
Drawback: The review does not provide a direct comparison between the models.
Key Contribution: It highlights the importance of multimodal data for improving the accuracy of HAR systems.

18
Reference: Modern architectures convolutional neural networks in human activity recognition
Contribution: Discusses the role of modern CNN architectures like ResNet and VGG16 in HAR.
Drawback: Specific drawbacks of each model in the context of HAR are not detailed.
Key Contribution: Emphasizes the advancements in CNN architectures that enhance HAR performance.

20
• Here, we have used the following CNN architectures:
• VGG-16
• ResNet-50
• These architectures were successful in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC), which evaluates algorithms for object detection
and image classification at large scale.

21
VGG16 (Visual Geometry Group):
• VGG16 was developed by the Visual Geometry Group at the University of Oxford.
In the ILSVRC (ImageNet) competition in 2014 it placed first in localization
and second in classification.
• It has 16 weight layers.
Layer Type Quantity
Convolutional layer 13
Fully connected layer 3
Total 16
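The layer counts in the table can be checked against the standard VGG16 layout. The sketch below writes that layout as a plain Python list; the variable names are illustrative, and pooling layers are excluded from the total because they carry no weights.

```python
# VGG16 layout: numbers are the output channels of 3x3 conv layers,
# "M" marks a 2x2 max pooling layer (no weights, so not counted in the 16).
VGG16_CONV_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                     512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC_UNITS = [4096, 4096, 1000]  # three fully connected layers

conv_layers = sum(1 for item in VGG16_CONV_LAYOUT if item != "M")
pool_layers = VGG16_CONV_LAYOUT.count("M")
fc_layers = len(VGG16_FC_UNITS)
total_weight_layers = conv_layers + fc_layers

# Each pooling layer halves the spatial size of a 224x224 input.
final_spatial_size = 224 // (2 ** pool_layers)

print(conv_layers, fc_layers, total_weight_layers, final_spatial_size)
```

The last value shows that a 224x224 input reaches the fully connected layers as a 7x7 feature map.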

22
ResNet-50:
• ResNet won the ImageNet challenge (ILSVRC) in 2015.
• ResNet-50 contains 50 layers.
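The figure of 50 layers follows from the standard ResNet-50 layout of four stages of bottleneck blocks. The short tally below uses the usual counting convention, which leaves out the shortcut projection convolutions; the variable names are illustrative.

```python
# Bottleneck blocks per stage in the standard ResNet-50 layout.
BLOCKS_PER_STAGE = [3, 4, 6, 3]
CONVS_PER_BLOCK = 3   # each bottleneck block stacks 1x1, 3x3, and 1x1 convolutions

stem_conv_layers = 1  # the initial 7x7 convolution
fc_layers = 1         # the final fully connected classifier

block_conv_layers = sum(BLOCKS_PER_STAGE) * CONVS_PER_BLOCK
total_layers = stem_conv_layers + block_conv_layers + fc_layers

print(block_conv_layers, total_layers)
```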

24
Pre-training Dataset
• The ImageNet dataset has more than 15 million labeled images belonging to
about 22,000 categories.
Framework
• Keras, an open-source deep learning library written in Python, is used.
Activation Function
• ReLU (Rectified Linear Unit), a non-linear function, is used as the
activation function.
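As a quick illustration of the activation function named above, ReLU can be written in one line of NumPy. The sample values are arbitrary.

```python
import numpy as np

def relu(x):
    """ReLU: keep positive values, replace negative values with zero."""
    return np.maximum(0, x)

activations = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
# Negative inputs become 0; positive inputs pass through unchanged.
```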

26
H/W and S/W Requirements
Tools
CPU 64-bit
RAM 32 GB
Operating System Windows 11
Programming Language Python

27
• Data are collected from Kaggle’s data repository.
• We will be using the UCF101 dataset.
• It has 101 classes of human actions, and each class contains more than
100 videos on average.
• The frames will be extracted from our dataset, and any
background elements will be removed before we begin
processing the data.
• Furthermore, we will keep the images at a resolution of 224×224.
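The frame extraction and 224×224 sizing described above could look roughly like the sketch below. It assumes a clip has already been decoded into a NumPy array, samples 8 evenly spaced frames (the number of frames is an assumption, not stated in the slides), and uses a simple nearest-neighbour resize. In practice a library such as OpenCV would typically handle decoding and resizing.

```python
import numpy as np

def sample_frame_indices(total_frames, num_samples):
    """Return evenly spaced frame indices covering the whole clip."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

def resize_nearest(frame, out_h=224, out_w=224):
    """Nearest-neighbour resize of an (H, W, C) frame to (out_h, out_w, C)."""
    in_h, in_w = frame.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return frame[rows[:, None], cols]

# Hypothetical decoded clip: 40 RGB frames at 320x240, the native UCF101 resolution.
video = np.zeros((40, 240, 320, 3), dtype=np.uint8)

indices = sample_frame_indices(len(video), num_samples=8)
frames = np.stack([resize_nearest(video[i]) for i in indices])
# frames has shape (8, 224, 224, 3), ready for a 224x224 CNN input.
```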

28
• Background subtraction with Mask R-CNN
• Frame extraction
• Training on the frames with the ResNet CNN

29
Background subtraction using MaskRCNN

30
ResNet Model

31
One of the first things we did after gathering the data
was to extract images from each video. After that, we
removed the background, taking into account only the
most crucial components that were required for the
detection of a certain object.
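Once a segmentation model has produced a mask for the person in a frame, removing the background is a matter of keeping only the masked pixels. The sketch below assumes the mask is already available as a boolean array (for example, a thresholded Mask R-CNN output); the function name `remove_background` and the toy data are illustrative, not the thesis code.

```python
import numpy as np

def remove_background(frame, person_mask, fill_value=0):
    """Keep pixels where person_mask is True and fill everything else."""
    cleaned = np.full_like(frame, fill_value)
    cleaned[person_mask] = frame[person_mask]
    return cleaned

# Toy 4x4 RGB frame with uniform brightness 200.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
# Assumed segmentation output: the "person" occupies the central 2x2 region.
person_mask = np.zeros((4, 4), dtype=bool)
person_mask[1:3, 1:3] = True

cleaned = remove_background(frame, person_mask)
# Pixels outside the mask are now black; pixels inside keep their original values.
```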

33
• 80% training and testing accuracy
• More than 90% accuracy on new videos
• Background elements were the issue
Training Accuracy vs. Testing Accuracy and Training Loss vs. Testing Loss of VGG16

34

35
Training Accuracy vs Testing Accuracy And Training Loss vs Testing Loss Of ResNet50

36

37
Used Model Accuracy Precision Recall F-1 Score
ResNet 93.93% 95% 93% 94%
VGG-16 51.68% 47% 56% 52%
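For readers checking the table, the F1 score is the harmonic mean of precision and recall. The sketch below applies that formula to the ResNet row. Note that when scores are averaged per class, a reported F1 need not exactly equal the harmonic mean of the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# ResNet row of the table: precision 95%, recall 93%.
resnet_f1 = f1_score(0.95, 0.93)
print(round(resnet_f1, 2))  # 0.94
```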

39
• The same approach can be applied to various video classification problems.
Limitations
• Lack of a large original dataset with a variety of subjects.
• The study depends only on built-in CNN architectures.

40
Future Directions
• Custom object detection is needed.
• A CNN+LSTM model can be implemented in future work.
• Pose estimation values can be added to the model.

