Abstract
Interest in vision-based surveillance is growing rapidly with the continuous
evolution of computer vision technologies. Applications such as security for
communities and important buildings, traffic surveillance in cities and on expressways,
detection of military targets, and detection of anomalous behavior are all expected to use
vision-based surveillance systems. To make such systems usable, efficient and
effective techniques for analyzing and extracting feature information from image sequences
must be developed. This thesis is dedicated to finding a solution that can be
integrated into existing closed-circuit television systems: an intelligent algorithm
that detects unusual activity and alerts a human operator in real time with the help of Human
Activity Recognition (HAR).
In human activity recognition for video surveillance, a wide range of
applications operate at different levels of autonomy. There are three approaches: the
manual method, the semi-autonomous method, and the fully autonomous structure. In the
manual method, a human operator performs the examination: the video is given as
input, background activity is removed so that only the object of interest and its
action remain, the object's movement is then tracked, its activity is recognized, and a
final decision is formed. In the semi-autonomous method, the
input video is broken down and processed, after which the analysis proceeds free from any intervention.
Automated human activity analysis has been, and remains, a challenging problem. Security
and surveillance are essential concerns in today's world. Any behavior that is uncommon in
occurrence and deviates from customarily understood action can be termed suspicious.
Across application domains, a human activity recognition system fundamentally
involves three stages: segmentation, feature
extraction, and activity classification. This model aims at the automatic detection of abnormal
behavior in surveillance videos.
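The segmentation stage described above can be illustrated with a minimal sketch. This is not the thesis's actual method, just a simple frame-differencing example on synthetic frames: pixels whose intensity changes between consecutive frames are marked as moving foreground, separating the acting object from the static background.

```python
import numpy as np

def segment_motion(prev_frame, frame, threshold=25):
    """Return a binary foreground mask via simple frame differencing.

    A stand-in for the segmentation stage: pixels whose grayscale
    intensity changes by more than `threshold` between consecutive
    frames are marked as moving foreground (value 1).
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Two synthetic 8x8 grayscale frames: a bright 2x2 "object" moves right.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
prev_frame[3:5, 1:3] = 200
frame = np.zeros((8, 8), dtype=np.uint8)
frame[3:5, 4:6] = 200

mask = segment_motion(prev_frame, frame)
print(mask.sum())  # 8 changed pixels: 4 vacated + 4 newly occupied
```

The resulting mask would then feed the feature-extraction and classification stages; real systems typically use more robust background models than raw differencing.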
This research work has three stages. The first stage focuses on automatic
human activity recognition in video surveillance systems for healthcare monitoring, with a
special focus on elderly patients' care, safety arrangements, supervised areas, and
applications designed for smart homes. Sensors and visual devices enable HAR, and there is a
multitude of sensor classifications, such as wearable sensors, sensors tagged to a
target, and sensors tagged to the background. The automated learning methodologies in HAR
are either handcrafted, deep learning, or a combination of both. Handcrafted models can be
regional or holistic recognition models such as RGB, 3D-mapping, and skeleton-data
models, while deep learning models are categorized into generative models such as
long short-term memory (LSTM) networks, discriminative models such as convolutional neural networks
(CNNs), or a synthesis of such models. Several datasets are available for undertaking HAR
analysis and representation. The hierarchy of processes in HAR comprises gathering
information, preliminary processing, feature derivation, and training based on framed
models. The proposed study considers the role of smartphones in HAR, with a particular
interest in keeping a tab on the lifestyle of subjects. Smartphones act as HAR devices through
inbuilt sensors and custom-made applications, and the merits of both handcrafted and deep
learning models are considered in framing a model that enables lifestyle tracking in real
time. This performance-enhanced real-time tracking human activity recognition (PERT-HAR)
model is economical and effective in the accurate identification and representation of the
subjects' actions, and thereby provides more accurate data for real-time investigation and remedial
measures. The model achieves an accuracy of 97–99% in a properly controlled environment.
Despite the benefits of human pose estimation (HPE), it remains a challenging process due to variations in
visual appearance, lighting, occlusion, dimensionality, etc. To resolve these issues, the
second research work presents a squirrel search optimization with deep convolutional
neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-
HPE technique is to identify the human pose accurately and efficiently. Primarily, the video
frame conversion process is performed, and pre-processing takes place via a bilateral-filtering-based
noise removal process. Then, the EfficientNet model is applied to identify the body
points of a person with no problem constraints. Besides, the hyperparameters of the
EfficientNet model are tuned using the squirrel search algorithm (SSA). In the final
stage, the multiclass support vector machine (M-SVM) technique is utilized for the
identification and classification of human poses. The design of bilateral filtering followed by
an SSA-based EfficientNet model for HPE constitutes the novelty of the work. To demonstrate the
enhanced outcomes of the SSDCNN-HPE approach, a series of simulations was executed.
The SSDCNN-HPE methodology accomplished maximum performance, with a
higher accuracy of 0.993 on the Penn Action dataset, where existing models achieved nearly
0.98 and 0.99 accuracy on the same dataset.
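The bilateral-filtering pre-processing step can be sketched as follows. This is a naive reference implementation for illustration (production code would use an optimized library routine such as OpenCV's `bilateralFilter`); the radius and sigma values are example assumptions, not the thesis's tuned parameters.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=30.0):
    """Naive bilateral filter on a 2-D grayscale image.

    Each output pixel is a weighted average of its neighborhood, where
    weights combine spatial closeness (sigma_s) with intensity similarity
    (sigma_r), so noise is smoothed while strong edges are preserved.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    out = np.zeros_like(img)
    # Fixed spatial (Gaussian) weights over the (2r+1)x(2r+1) window.
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    padded = np.pad(img, radius, mode='edge')
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range weights: penalize pixels with dissimilar intensity.
            range_w = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_r**2))
            weights = spatial * range_w
            out[i, j] = (weights * patch).sum() / weights.sum()
    return out

# Flat synthetic image with additive Gaussian noise.
rng = np.random.default_rng(0)
noisy = np.full((16, 16), 100.0) + rng.normal(0.0, 10.0, size=(16, 16))
smoothed = bilateral_filter(noisy)
print(smoothed.std() < noisy.std())  # noise is reduced on the flat region
```

The filtered frames would then be passed to the body-point detection network; the edge-preserving property is what distinguishes this step from plain Gaussian blurring.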
From the literature, it is observed that HAR models developed in unconstrained
environments have several limitations, such as personal interference (PI), electromagnetic (EM)
noise, in-band noise, human movements, and outliers introduced while capturing the input
data. These noises affect the overall performance and robustness of the model. In order
to improve model performance, noise removal techniques are introduced in this work.
Noises such as salt-and-pepper noise, Gaussian noise, and blurring of boundaries, along with outlier
treatment, are processed for the hybrid data acquired using video and sensors in line-of-sight
mode. To remove these noises, a combination of filters (Kalman filter,
MOSSE filter, Butterworth filter, and J filter) is applied to the input data. Through this noise
removal technique, it is observed that the peak-to-sidelobe ratio (PSR) is reduced relative to the raw
data. After removing these noises, features are extracted from the video data using the top layers of the CNN
Inception-v3 model. Similarly, inertial-sensor features from the tri-axial
accelerometer, gyroscope, and magnetometer are collected, and a feature vector is created.
The pyramidal flow feature fusion (PFFF) technique is used to fuse the features extracted from the
video and inertial sensor data. Finally, the fused features are given to an SVM classifier to
perform activity recognition. The proposed experimental methodology has been tested on the
UCF Sports dataset. From the results obtained, it is
observed that performing noise removal and introducing hybrid features improve the
robustness of the HAR model, which achieves an accuracy of 97.4% on
noisy data and 98.7% on noiseless data.
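The fusion-and-classification stage can be sketched with a minimal example. This is not the PFFF technique itself: here fusion is reduced to simple concatenation of synthetic stand-ins for the video (CNN) and inertial feature vectors, followed by an SVM classifier, purely to show the shape of the pipeline. The feature dimensions and class structure are invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic stand-ins: 120 clips with two activity classes, 64-dim
# "video" features (as if from a CNN) and 9-dim "inertial" features
# (as if summarizing tri-axial accelerometer/gyroscope/magnetometer).
n = 120
labels = rng.integers(0, 2, size=n)
video_feats = rng.normal(labels[:, None], 1.0, size=(n, 64))
inertial_feats = rng.normal(labels[:, None], 1.0, size=(n, 9))

# Fusion step: plain concatenation as a placeholder for PFFF.
fused = np.hstack([video_feats, inertial_feats])  # shape (120, 73)

# Train an SVM on the fused hybrid features and evaluate held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(
    fused, labels, test_size=0.25, random_state=0
)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```

Concatenation treats both modalities equally; the appeal of a dedicated fusion scheme such as PFFF is that it can weight and align the modalities rather than simply stacking them.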