Chapter 1_Introduction.docx

Chapter 1: Introduction
This chapter presents the background of the research area, presenting an overview of
Human Action Recognition (HAR) and the challenges and applications of HAR. Moreover,
the need for HAR system, motivation behind this research, the research objectives, social
contributions and the organization of the thesis are also discussed.
1.1. Introduction about Computer Vision
The most superior sense of the human-being is vision. It canautomatically gain high
level understanding of images and videos. In the sameway, computer vision research area
focuses on understanding the content of imagesand videos by computers i.e, enabling
computers to see. Computer vision involvesautomatically extracting the information from
images or videos. This informationcan be collected from 2D or 3D models and detected
objects. Finally, recognitionis performed by grouping or tracking of collected information.
Computer vision isused in wide range of application fields due to their importance in real-
world. The applications include Optical Character Recognition (OCR), machine inspection,
3D model building, medical imaging, automatic safety, sports pose estimation, motion
capture, surveillance, finger print recognition, biometrics, etc. The typical security
surveillance uses multiple cameras and requireintense man-power for monitoring the content
of the video. The continuousmonitoring of multiple cameras by a human may cause error. On
the other hand,smart surveillance systems aim at automatically detecting moving objects in
videosand understanding the video content.
This thesis focuses on actions performed by a single person. An action isa
combination of multiple atomic actions. For instance, boxing, running, joggingand jumping
are actions; the atomic actions involved in boxing are boxing position,left hand forward, right
hand forward, right leg forward and left leg forward.Recognition of human actions find
importance in various application areas suchas video surveillance, sign language processing,
video indexing and sports videoanalysis. Automatic human activity Recognition is a
challenging problemin computer vision. One of its main goals is the understanding of the
complexhuman visual system and the knowledge of how humans represent the humanactivity
to discriminate different identities with high accuracy.This kind of systems, namely human
activity detection andrecognition of the detected human activity, must address two
fundamental andconceptually independent problems. Work on the recognition stage takes
thedetected human activity values as input to the algorithm. This stage can be separated in

two steps: feature extraction,where vital information for discrimination is saved, and the
matching step,where the recognition result is given with the aid of a human activity database.
1.2. Human Activity Recognition
The research in the fields of computer vision has dealt with the analysis of the
temporal sequence of image data. The vast availability of low cost and high-quality digital
cameras with massive storage capabilities, and high-speed bandwidths has all contributed to
realize the role of video processing in various application domains. Also, human biological
visual systems are well managed to handle spatiotemporal information, which is yet another
reason for the popularity of video data analysis. Video-based tracking and analysis help to
observe the motion patterns in a non-intrusive way. This task must be well modeled when
dealt with human subjects. Video surveillance systems play a significant role in the
circumstances where continuous observation by visual analysts is not possible. Typically,
visual analysts continuously monitor and collect data from various cameras and report to the
authorities when the necessity arises. However, it is manpower intensive. Hence, it is
necessary to build an automatic human action recognition system and build a higher-level
behavior modeling for the events occurring in the scene.
In the computer vision community, ‘Action’ and ‘Activity’ are frequently
interchangeable terms. In this thesis, ‘Action’ is defined as simple motion patterns which are
usually exhibited by a single person for a concise duration. Examples of activities include
running, walking, and waving. On the other hand, ‘Activity’ is defined as a complex sequence
of actions where several humans are involved, and they interact with each other in a
constrained manner. Activities are typically characterized by much longer temporal durations.
Examples of activities are two persons shaking hands, a football team scoring a goal and a
coordinated bank robbery multiple people. Recognition of human actions from multiple
views needs high-level interactions with broad applications such as video surveillance,
human-computer interaction, motion analysis, video indexing, and sports video analysis. The
challenges of human action recognition come fromdifficulties such as significant intra-class
variance, scaling, occlusion, and clutter. Human action recognition has been extensively
researched through methods based on local and holistic representations.
HAR is a highly dynamic system that finds the actions of persons based on videos
acquired from observation and information about the context of the monitored activities. An
activity is identified irrespective of its environment and the person who performs the action.

In HAR process, different steps are involved in gathering data from raw data to identification
of the performed activity. The essential objective of HAR systems is to monitor and examine
human activities for recognizing the ongoing events. HAR systems recover and develop
contextual data with the aid of visual and non-visual sensory data to recognize the human
behavior. A dictionary-based approach was implemented for sparse characterization of noisy
and radio frequency identification (RFID) streaming signals. The human actions are detected
with better representation of activities using optimal data. This is followed by orthogonal
matching pursuit to address sparse optimization problems. In addition, features are mined
from unrefined signal strength stream with the aid of rank-based feature selection method.
However, robust features are not extracted for achieving effective activity recognition.
Generally, HAR is classified into the following three stages of representations:
a) Low-level core technology
b) Mid-level human activity recognition systems
c) High-level applications
Low-level Core Technology
The main processing stages involved in low level core technology are
 Object segmentation
 Feature extraction and representation
 Activity detection and classification algorithms
Initially, human object is segmented and their features are extracted from the video
sequence as a set of characteristics. Then, categorization approach is implemented on mined
features for recognizing the different human actions.
Mid-level human activity recognition systems
The second level of human activity recognition systems includes the following three
significant tasks.
 Single person activity recognition
 Multiple people interaction and crowd behavior
 Abnormal activity recognition
High-level applications

The application in which results acquired through processing of raw data are
explained in high-level applications of HAR systems.
1.2.1 Local Representations
Methods based on local representation, also known as local methods, encode a video
sequence as a collection of local spatio-temporal features (local descriptors). These local
descriptors are extracted from spatiotemporal interest points (STIPs) which can be sparsely
detected from video sequences by detectors. In contrast to holistic representations of human
actions, local methods enjoy many advantages.
 Avoidance of some preliminary steps, e.g., background subtraction and target tracking
required in holistic methods.
 Resistance to background variation and occlusions.
However, local representations also have disadvantages, of which a fundamental
limitation is that it can be too local, as it is not possible to capture adequate spatial and
temporal information. Local descriptors can also be obtained from trajectories. It indicates
that it is enough to distinguish human actions by the tracking of joint positions. One of the
advantages of using trajectories is being discriminative. Nevertheless, the performance of
trajectory-based methods depends on the quality of these trajectories, and in practice
extracting trajectories from video sequences would be computationally expensive. To obtain
the final representation of an action, the Bag-of-Words (BoW) model has been widely used
and has achieved excellent results in human action recognition tasks. The BoW model is
based on mapping local features of each video sequence onto a pre-learned dictionary, which
unavoidably introduces quantization errors during its creation. The errors would be
propagated to the final representation and harm the recognition performance. Additionally,
the size of this dictionary needs to be empirically determined, and code words, i.e., the cluster
centers, obtained by k-means, gather around dense regions of local feature space, resulting in
less effective code words of action primitives. Sparse representation has recently been
introduced for action representation based on local features.
1.2.2 Holistic Representations
Methods based on holistic representation also called as global methods, treat a video
sequence rather than applying sparse sampling using STIP detectors or extracting trajectories.
In holistic representations, spatio-temporal features are directly learned from the raw frames

in video sequences. Holistic representations have recently drawn increasing attention because
they can encode more visual information by preserving spatial and temporal structures of
actions occurring in a video sequence. However, holistic representations are highly sensitive
to partial occlusions and background variations. Additionally, they often require
preprocessing steps, such as background subtraction, segmentation, and tracking, which
makes it computationally expensive and even intractable in some realistic scenarios.
1.3. Challenges of Automatic Human Activity Recognition
Some of the major Challenges related to HAR in videos are summarized as follows:
• Spatial variations
Recognition of human actions in the presence of spatial variations such as translation,
scaling (Variation in sizes) and rotation is challenging.
• Lighting variations
Rapid changes in lighting cause erroneous human recognition both in indoor and
outdoor videos. In indoor scenario, abrupt switch on or off of light may lead to difficulty in
recognition. In outdoor scenario, drastic change of weather, which may include changes in
lighting, sudden rain, cloudy weather, etc. may source false positive object recognition.
• Occlusions
Occlusions created when two or more objects appear in the same frame with an object
hiding the other introduce difficulty in recognition. The object of interest can be either
partially occluded or fully occluded. It affects the visibility of all the body parts of humans in
the video.
• Background clutter
Recognition of moving object in outdoor environment is hard. For instance, swinging
tree branches or leaves, birds in sky, clouds, waves, smog establish irregular movements or
periodic changes in the background.
• Moving shadow
Reflection of the object of interest in moving environment causes problem in HAR,
because shadow moves along with objects. Separating the shadow region from the object is
hard.

• Appearance of moving object
The appearance of the human may be transformed when the person is in motion. For
instance, the front view of the human is different from the top view or side view. Appearance
of moving human may be different.
Automatic human activity Recognition is a challenging problem in computer vision.
One of its main goals is the understanding of the complex human visual system and the
knowledge of how humans represent the human activity to discriminate different identities
with high accuracy. This kind of systems must address two fundamental and conceptually
independentproblems such as Human activity detection and Recognition of the detected
human activity. The focus of this work is on the recognition stage, i.e. taking the detected
human activity as the input to the algorithm. This stage can be separated into two steps:
1. Feature extraction, where vital information for discrimination is saved, and
2. Matching step, where the recognition result is given with the aid of a human
activity database.
Vision is the most superior of human senses, and it is no surprise that images and
videos convey more critical information during human perception. Image processing paved a
broad spectrum of application fields due to their variety of light energy, namely visible,
Ultraviolet, X-rays, Gamma rays, Infrared, Microwaves, and radio waves. In general, image
processing is distinguished as low level, mid-level, and high-level processing. The lower
level processing, also termed as image preprocessing, involves primitive operations on
images such as noise removal, contrast enhancement, and image sharpening. The mid-level or
intermediate processing on images involves tasks such as segmentation, object representation,
description, and classification. The higher-level processing involves image recognition,
image understanding, or computer vision.
In this context, HAR has gained a lot of attention in recent times and has potential
applications in various areas.
1.4. Application of HAR
The remarkable perspective in research on HAR becomes evident when looking at the
applications that may be benefited from the research works in this domain. Some application

areas that highlight the potential impact of vision-based action recognition system are listed
in Table 1.1 and discussed in this section.
Table 1.1 Applications of Human Motion Analysis.
S.No. Domain Filed of application
1. Kinesiology
Clinical orthopedic studies Biomechanics Person-
centric learning
2. Computer Graphics and Animation
Virtual reality Games Animated movies
Teleconferencing Games
3. Behavioral Biometrics
Gesture driven control Gait analysis Human
behavior learning
4. Content-based Video Analysis
Video indexing, Video retrieval Video
summarization, Storyboarding Sports video
analysis
5. Surveillance and Security
Parking lots, Anomaly detection Vehicle tracking
Human action recognition
6. Human Computer Interaction
Human Computer Interface(HCI) Ambient
intelligence Expression recognition
i. Kinesiology: Kinesiology, the study of biomechanics, involves the development
of human body models aiming to improve the efficiency of human movement.
This is widely used in medical studies of orthopedic patients, and such studies
need detailed information about the movement of body parts and joints. This
information is gathered in an intrusive way by placing retro-reflective markers or
light emitting diodes (LED) on the human body.
ii. Computer Graphics and Animation: HAR can be used to develop a high-level
description of movements of dance, ballet, and sports field. This has been used to
study and synthesize realistic motion patterns of virtual world humans.
iii. Behavioral Biometrics: Biometrics involves the study of approaches and
algorithms for uniquely recognizing humans, based on human physical and
behavioral cues. The appearance-based tracking approaches solve the problems of
gait recognition and gesture recognition. The physical body parts are marked, and
their motion trajectories are mapped to identify the human. This domain has a

broad spectrum of potential applications which have been discussed by
researchers in the fields of hand gesture-driven control of robots, human gait,
abandoned luggage detection, emotion recognition, and behavior learning.
iv. Content-based Video Analysis: Since internet service providers (ISP) face
persistent growth, it has become compulsory to develop efficient indexing and
storage schemes to improve the user feedbacks for content-based video searching.
This involves the learning of flow patterns from raw video and summarizing the
content of the video. Discussed the enhanced content-based video summarization
with corresponding advances in content-based image retrieval. The standard
approach for indexing and retrieval is to query the video database using semantic
action descriptors like ‘videos where the person kicks somebody.’ So, the retrieval
is the kind of process to deliver the analogous videos according to action-based
features in the query video.
v. Surveillance and Security: Video surveillance and related security systems
mostly rely on multiple video cameras, monitored by a human operator who
should know the activity of interest in the field of view. The monitoring efficiency
and accuracy of the human operators can be stretched for increasing the number of
cameras and their deployments. Hence, security agencies are seeking vision-based
solutions which can replace or assist a human operator.
Many researchers concentrated on anomalies in automatic recognition problem.
Generally, the existing surveillance techniques performed a post operation task
and detected the abnormal events based on the actions that have been committed.
These techniques, many times, need manual intervention for real-time detection.
However, activity detection systems incorporated with ‘Smart Surveillance
Systems’ (SSS) in real time detection are also possible without manual
intervention and are used in banks, ATM centers, parking lots, supermarkets and
department stores. So, the conflicts in such surveillance applications are the main
factors of SSS which affect the privacy of human beings.
vi. Human-Computer Interface: Understanding the interaction between a computer
and a human remains one of the enduring challenges in designing human-
computer interfaces.
1.5. Recognition of Actions in Human

Activity recognition is a new ground for the development of robust machine learning
techniques, as applications in this field typically require to deal with high-dimensional,
multimodal streams of data that are characterized by a large variability (e.g., due to changes
in the user's behaviour or as a result of noise). However, unlike other applications, there is a
lack of established benchmarking problems for activity recognition. Typically, each research
group tests and reports the performance of their algorithms on their datasets using
experimental setups specially conceived for that specific purpose.For this reason, it is
difficult to compare the performance of different methods or to assess how a technique will
perform if the experimental conditions change (e.g., in case of sensor failure or changes in
sensor location). We intend to address this issue by setting up a challenge on activity
recognition aimed to provide a common platform that allows the comparison of different
machine learning algorithms on the very same conditions.
This is called for methods for tackling critical questions in activity recognition such as
classification based on multimodal recordings, activity spots, and robustness to noise —
methods in activity recognition using semantic features. Unlike low-level features, semantic
features describe inherent characteristics of activities. Therefore, semantics make the
recognition task more reliable, especially when the same actions look visually different due to
the variety of action executions.This defines a semantic space, including the most popular
semantic features of action, namely the human body (pose and pose let), attributes, related
objects, and scene context. This work presents methods exploiting these semantic features to
recognize activities from still images and video data as well as four groups of activities:
atomic actions, people interactions, human–object interactions, and group activities.
1.6. Motion Analysis
The conventional Human Motion Analysis (HMA) system can detect, track, and identify the
moving humans in a video sequence and may include recognizing their action. The types of
interaction of the HMA system with the environment are as follows:
 Passive - It captures merely and stores the visual information in an organized fashion
without performing any analysis.
 Active - It controls and adjusts the acquisition device parameters, namely pan, zoom,
and tilt effects, depending on the external environment conditions.
The HMA typically has the following three necessary steps:

I. Detection: It involves finding the answer, ‘Is there motion (corresponding to a
human) present in the scene?’ and essentially requires low-level processing of images.
II. Tracking: It answers the question: ‘Where is the human moving?’. The tracking is
of significant importance to HMA. It needs history to be maintained for action
recognition, and it involves the mid-level processing on the history of images.
However, sometimes there may be considerable overlap between detection and
tracking algorithms.
III. Action Recognition / Behavior Understanding: It is a high-level vision step, which
involves interpreting the information derived in the steps as mentioned above in order
to answer the question ‘What is the human doing?’
1.6.1 Properties of Shape Features
The machine vision has super excellent performance towards complicated problem solving
with small complexity. Also, choosing appropriate features for a shape recognition system
must consider the kind of features suitable for the task.
The techniques to describe the shape of a deformable object have been extensively studied by
researchers for the past few decades compared many shape representation techniques to
characterize a deformable object. These techniques belong to the following three main
categories:
(i) Boundary-based and region-based methods
(ii) Spatial domain and frequency domain methods
(iii) Information preserving and non-information preserving methods
In a boundary-based shape description, the shape boundary points are only used to represent
the shape. However, in the region-based methods, the combination of boundary and interior
points are used to represent the shape.
In the spatial domain methods, the point feature basis helps to compare two shapes, and the
vector basis is used in frequency domain methods. The information preserving method
provides an accurate shape reconstruction from the shape descriptors, and the non-
information preserving methods offer only partial reconstruction with a compromise on the
identifiable property of shape features.
The identified shape features should have the following essential properties irrespective of
the description strategy:

 Identifiable: The shapes which are similar to human perception should have
the closest feature correspondence.
 Spatial invariance: The spatial transformations such as translation, scaling, and
rotation should not affect the extracted features.
 Noise resistance: Features must be as robust as possible against noise, i.e., the
noise strength of specific range should not affect the pattern.
 Occultation invariance: The original shape features should not be affected
during the occultation.
 Statistically independent: Two distinct features must be statistically
independent and must ensure the compactness of the representation.
 Reliable: To ensure minimum intra class variability.
The widely used methods for human shape representation in HMA are as follows:
 Bounding Box - This is the smallest rectangle that contains every point in the
object shape. The shape can either have a global bounding box or a collection
of bounding boxes which describe the human body parts, namely head, torso,
legs, hands, and feet. However, this representation conveys the coarse features
only.
 Moments - The magnitude of a set of orthogonal complicated moments of the
image known as Zernike moments is most suitable for shape similarity-based
image retrieval in terms of compact representation, robustness, and retrieval
performance But, the presence of many factorial terms of Zernike moments
leads to complex Computation.
 Chain code - introduced the chain code approach, which represented the
complete boundary description of the shape that ensures retaining of the shape.
The chain code cannot tolerate scale and rotation variations, which are
common problems that occur in object tracking.
 Fourier descriptors – The high-frequency components of Fourier descriptor
are associated with corners, and low frequencies are associated with shape’s
border. The Fourier descriptor is used to transform a set of boundary pixels
from the spatial domain representation into the frequency domain. The Fourier
descriptors are also invariant to scale rotation and starting point of the shapes.

So, the accuracy of any pattern recognition system is based on the selection of appropriate
shape feature.
1.7. Video Sequence analysis
A sequence of images displayed at a specific rate of frequency (frame per second) is termed
as a video sequence. It consists of information in the form of spatial changes concerning time
and if someone wants to extract the information from a video sequence, then the spatial
change concerning time must be perceived.
For the video sequence analysis (VSA), the knowledge of core technologies of digital image
processing is the fundamental requirement, which includes image enhancement, image
segmentation, morphological operations, feature extraction and representation, image
classification.
The main aim of this analysis is to automatically detect and determine the spatiotemporal
event in the video signal and further to interpret the nature of the event. In this analysis, the
identification of appropriate objects/regions in the scene (segmentation), and most detailed
characteristics of each object/region in the entire scene (feature extraction) is extracted. In
both the cases, the segmentation, and feature extraction the spatial and temporal dimensions
must be taken into consideration for the effective representation of object/region.
Feature extraction is useful for both coding and indexing, and the adequate coding parameters
for each of the object/region are set. A similar process is used for segmentation, to identify
the objects/regions, which permits and distinct coding and unwraps enhanced interaction
possibilities. Similarly, the capability to define content in an object/region-based technique
increases the plenty of the description.
Since last few decades, the VSA using intelligent techniques has emerged as a promising and
exciting area of research in the field of computer vision and image processing due to the
critical issues and numerous applications of this analysis.
1.7.1. Applications of VSA
There are numerous applications of VSA, and these are broadly categorized as:
 Entertainment: One of important application which is directly related to the daily
life of human beings like television (TV), movies, high definition television
(HDTV) transmission, video games, live streaming of sports analysis.

 Commercial: VSA can be used for commercial purposes such as smart CCTV,
advertisement of the product, in the retail industry for tracking of shoppers inside
the store.
 Security and Surveillance: Most widely used application for the security and
safety purpose by military and police. It can be monitoring of crowd behavior at
public functions, terrorist activities at public places like airports, railways stations,
bus stands, robbery detection, home intrusion system.
 Human-Computer Interaction: Nowadays, the interaction of human with the
machine is increasing rapidly due to the advancement of VSA technologies. The
traditional way of interaction of human with the machine is through remote
control, keyboard, mouse, joystick. But in the coming years, these modes of
interactions may become obsolete due to the invention of various recognition
systems based on body pose, hand gesture, facial expression.
 Motion Analysis: There are a variety of systems, where VSA is used to detect and
determine the motion of the object. The effective detection, tracking, and
recognition of an object leads to several critical applications like human activity
recognition system for detecting various kind of human abnormal and regular
activities, object tracking, intruder detection, and industrial monitoring.
As it is highlighted that VSA has several applications in the various fields of science and
technology, but the focus of this research is to design and develop a novel VSA algorithm for
human activity recognition (HAR) system.
1.7.2. Challenges in VSA
In the VSA, there are numerous factors which limit the performance of VSA system and these
are the recording settings, illumination variations, camera motion, viewpoint variations, the
complexity of the background, the similarity between foreground and background object,
high dimensionality, and redundancy of the data. All these factors provide an open challenge
to the researchers/technocrats, to design and develop such an algorithm which can deal with
these issues.
The environmental conditions play a significant role when the recording/acquisition of the
video signal is done because the performance of vision based system is highly dependent on
the weather conditions. The captured video signal in adverse environmental conditions leads
to poor quality of video signal. In the poor quality of video signal, the object and background

of the scene may not be discernible, and due to this, the following task (segmentation, feature
extraction) associated in video sequence analysis system leads to worst performance.
Hence, proper illumination is needed to acquire a good quality video signal, where object and
background must be discernible. The motion of the camera creates a blurred object in the
scene, and due to this, the new de-blurring algorithm is needed to de-blur the object. Hence,
to avoid this, there must be a proper installation of the camera.
The complexity of the background introduces the problem of extracting an object from the
scene. Due to the cluttered background, the object (foreground) and background may have
more similarity, and due to this, the accurate segmentation of the object may not be possible.
Hence, it is worthwhile that the recording of the scene must be performed based on the
application to avoid these issues.
1.8. HAR Identification Using Artificial Intelligence
Human action recognition and analysis one of the most active topics in computer vision, has
drawn increasing attention and its applications can be found in video surveillance and
security, video annotation and retrieval, behavioral biometrics and human-computer
interaction.
The term ‘Action’ refers to simple motion patterns usually executed by a single person and
typically lasting for short durations of time, on the order of tens of seconds. Examples of
actions include bending, walking, swimming, and so forth. The goal of action recognition is
to analyze ongoing actions from an unknown video automatically. In the last few decades,
action recognition has been extensively researched while there is still a long way to go for
real applications.
Human action recognition is a critical component of visual surveillance systems for event-
based analysis, and modeling human action is a challenging task for recognizing the
activities. The conventional human action recognition system for video surveillance has the
following major subsystems:
(i) Background subtraction
(ii) Feature extraction
(iii)Posture estimation
(iv)Action recognition

Background subtraction is a simple solution in motion segmentation. A static image without
any object of interest (OOI) would be considered as the background image. The difference in
pixel level between the successive frames and the background image provides motion
information.
Feature extraction, by human perception, is a very complex task in shape recognition. The
selection of appropriate shape feature would be considered for improving the accuracy of any
pattern recognition system, and detailed experimental analysis could be used to measure the
suitability of the feature to obtain the proper outcome.
Human posture refers to the arrangement of the body and its limbs. This is one of the critical
aspects of analyzing human behavior. Likewise, implementation of action recognition system
meets many challenges at each subsystem. The background subtraction suffers from clutter,
illumination changes, and camera movements, and the existence of either noise or partial
occlusions during tracking affects the feature extraction stage.
The typical task of a HAR system is to detect and analyze human activity in a video
sequence. The challenges in vision-based HAR systems. Several factors that make the task
challenging are the variations in body postures, the rate of performance, lighting conditions,
occlusion, viewpoint, and cluttered background. A sound HAR system can adapt to these
variations and efficiently recognizes the human activity class.
The essential steps are involved in HAR systems are usually: a) Segmentation of foreground
b) Efficient extraction and representation of feature vectors, and c) Classification or
recognition. A practical and novel solution can be proposed at any step of the work
individually or collectively for all the steps. Due to the variation in human body taxonomy
and environmental conditions, every step is full of challenges, and therefore, one can only
provide the best solution in terms of recognition accuracy and processing speed. The shape
and motion feature-based descriptors are two widely used methods in HAR systems.
The silhouette of the human body generally represents the shapebased descriptor, and
silhouettes are the heart of the activity. Motion-based descriptors are based on the motion of
the body, and the region of interest can be extracted using optical flow and the pixel-wise
oriented difference between the subsequent frames. The motion-based descriptors are not
efficient, especially when the object in the scene is moving with variable speed. The
subsequent section highlights the details of issues related to HAR system and its
functioning.The changes in camera viewpoint, anthropometry (body shapes and sizes of

different actors), different dressing styles, actions’ execution rate and styles of actor’s and so
forth exhibit large intraclass variability while recognizing the actions.
Human Behavior Understanding
Automatic analysis in surveillance scenarios failed to fulfill the real-world need at the action
recognition stage alone, which requires the detection of abnormal human actions based on the
periodicity in their behaviour patterns. Abnormal actions are context dependent, that includes
infrequent posture patterns.
For example, in a supermarket, a person walking from one point to another is ‘normal’
whereas running is ‘abnormal’ behavior. In a visual surveillance scenario, identifying the
occurrence of possibly dangerous actions helps the visual analyst to minimize the manual
effort and mistakes.Hence, the behavior understanding, along with action recognition, could
be used for surveillance purpose. Thus, combining human action recognition and behavior
understanding model is necessary for obtaining practical video data analysis.
Human activities are present during comparatively longer period of time than sensors’
sampling rate. A specific sample on a time instant does not offer adequate data for explaining
the performed activity. The activities are required to be identified in a time window basis.
Therefore, feature extraction approaches are utilized for acquiring significant information and
quantitative measures. The integration of original features is a substitute method for choosing
a division of significant features. Feature extraction is the method of integrating the original
feature set for describing a new significant feature set.
In order to perform effective human action recognition, PTD was designed from 3D joint
locations sequence. A hierarchical temporal dividing algorithm is developed to subdivide a
sequence into compact sub-sequences for encoding the temporal information. Through joint
positions, posture feature and dynamical tendency discriminative features of human posture
are obtained. Natural and intuitive dividing algorithms assist in encoding temporal dynamics.
Further geometric structures of action snippets are also maintained. But, the minimization of
computational complexity during the human activity recognition is not performed.
In other words, feature extraction is the conversion of high-dimensional data into a significant
illustration of minimum dimensionality data. The key benefit of feature extraction is that it
assists in categorization and visualization of high-dimensional data.
Feature extraction from time series data is executed in the following two ways:

 Statistical Techniques
 Structural Techniques
Statistical Techniques employ quantitative features of data to mine features while structural
techniques consider the interrelationship among data. Depending on the nature of specified
signals, these approaches are selected. A Hierarchical Spatio-Temporal Model (HSTM) was
developed by for addressing implementation issues of spatial and temporal restraints. HSTM
is a two-layer Hidden Conditional Random Field (HCRF) which assists in identifying spatial
associations in each frame. Then, bottom-layer HCRF identifies discriminative
representations and top-layer HCRF.
1.8.1. Machine Learning based Model
Machine learning performance was examined for recognizing hand gestures. Gesture
recognition accuracy is increased by using normalization approach. A non-parametric
Wilcoxon signed-rank test is applied to investigate the gesture recognition accuracies. Then,
accuracies achieved with the aid of normalization to AUC-RMS value method are also
evaluated. Classification accuracy is increased and applications of bio-medical are also
supported. However, complex hand gestures are not recognized efficiently using machine
learning techniques. The learning approaches employed in machine learning are classified
into the following three categories:
 Supervised
 Unsupervised
 Semi supervised
Supervised Learning
In supervised learning, a set of models of normal or abnormal behavior are constructed
depending on labelled training samples. The video samples which do not suit for any model
are categorized as abnormal. This type of learning approach is restricted to only known
events which necessitate adequate training data. But, real world video samples comprise
actions that are not infrequent which leads to the lack of adequate training samples.
k-Nearest Neighbors (k-NN) is one of the supervised classification approach employed for
direct classification of human activities without a learning process. K-NN necessitates only
the data storage space. K-NN algorithm employs the principle of similarity between the
training set and new data for categorization. The new data is allocated to the common class

by means of majority vote of k nearest neighbors. Then, the distance of neighbors is
evaluated by using a distance measurement termed as similarity function or Euclidean
distance.
Unsupervised Learning
Instead of depending on labelled training information for identifying the decision boundaries
among classes, unsupervised learning designs employs the underlying organization of
unlabelled data to carry out processes like clustering or dimensionality reduction.
Markov chain characterizes a discrete time stochastic process which influences a finite
number of states so that present state depends on previous state. Hidden Markov Model
(HMM) is employed in two-level classification methods for differentiating various daily
living actions. During human activity recognition, each activity is characterized with a state.
A Markov chain is designed to construct sequential data and employed in general model
called HMM. HMM presumes that the observed sequence is managed by a hidden state
sequence. When HMM trained, frequent sequences of activities are estimated with the aid of
Viterbi algorithm. However, HMMs fails in ensuring the convergence to global minimum and
initialization of EM algorithm is to be considered. HMM is also trained with the aid of the
posterior probabilities for restricting the complexity issues.
Semi-supervised Learning
Semi-supervised learning integrates a small amount of labelled training data with huge
volumes of unlabelled data for executing tasks such as categorization or ranking to offer
information about a sector’s structure. Then, labelled data is employed for achieving exact
classification/regression outcomes.
An activity recognition approach was developed by EneaCippitelli et al. (2016) for extracting
skeleton data with the help of RGBD sensors. On the basis of key pose extraction, a feature
vector is formed. Then, a multiclass Support Vector Machine is applied to perform
classification and a clustering approach established to carry out evaluation and involvement
of key poses. Activity recognition is improved but action segmentation and discovery of
unidentified activities are not efficiently performed.

1.8.1.1 Support Vector Machine
Support Vector Machines (SVMs) are one of the most significant machine learning
approaches used for analyzing the data and recognizing patterns. They are commonly
employed for categorization and regression. SVMs assist in estimating the optimal
hyperplane for dividing data into linear and non-linear classes. As SVMs belong to
supervised learning, training samples are employed. Every training sample is a pair of an
input object and a preferred output value. SVMs examine the training data and construct a
conditional function for exact estimation of class label to an unobserved input object. SVMs
is also applied in a number of pattern classification applications such as image detection, text
labelling, face recognition and faulty card discovery.
SVM are widely used in HAR by influencing kernel functions which process all the instances
to a higher dimensional space to identify a linear decision boundary for categorizing the data.
The placement-independent and subject independent trials are improved by avoiding different
placement and subject including orientation differences. Following it, detection performance
on placement-independent and subject-independent experiments is improved by using an
online independent support vector machine (OISVM) algorithm. CT-PCA scheme fails in
performing feature selection leading to decreased effectiveness of human activity recognition.
1.8.2 Deep Learning Methods
Deep learning methods are potentially more suited for human activity identification.
However, they fail in extracting robust features. A Long Short Term Memory (LSTM) and
Recurrent Neural Network increases accuracy and performance. In addition, LSTM Recurrent
Neural Network does not assume computational complexity.
Convolutional Neural Network Model
Convolutional Neural Network Model (CNN) is a machine learning technique of feed-
forward artificial neural network, employed in a number of applications in robotics, computer
vision and video surveillance. CNN approach assists in executing automatic feature
extraction with more exact outcomes. CNNs help in minimizing the connections and
parameters employed in artificial neural network model for flexible training. A CNN is a
collection of two layers such as,
 Convolutional Layers
 Pooling Layers

Convolutional Layers
Convolutional layers are employed as feature extractor for obtaining feature maps as input. A
shifting window is employed for convoluting each feature map to generate one pixel in an
output feature map. In addition, 3D convolutional layers are also employed for extracting the
motion data from various stacked frames.
Pooling Layers
The main objective of pooling layers is to minimize the spatial size of characterization that is
more effective to limited alterations in the position of features in the prior layer.
CNNs establish a degree of locality to the patterns matched in input data and allow a
translational invariance based on exact location of each pattern. The temporal convolution
layer matches to the convolution of input sequence with various kernels. Max-pooling finds
the maximum sequence and performs subsampling to implement translational invariance.
The, output of each max-pooling layer is converted by employing a ReLU activation
function. The CNN classifier includes a number of convolutional neural networks for mining
the robust features and a classifier for classification. Figure 1.4 shows a CNN Classifier
Framework for Human Action Identification.
As features are taken from the input and processed for performing feature extraction. Then,
convolution masks are executed to identify relevant discriminative local features through
subsampling and feature pooling for classification of activities. After the local features are
extracted, feature classification is performed through CNN classifier.
Convolutional Neural Networks (CNNs) to resolve the issues of user-independent human
activity recognition. An improved statistical feature is utilized by maintaining global features
of accelerometer time series. Feature engineering and data preprocessing are not necessitated
and also recognition interval was reduced. Mobile applications are also initiated in real time.
Due to the inefficiency in extracting discriminative feature, human activity recognition
accuracy is minimized.
CNN in Data Processing
The raw data is preprocessed and CNN is applied on the preprocessed data. Then, 2D
convolution is performed and activation function is employed for nonlinear processing. It is
followed by max pooling layer that is applied after every convolution layer for extracting

considerable features from sensor data. The output of convolution layer and pooling layer
characterize the features of input sensor data.
1.9. Visual surveillance System
Visual surveillance in machine understanding has been investigated worldwide during
the last few decades. Human motion detection & tracking from video imagery is one of the
most active research fields. The social interests in movement detection and tracking of people
have enormously increased in recent years with numerous applications like human computer
interface, surveillance, security, and many others. Moving object detection and tracking is a
challenging computer vision task consisting of two closely associated video analysis
processes. The first one object detection involves locating an image object in the frames of a
video sequence, while object tracking performs the monitoring of the video object temporal
and spatial changes during the sequence, including its presence, shape, size, position.
1.9.1 Need for Surveillance System
Computers have become essential in daily lives. They perform repetitive, data-intensive and
computational tasks, more accurately and efficiently than humans. It is very natural to try to
extend their capabilities to carry out more intelligent tasks for example analysis of visual
scenes or speech, logical inference and reasoning – in brief the high-level tasks that we,
humans, perform subconsciously hundreds of times every day with so much ease that we do
not usually even realize that we are performing them.
Privacy remains one of the ethically crucial issues of the information age, and it will remain
like this forever because of human nature. People desire the safety and the privacy, but our
actions and behaviors are defined by interests, which sometimes can be achieved only by
violating other humans safety or/and privacy.
To guard one's safety, thousands of surveillance cameras were installed all over the world.
Human operators are processing the information collected by the cameras inefficiently and
slowly. One possible solution for this issue is to build an automatic system, which would
fulfill the human operator’s job better and faster. This problem is difficult. First, the number
of people must be determined, and then their location needs to be estimated. Detecting people
and estimating their location is a challenging task as peoples move randomly, they wear
different clothes and appear in different gestures. This work is research on object tracking and

how it can be applied to surveillance footage. Surveillance networks are typically monitored
by one or more people, looking at several monitors displaying the camera feeds.
Each person may potentially be responsible for monitoring hundreds of cameras, making it
very difficult for the human operator to detect events as they happen effectively. During the
last few decades, there are attempts by various researches to design a better surveillance
system. However, the problems such as illumination change, motion blur complexity. Persist;
hence, there is vast scope to implement the optimal system. The evaluation of human motion
in image sequences involves different tasks, such as acquisition, detection motion
segmentation, and target classification.
1.9.2 Challenges in Surveillance System
Detecting and tracking people in scenes supervised by cameras is an essential step in many
application scenarios such as surveillance, urban planning or behavioural. The amount of data
produced by camera feeds is so large that it is also vital to be performed with the most
exceptional computational efficiency and often even real-time. Human motion segmentation
and tracking can be completed in two or three dimensions depending on the difficulty of
analysis, representations of the human body shape range from volumetric models to basic
stick figures.
Tracking depends on the correspondence of image characteristics between consecutive
frames of video, taking into concern information such as color, shape, position, and
consistency. Boundary segmentation can be performed by comparing the contrast or/and
color of neighbouring pixels, looking particularly for rapid changes or discontinuities.
However, motion segmentation & tracking is still an open and significant problem due to
dynamic environmental conditions such as illumination changes, shadows, waving tree
branches in the wind. and difficulties with physical changes in the scene. However, several
essential challenges still exist.
1.10 Internet of Things (IOT)
The internet of things, or IOT, is a system of interrelated computing devices, mechanical and
digital machines, objects, animals, or people that are provided with unique identifiers and the
ability to transfer data over a network without requiring human-to-human or human-to-
computer interaction.

Object detection is the main capability expected from a robot. After an image is acquired, it is
processed. Machine vision image processing methods include Thresholding, Pixel counting,
Segmentation, Edge detection, Neural network processing and Pattern recognition.
1.10.1. Benefits of IOT
The internet of things offers several benefits to organizations enabling them to:
 Monitor their overall business processes;
 Improve the customer experience;
 Save time and money;
 Enhance employee productivity;
 Integrate and adapt business models;
 Make better business decisions; and
 Generate more revenue.
1.10.2. Implementation of IOT-Based Smart video Surveillance System
Smart video surveillance is an IoT-based application as it uses the Internet for various
purposes, also providing more security by recording the activity of that person. While leaving
the premises, the user activates the system by entering a password.
System working starts with the detection of motion refining to human detection followed by
counting human in the room, and human presence also gets noticed to a neighbor by turning
on alarm. Also, notification about the same is sent to the user through SMS and e-mail.
1.11. Problem statement
Posture Tendency Descriptor (PTD) is implemented to conserve the geometric
organization of relevant action snippets within skeleton. Encoding of temporal data is
performed by using a dividing algorithm. Subsequently, an interpretable and discriminative
descriptor is implemented to represent one action snippet. Complete skeleton sequence is
integrated with different PTDs in a hierarchical and temporal order for distinguishing human
activities. This aids in attaining an efficient human recognition However, it fails in detection
of abnormal activities.
A low-cost, unobtrusive and robust system is established for supporting independent
living. The signal fluctuations are interpreted by radio-frequency identification (RFID)
technology and machine learning approaches. A dictionary-based approach is established to

restrict noisy, streaming and unstable RFID signals. This helps in recognizing informative
dictionaries of activities using unsupervised subspace decomposition. The embodied
discriminative data is employed in utilizing sparse coefficient features of recognized
dictionaries on activity detection task. In addition, activity representation also assisted in
attaining efficient activity recognition. A machine learning approach does not resolve
complex dynamic circumstances.
A deep learning model is designed for differentiating human activities without usage
of previous data. The accuracy of human activity recognition is enhanced by employing Long
Short-Term Memory (LSTM) Recurrent Neural Network. However, variance on
identifications and early stopping criteria is not considered in deep learning model. A user-
independent deep learning-based approach is established for online human activity
categorization. Local feature extraction is performed with uncomplicated statistical features
by using CNN to preserve the global form of time series data. Subsequently, consequences of
time series length on recognition accuracy are examined to perform constant real-time
activity classification. This deep learning-based approachdoes not require any manual feature
engineering. In addition, computational cost is also decreased. However, Scalability is not
significantly improved.
The above discussion reveals that there is a need of providing effective methods for
automated surveillance system in order to develop fast, reliable intelligent vision algorithms
that can track, follow and re-acquire arbitrary moving ground targets (e.g. people)
autonomously as well as satisfy high real-time requirement to generate high value
intelligence information to build a system that will automate the whole video processing. We
concentrate on unusual activity recognition and predicting their behavior (i.e. action) in video
surveillance system for indoor/outdoor public areas.
This objective of research is further divided into following subproblems.
(a) Eliminate moving cast shadow of the foreground object (human).
(b) Recognize human action/behavior analysis or action classification
(c) Detect unusual activity for single/multiple persons
(d) Comparing performance of implemented algorithms with existing solutions in literature
1.12. Motivation of the Research Work

Human Activity Recognition (HAR) is performed for automatic identification of
physical activities for addressing human action identification in mobile and ubiquitous
computing. HAR system implements classification processes for distinguishing various
human daily activities. Human activity identification and classification are essential in
applications such as, observation of elderly people, robotics, surveillance and tracking of
athletic actions. In other words, HAR aims to identify activities from a number of
observations on subject actions and environmental circumstances. Action recognition is one
of the essential factors for visual surveillance video retrieval and human-computer
interaction. Detection of Human activities is denoted as a three-level categorization of
inherent hierarchical structure for representing various levels. The bottom level represents
action primitives comprising highly complex human actions. The second level characterizes
simple actions. The last level indicates complex interactions of more than two persons and
objects.
Human Activity Recognition system is based on a hierarchical structure. The action
recognition module is implemented by reasoning engines for encoding the context of actions.
The increasing advances in hardware, computing and networking, smart phone and sensor
based HAR schemes assist in classifying different types of human activities in real time with
the aid of machine learning approaches. In addition, human activity detection also has
applications in healthcare, personal biometric signature identification and navigation.
1.13. Objective of the Work
The objective of this research work is to enhance the performance of HAR system. It
aims at reducing the computational time of the HAR system by selecting representative
framelets and improving recognition by extracting discriminative features that aid in
effectively recognizing the different actions which involve single person. Hence, this research
work can be applied in improving security in the surveillance environment. The main
objective of the research work is described as follows:
1. A hybrid approach combining the frameworks of both handcrafted features and
deep learning methodology to enhance the performance of HAR.
2. The major intention of the second research technique is to identify the human pose
accurately and efficiently. The technique majorly focused on the detection and classification
of human poses. The technique encompasses a series of processes namely pre-processing,

EfficientNetbased feature extraction, SSA-based hyperparameter tuning, and Modified SVM
(M-SVM) based classification.
3. Finally, noise removal methodology is proposed for both inertial sensor data and
video sensor data to remove different types of noises like salt and pepper noise, Gaussian
noise, outliers and blurring of the boundaries using different types of filters like MOSSE
filter, Kalman filter, Butterworth and J filter. By using this technique Peak of Side-lobe Ratio
(PSR) is reduced from the raw data.
1.14. Methodologies used in the work
In the first objective of the research work, the proposed study attempts to combine
both handcrafted models and deep learning methods since they possess distinct individual
merits and complement the effective and automated identification process. The models are
targeted at enabling lifestyle tracking on a real-time basis through smartphones. This
performance enhanced real-time analysis serves as an efficient and low-cost HAR model,
finding its applications in health care management. The automated learning methodologies in
HAR are either handcrafted or deep learning or a combination of both. Handcrafted models
can be regional or wholesome recognition models such as RGB, 3D mapping and skeleton
data models, and deep learning models are categorized into generative models such as LSTM
(long short-term memory), discriminative models such as convolutional neural networks
(CNNs) or a synthesis of such models. Several datasets are available for undertaking HAR
analysis and representation. The hierarchy of processes in HAR is classified into gathering
information, preliminary processing, property derivation and guiding based on framed
models.
The second methodology is focused on the Human pose estimation (HPE) that is a
procedure for determining the structure of the body pose and it is considered a challenging
issue in the computer vision (CV) communities. Despite the benefits of HPE, it is still a
challenging process due to the variations in visual appearances, lighting, occlusions,
dimensionality, etc. To resolve these issues, this second research work presents a squirrel
search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE)
technique. The major intention of the SSDCNN-HPE technique is to identify the human pose
accurately and efficiently. Primarily, the video frame conversion process is performed and
pre-processing takes place via bilateral filtering-based noise removal process. Then, the
EfficientNet model is applied to identify the body points of a person with no problem

constraints. Besides, the hyperparameter tuning of the EfficientNet model takes place by the
use of the squirrel search algorithm (SSA). In the final stage, the multiclass support vector
machine (M-SVM) technique was utilized for the identification and classification of human
poses. The design of bilateral filtering followed by SSA based EfficientNet model for HPE
depicts the novelty of the work.
Finally, it is noted that HAR models are developed in an unconstrained environment,
which have several limitations like Personal Interference (PI), Electromagnetic (EM) noise,
in-band noise or human movements and outliers involved while capturing the input data.
These noises are affecting the overall performance and robustness of the model. In order to
improve the model performance, noise removal techniques are introduced in this work.
Noises like salt and pepper noise, Gaussian noise, and blurring of the boundaries and outlier
treatment are processed for the hybrid data acquired using video and sensors in line of sight
mode in this paper. For removing these noises, a combination of filters like Kalman filter,
MOSSE filter, Butter-worth and J filter, are applied to the input data. By doing this noise
removal technique, it is observed that Peak to Sidelobe Ratio (PSR) is reduced from the raw
data. After removing these noises, features are extracted using the top layers of CNN
Inceptionv3 model for video data. Similarly, by using inertial sensor, features like tri-axial
accelerometer, gyroscopes and magnetometers are collected and a feature vector is created.
Pyramidal flow feature fusion (PFFF) technique is used to fuse the extracted features from
video and inertial sensor data. Finally, the fused features are given to the Support Vector
Machine (SVM) classifier to perform activity recognition.
1.15. Summary & Organization of the Thesis
This chapter presents the introduction of automatic human activity recognition,
Applications, Issues, and challenges of automatic human activity recognition applications.
Also, about security, while video surveillance system trust computation and trust
management systems. This chapter implicitly explains the need for security and automatic
detection using a trust computation among video surveillance in the automatic human activity
recognition. From this chapter, anyone can read and understand the functionalities of the
entire thesis concept in a better manner.
The remainder of the thesis is organized as follows:

Chapter 2 provides a comprehensive survey of the research already done by different
researchers on various aspects of HAR. It provides a detailed survey of the state-of-art
techniques in key frame extraction and feature extraction using hand-crafted features and
deep learning features. This chapter also brings out the direction of this research work.
Chapter 3 describes the framework of the HAR system that has been used in this
research work. In addition, this chapter provides automatic representative framelets selection
for HAR in videos, a contribution of this work for reducing the computational time.
Chapter 4 discusses the descriptor proposed for extraction of features that is
significant in discrimination of action recognition and the use of deep learning approach for
reducing the dimensionality.
Chapter 5 presents the contribution of this research work in feature fusion technique
which overcomes the limitation of high dimensionality, and also handles the view and scale
variance.
Chapter 6 presents the conclusion derived from the proposed works and future
enhancement.

Chapter 1_Introduction.docx

Recommended

Recommended

More Related Content

Similar to Chapter 1_Introduction.docx

Similar to Chapter 1_Introduction.docx (20)

Recently uploaded

Recently uploaded (20)

Chapter 1_Introduction.docx