Introduction
Computer vision is a field of artificial intelligence that trains computers
to interpret and understand the visual world.[1]
Using digital images from cameras and videos together with deep learning
models, machines can accurately identify and classify objects and then
react to what they see.[1]
One of the most popular applications of computer vision is Human Action
Recognition.
Human action recognition
Human action recognition is the process of identifying the human actions
that occur in video sequences.
Human activities, such as “walking” and “running,” arise very
naturally in daily life and are relatively easy to recognize.[1]
On the other hand, more complex activities, such as “peeling an
apple,” are more difficult to identify.[1]
Recognizing human activities from video sequences or still images is
a challenging task due to problems such as background clutter, partial
occlusion, and changes in scale, viewpoint, lighting, and appearance.[1]
Many applications, including video surveillance systems, human-computer
interaction, and robotics for human behavior characterization, require
a system that can recognize multiple activities.
Challenges
Different scales[2]
People may appear at different scales in different videos, yet
perform the same action.
Movement of the camera[2]
The camera may be handheld, and the person holding it can cause
it to shake.
The camera may be mounted on something that moves.
Movement with the camera[2]
The subject performing an action (e.g., skating) may be moving
with the camera at a similar speed.
Background “clutter”
Other objects/humans present in the video frame.
Human variation
Humans differ in size and shape.
Action variation
Different people perform the same action in different ways.
Solutions to the difficulties in Human Action Recognition
To overcome the problems mentioned above, a system consisting of
three components is required:[3]
Background subtraction
Human tracking
Human action and object detection
Background subtraction, in which the system attempts to separate
the parts of the image that are invariant over time (background) from
the objects that are moving or changing (foreground).
Human tracking, in which the system locates human motion over
time.
Human action and object detection, in which the system is able to
localize a human activity in an image.
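The first two components can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of intensities, models the background as a running average (one simple choice among many, not necessarily the approach of [3]), and uses the centroid of the foreground pixels as a crude stand-in for a human tracker. The function names and the `alpha` and `threshold` parameters with their default values are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities
Mask = List[List[bool]]    # True where a pixel is classified as foreground


def update_background(background: Frame, frame: Frame, alpha: float = 0.05) -> Frame:
    """Running-average background model: B <- (1 - alpha) * B + alpha * I."""
    return [
        [(1.0 - alpha) * b + alpha * p for b, p in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background: Frame, frame: Frame, threshold: float = 25.0) -> Mask:
    """A pixel is foreground if it differs from the background by more than `threshold`."""
    return [
        [abs(p - b) > threshold for b, p in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_centroid(mask: Mask) -> Optional[Tuple[float, float]]:
    """Mean (row, col) of the foreground pixels; following it over frames is a crude tracker."""
    points = [(r, c) for r, row in enumerate(mask) for c, fg in enumerate(row) if fg]
    if not points:
        return None
    return (
        sum(r for r, _ in points) / len(points),
        sum(c for _, c in points) / len(points),
    )


if __name__ == "__main__":
    background = [[10.0] * 5 for _ in range(5)]  # static, uniform scene
    frame = [row[:] for row in background]
    frame[2][3] = 200.0                           # a bright "moving person" pixel
    print(foreground_centroid(foreground_mask(background, frame)))  # (2.0, 3.0)
    background = update_background(background, frame)  # absorb slow scene changes
```

The `alpha` rate controls how quickly slow changes (for example, lighting drift) are absorbed into the background, while `threshold` decides what counts as motion; both trade sensitivity against noise.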
Decomposition of human action recognition
The goal of human activity recognition is to examine activities from
video sequences or still images. Motivated by this, human activity
recognition systems aim to correctly classify input data into its
underlying activity category.[4] Depending on their complexity, human
activities can be decomposed into:
(i) Gestures
(ii) Atomic actions
(iii) Human-to-object or human-to-human interactions
(iv) Group actions
(v) Behaviors
(vi) Events
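To make the idea of decomposition concrete, the sketch below recognizes a higher-level activity as an ordered sequence of atomic actions, in the spirit of the rule-based methods covered under unimodal methods. The action names and rules are hypothetical, and the atomic actions are assumed to come from some lower-level recognizer.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical rules: each higher-level activity is an ordered list of atomic actions.
RULES: Dict[str, List[str]] = {
    "drinking from a cup": ["reach", "grasp", "raise hand to mouth"],
    "answering the phone": ["reach", "grasp", "raise hand to ear"],
}


def occurs_in_order(actions: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` appears in `actions` as an ordered (not necessarily contiguous) subsequence."""
    remaining = iter(actions)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step must be found after the earlier ones.
    return all(step in remaining for step in pattern)


def recognize(actions: Sequence[str]) -> List[str]:
    """Return every higher-level activity whose rule is satisfied by the atomic actions."""
    return [name for name, pattern in RULES.items() if occurs_in_order(actions, pattern)]


if __name__ == "__main__":
    observed = ["walk", "reach", "grasp", "raise hand to mouth"]
    print(recognize(observed))  # ['drinking from a cup']
```

Every rule here has to be written by hand, which is exactly why rule and attribute generation is a known difficulty of rule-based methods.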
Human Action Categorization
Human activity recognition methods can be divided into two main
categories:[5]
Unimodal activity recognition methods
Multimodal activity recognition methods
Each of these two categories is further divided into subcategories
depending on how the methods model human activities.
Unimodal methods
Unimodal methods represent human activities from data of a single
modality, such as images.
Unimodal approaches are appropriate for recognizing human
activities based on motion features.
However, recognizing the underlying activity class from motion alone
is a challenging task.
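As an example of a motion feature, the sketch below computes a motion history image (MHI), a classic space-time template in which recently moving pixels have high values and older motion fades. It is an illustration, not a method taken from [5]; it assumes binary motion masks (for example, from frame differencing) given as nested lists, a duration `tau`, and a decay of 1 per frame. The function names are hypothetical.

```python
from typing import List

Mask = List[List[bool]]  # True where motion was detected in the current frame
MHI = List[List[int]]    # motion history values in the range [0, tau]


def update_mhi(mhi: MHI, motion: Mask, tau: int) -> MHI:
    """Set moving pixels to tau; decay all other pixels by 1, never below 0."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, m_row)]
        for h_row, m_row in zip(mhi, motion)
    ]


def motion_history(masks: List[Mask], tau: int) -> MHI:
    """Fold a sequence of motion masks into a single motion history image."""
    rows, cols = len(masks[0]), len(masks[0][0])
    mhi: MHI = [[0] * cols for _ in range(rows)]
    for mask in masks:
        mhi = update_mhi(mhi, mask, tau)
    return mhi


if __name__ == "__main__":
    # A single moving point crossing a 1x3 strip from left to right.
    masks = [
        [[True, False, False]],
        [[False, True, False]],
        [[False, False, True]],
    ]
    print(motion_history(masks, tau=3))  # [[1, 2, 3]]
```

The increasing values encode where and in which direction motion happened, but nothing about appearance, which illustrates why motion alone may not be enough to identify the activity class.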
Space-time methods represent human activities as a set of
spatiotemporal features.
Stochastic methods recognize activities by applying statistical
models to represent human actions.
Rule-based methods use a set of rules to describe human
activities.
Shape-based methods efficiently represent activities with high-level
reasoning by modeling the motion of human body parts.[5]
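A common example of a stochastic model is the hidden Markov model (HMM). The sketch below trains nothing: it assumes one small, hand-specified HMM per activity over quantized motion observations (0 = little motion, 1 = strong motion, a hypothetical encoding) and labels a sequence with the activity whose model gives it the highest likelihood, computed with the forward algorithm. The activity names and all probabilities are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class HMM:
    start: List[float]        # start[s]    = P(first state = s)
    trans: List[List[float]]  # trans[p][s] = P(next state = s | current state = p)
    emit: List[List[float]]   # emit[s][o]  = P(observation = o | state = s)


def forward_likelihood(model: HMM, obs: Sequence[int]) -> float:
    """P(obs | model) via the forward algorithm (unscaled, so keep sequences short)."""
    n = len(model.start)
    alpha = [model.start[s] * model.emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * model.trans[p][s] for p in range(n)) * model.emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical activity models over observations 0 (little motion) and 1 (strong motion).
MODELS: Dict[str, HMM] = {
    "walking": HMM(
        start=[0.5, 0.5],
        trans=[[0.7, 0.3], [0.3, 0.7]],
        emit=[[0.2, 0.8], [0.1, 0.9]],
    ),
    "standing": HMM(
        start=[0.5, 0.5],
        trans=[[0.7, 0.3], [0.3, 0.7]],
        emit=[[0.9, 0.1], [0.8, 0.2]],
    ),
}


def classify(obs: Sequence[int]) -> str:
    """Pick the activity whose HMM assigns the observation sequence the highest likelihood."""
    return max(MODELS, key=lambda name: forward_likelihood(MODELS[name], obs))


if __name__ == "__main__":
    print(classify([1, 1, 1, 0, 1]))  # walking
    print(classify([0, 0, 0, 0]))     # standing
```

In practice the parameters would be estimated from labelled training sequences (for example with the Baum-Welch algorithm), and log-space or scaled forward variables are needed for long videos to avoid numerical underflow.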
Comparison of Unimodal methods
SPACE-TIME
Pros: Localisation of actions; 3D body representation; Good representation of low-level features; Detailed analysis of human movements
Cons: Sensitivity to noise and occlusion; Recognizing complex activities may be tricky; Feature sparsity leads to low repeatability; Gap between low-level features and high-level events
STOCHASTIC[6]
Pros: Modeling of human interactions; Recognition from very short clips; Camera motion handling
Cons: Label bias problem; Approximate solution; Large amount of training data required
RULE-BASED
Pros: Sequential activity recognition; High-level representation of human actions; Context-free grammar classification
Cons: Only atomic actions are required; Problem with long video sequences; Rule/attribute generation is difficult
SHAPE-BASED
Pros: 2D and 3D body representation; Recognition from still images; Upper body action recognition
Cons: Skeleton tracking inaccuracies; Sensitivity to illumination changes and human clothing; Large number of degrees of freedom
Multimodal methods
Combining data from different modalities is an important issue for
understanding the data.
In that context, audio-visual analysis is used in many applications
not only for audio-visual synchronization but also for tracking and
activity recognition.
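One simple way to combine modalities is late (score-level) fusion: each modality has its own classifier, and their per-class scores are merged afterwards. The sketch below assumes each classifier already outputs a probability per activity label and combines them with a weighted average; the function name, labels, scores, and weights are hypothetical. Early (feature-level) fusion, which concatenates features before classification, is the usual alternative.

```python
from typing import Dict

Scores = Dict[str, float]  # activity label -> probability from one modality's classifier


def late_fusion(per_modality: Dict[str, Scores], weights: Dict[str, float]) -> Scores:
    """Weighted average of per-class scores across modalities (weights are normalized)."""
    total = sum(weights[m] for m in per_modality)
    fused: Scores = {}
    for modality, scores in per_modality.items():
        w = weights[modality] / total
        for label, p in scores.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return fused


if __name__ == "__main__":
    scores = {
        "visual": {"clapping": 0.4, "waving": 0.6},  # hand motion alone is ambiguous
        "audio": {"clapping": 0.9, "waving": 0.1},   # the clapping sound is distinctive
    }
    fused = late_fusion(scores, weights={"visual": 0.5, "audio": 0.5})
    print(max(fused, key=fused.get))  # clapping
```

Here the audio cue resolves an action that the visual classifier alone would have mislabelled, which is the kind of complementarity multimodal methods rely on.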
Affective methods represent human activities according to emotional
communications and the affective state of a person.
Behavioral methods aim to recognize behavioral attributes and non-verbal
multimodal cues, such as gestures, facial expressions, and auditory cues.
Social networking methods model the characteristics and the behavior
of humans in several layers of human-to-human interactions in social
events from gestures, body motion, and speech.
Comparison of Multimodal methods[7]
AFFECTIVE
Pros: Association of human emotions and actions; Better understanding of human activities
Cons: Problems in handling continuous actions; Dimensionality of the different modalities
BEHAVIORAL
Pros: Recognition of human interactions; Psychological attributes improve recognition
Cons: Emotional attributes are difficult to specify; Complex classification model
SOCIAL NETWORKING
Pros: Abnormal activity recognition; Easy access to data through social platforms
Cons: Limited by the number of interacting persons; Difficulties in crowded scene modeling due to occlusion
Conclusion
A comprehensive study of state-of-the-art human action recognition methods has
been carried out.
Human action recognition methods are classified into two broad categories, unimodal and
multimodal, according to the kind of data each approach employs to recognize human activities.
Unimodal approaches were discussed together with an internal categorization of these
methods, which were developed for analyzing gestures, atomic actions, and more complex
activities, either directly or by decomposing activities into simpler actions.
Multimodal approaches were also presented for the analysis of human social behaviors
and interactions.
References
1) M. Vrigkas, C. Nikou and I. A. Kakadiaris, “A review of human activity recognition
methods,” Frontiers in Robotics and AI, 10.3389, 2015.
2) A. Alahi, V. Ramanathan and L. Fei-Fei, “Socially-aware large-scale crowd
forecasting,” in Proc. IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (Columbus, OH), 2211–2218, 2014.
3) Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid, “Label-embedding for
attribute-based classification,” in Proc. IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (Portland), 819–826, 2013.
4) J. K. Aggarwal and L. Xia, “Human activity recognition from 3D data: a
review,” Pattern Recognition Letters, 48, 70–80, 2014.
5) R. H. Baxter, N. M. Robertson and D. M. Lane, “Human behaviour recognition in
data-scarce domains,” Pattern Recognition, 48, 2377–2393, 2015.
6) P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid and J. Sivic, “Finding actors
and actions in movies,” in Proc. IEEE International Conference on Computer
Vision (Sydney), 2280–2287, 2013.
7) H. Chen, J. Li, F. Zhang, Y. Li and H. Wang, “3D model-based continuous
emotion recognition,” in Proc. IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (Boston, MA), 1836–1845, 2015.