Human Action Recognition Using RGB-D Sensor and
Deep Convolutional Neural Networks
Javed Imran
Department of Computer Science and Engineering
IIT Roorkee
Roorkee, India
javed.csit@gmail.com
Praveen Kumar
Department of Computer Science and Engineering
Visvesvaraya National Institute of Technology
Nagpur, India
praveen.kverma@gmail.com
Abstract— In this paper, we propose an approach to recognize human actions by fusing RGB and depth data. First, Motion History Images (MHIs) are generated from the RGB videos to represent the temporal information of each action. Then the original depth data is converted to 3D point clouds and rotated, and three Depth Motion Maps (DMMs) are generated over the entire depth sequence, corresponding to the front, side and top projection views. A four-stream deep convolutional neural network is trained, where the first stream classifies MHIs and the remaining three classify the front, side and top views generated from the depth data. The proposed method is evaluated on the publicly available UTD-MHAD dataset, which contains both RGB and depth videos. Experimental results show that combining the two modalities gives better recognition accuracy than using either modality individually.
I. INTRODUCTION
Human action recognition is one of the most important topics in computer vision. It has applications in security systems, human-computer interaction, video indexing and querying, content-based video analytics, web-video search and retrieval, biomechanics, monitoring and intelligent environments. Prior to the release of depth cameras, research on action recognition mainly focused on learning and recognizing actions from image sequences captured by traditional RGB video cameras [1]. With the introduction of low-cost depth sensors such as the Microsoft Kinect and ASUS Xtion, many researchers have turned to action recognition using depth information [2-4]. Depth cameras have several advantages over RGB cameras. First, their output is insensitive to changes in lighting conditions. Second, the 3D structure and shape information provided by depth maps makes it easier to deal with problems such as segmentation and detection.
Recently, deep convolutional neural networks (CNNs) have given state-of-the-art performance in image recognition, segmentation and classification tasks [5-6]. In this paper, we use a pretrained ImageNet model, because no RGB-D dataset large enough to train a deep CNN from scratch currently exists. This approach bears similarity to other multi-stream approaches [7-9]. As suggested in [8], we first rotate the 3D point clouds constructed from the original depth data to handle view invariance. These rotated depth frames are used to generate depth motion maps (DMMs) by accumulating motion energy in three projected views [10]. Motion History Images (MHIs) are generated from the RGB videos, where the intensity of each pixel is a function of the recency of motion in the sequence [11]. Four CNNs are trained separately, corresponding to the front view, side view, top view and MHI, and their results are fused to produce the final classification score.
The rest of this paper is organized as follows. Section II presents related work. Section III discusses implementation details, including MHI and DMM generation and the proposed 4-stream CNN architecture. Section IV describes the experimental results, and Section V concludes the paper.
II. RELATED WORK
Motion History Image based action recognition has been actively studied since Bobick and Davis proposed the Motion Energy Image (MEI) and Motion History Image (MHI) to recognize many types of aerobics exercises [11]. Although the MHI is inexpensive to compute, this template-matching approach is susceptible to noise and to variations in how different individuals perform the same action. In [12], Meng et al. combined the Motion History Image (MHI) and the Modified Motion History Image (MMHI) and used SVM_2K as a linear binary classifier. With the release of the Kinect camera in 2010, researchers shifted their focus to action recognition based on depth data. In [2], Li et al. construct an action graph to describe the pattern of an action; the graph consists of multiple nodes, each representing a set of salient postures characterized by a bag of 3D points. However, this sampling scheme is view dependent. In [3], view-invariant histograms of 3D joint locations (HOJ3D) are computed from depth sequences, reprojected using LDA, and clustered into posture visual words. Discrete hidden Markov models are then used to model the temporal evolution of these visual words. In [4], Depth
Motion Maps (DMM) are generated by projecting depth maps
onto three orthogonal planes. Histogram of Oriented Gradients (HOG)
features are then extracted from DMMs and classified using
linear SVM. In [10], normalized DMMs are generated by
absolute differencing between two consecutive depth maps
without thresholding, and an l2-regularized classifier is
employed for action recognition. In general, all the above-mentioned methods are based on hand-crafted features, which are either time consuming to compute or dataset dependent.
With the recent success of Convolutional Neural Networks
[5], deep neural architectures are widely used in the area of
image and video classification tasks [13, 7]. The availability of pretrained ImageNet models [14] further encourages researchers to apply them in the domain of RGB-D action and object recognition. In [15], Ji et al. used a 3-dimensional (3D) CNN model for action recognition, which extracts features from both the spatial and temporal dimensions by performing 3D convolutions. The first layer of their architecture was hardwired to encode prior knowledge about features, so it is not clear how the approach would perform on a new dataset. In [7], Simonyan and Zisserman proposed a two-stream CNN for action recognition in videos, exploiting the fact that a video can be decomposed into spatial and temporal components. Each stream is implemented as a deep ConvNet, and their softmax scores are combined by late fusion. In [9], two separate
CNNs are trained for RGB and depth images. A jet colormap is applied to the depth images to convert them into three-channel images, so as to make effective use of the ImageNet pretrained model. The last layers of both CNNs are concatenated into one fully connected layer, followed by a softmax classifier, for object recognition.
Our work bears the most similarity to [8], in which each depth map is converted to a 3D point cloud and rotated before three DMMs are generated for the front, side and top views. However, we also use RGB video as a second modality to generate MHIs. Four CNNs are then trained, and their scores are combined by late fusion to produce the final classification score.
III. MHI, DMM AND 4-STREAM CNN
In this section, we describe the generation of Motion History
Image from RGB videos and Depth Motion Maps from depth
data. We also discuss the input preprocessing, proposed CNN
architecture, network training and class score fusion.
A. Motion History Image
The concept of the Motion History Image was proposed by Bobick and Davis [11] in 2001. They first proposed the motion energy image (MEI), which captures where motion occurs in a video sequence in a single image. They then introduced the motion history image (MHI), which encodes the temporal information of the motion in the image plane: pixels where motion occurred more recently are brighter than pixels where motion occurred earlier.
The MHI $H(x, y, t)$ can be computed from an update function $D(x, y, t)$ [16]:

$$H(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\big(0,\, H(x, y, t-1) - \delta\big) & \text{otherwise} \end{cases} \qquad (1)$$

Here, $(x, y)$ is the pixel location, $t$ denotes time, $D(x, y, t)$ signals object presence (or motion) in the current video frame, the duration $\tau$ governs the temporal extent of the movement, and $\delta$ is the decay parameter.
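As an illustration, the update rule in Eq. (1) can be sketched in a few lines of Python/NumPy. This is a minimal sketch, not the authors' implementation; simple frame differencing with a fixed threshold is assumed as the update function D:

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=255.0, delta=16.0, thresh=30):
    """One MHI update step (Eq. 1): pixels with motion are set to tau,
    all other pixels decay by delta, floored at zero."""
    # D(x, y, t): simple frame differencing as the motion/update function
    d = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32)) > thresh
    return np.where(d, tau, np.maximum(0.0, mhi - delta))

def compute_mhi(gray_frames, tau=255.0, delta=16.0):
    """Accumulate the MHI over a grayscale video (sequence of 2D frames)."""
    mhi = np.zeros_like(gray_frames[0], dtype=np.float32)
    for prev, cur in zip(gray_frames[:-1], gray_frames[1:]):
        mhi = update_mhi(mhi, prev, cur, tau, delta)
    return mhi.astype(np.uint8)  # brighter pixels = more recent motion
```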
B. Depth Motion Maps
Yang et al. [4] proposed the concept of Depth Motion Maps (DMMs) to capture 3D structure and depth information. The depth images in the entire video sequence are projected onto three orthogonal planes, and the absolute differences between consecutive projected maps are accumulated to form three 2D depth motion maps. Before projection, the depth data is converted to 3D point clouds and rotated, as discussed in [8]. This handles the problem of view invariance and also provides a way to generate more
training data. Finally, the DMM is generated as follows [10]:

$$DMM_v = \sum_{i=2}^{N} \left| \mathrm{map}_v^{\,i} - \mathrm{map}_v^{\,i-1} \right| \qquad (2)$$

where $\mathrm{map}_v^{\,i}$ is the projected map of the $i$-th frame under projection view $v \in \{\text{front}, \text{side}, \text{top}\}$ and $N$ is the total number of frames in the sequence.
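Equation (2) can likewise be sketched in NumPy. The projection helper below is an assumption, not the authors' code: it builds the side and top views as binary occupancy maps by quantizing depth, as a rough stand-in for the orthogonal projections of [4], while the front view is the depth map itself:

```python
import numpy as np

def project_views(depth, bins=320):
    """Stand-in for the three orthogonal projections in [4]: front = the
    depth map itself; side and top are binary occupancy maps obtained by
    quantizing depth into `bins` planes."""
    rows, cols = depth.shape
    q = np.clip(depth.astype(np.int64) * (bins - 1) // max(int(depth.max()), 1),
                0, bins - 1)
    side = np.zeros((rows, bins), dtype=np.float32)
    top = np.zeros((bins, cols), dtype=np.float32)
    r, c = np.nonzero(depth)
    side[r, q[r, c]] = 1.0
    top[q[r, c], c] = 1.0
    return {"front": depth.astype(np.float32), "side": side, "top": top}

def compute_dmms(depth_frames):
    """Eq. (2): accumulate |map_v^i - map_v^(i-1)| over the whole sequence."""
    dmms = {}
    prev = project_views(depth_frames[0])
    for frame in depth_frames[1:]:
        cur = project_views(frame)
        for v in cur:
            dmms[v] = dmms.get(v, 0) + np.abs(cur[v] - prev[v])
        prev = cur
    return dmms  # three 2D depth motion maps: front, side and top
```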
C. Input preprocessing
The MHI and three DMMs (front view, side view and top
view) generated using the techniques discussed above are in
grayscale. So we colorize them into 3 channel RGB images so
as to fully utilize the power of CNN pre-trained on ImageNet.
For the MHIs, we apply five different colormaps: copper, hot,
pink, gray and bone available in Matlab R2015. The rotated
DMMs are colorized using the improved rainbow
pseudocoloring technique proposed in [9]. Fig. 1 and 2 shows
the result after coloring. The resulting images are then resized
to 224×224 so as to make them compatible with the pre-
trained ImageNet model.
(a) Original MHI
(b) MHIs obtained after applying different colormaps
Fig. 1. Examples of original and colored MHIs for Swipe-right action
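The colorization and resizing step can be sketched as follows; Matplotlib colormaps are assumed here as a stand-in for the Matlab R2015 colormaps named above:

```python
import numpy as np
from matplotlib import cm
from PIL import Image

def colorize_and_resize(gray, cmap_name="hot", size=(224, 224)):
    """Map a grayscale MHI/DMM to a three-channel RGB image via a colormap
    and resize it to the 224x224 input resolution of VGG-16."""
    g = gray.astype(np.float32)
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)   # normalize to [0, 1]
    rgb = (cm.get_cmap(cmap_name)(g)[..., :3] * 255).astype(np.uint8)
    return np.asarray(Image.fromarray(rgb).resize(size))
```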
D. Proposed CNN Architecture
In [6], Simonyan and Zisserman have shown that very high recognition accuracy can be achieved by using a very deep architecture with very small (3×3) receptive fields.
(a) Original front, side and top view DMM
(b) DMMs obtained after applying pseudocoloring
Fig. 2. Examples of DMM for Wave action
In addition, very deep models generalize well to other datasets. This motivated us to use their pre-trained VGG-16 model for training our network. VGG-16 consists of sixteen weight layers: thirteen convolutional layers and three fully connected layers. In our proposed architecture, four such networks are combined by fusing their softmax scores, as shown in Fig. 3. Two fusion schemes are analyzed: the average rule and the product rule.
E. Network Training & Class Score Fusion
Four CNNs are trained: one for the color-coded MHIs and the remaining three for the front, side and top view DMMs. A dropout layer with ratio 0.8 is added between the last two fully connected layers to avoid overfitting. The learning rate is set to $10^{-4}$, the weight decay to 0.0005 and the momentum to 0.9, with a batch size of 16. The entire network is trained using the MatConvNet toolbox [17] on a system with an NVIDIA Quadro K4200 GPU. During testing, the posterior probabilities produced by the softmax layers of the four CNNs are combined using the average and product rules.
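At test time the decision-level fusion is a one-liner per rule; a sketch, assuming each of the four streams outputs a vector of softmax posteriors over the action classes:

```python
import numpy as np

def fuse_scores(stream_probs, rule="product"):
    """Late fusion of softmax posteriors from the four streams
    (MHI, front DMM, side DMM, top DMM), each of shape (num_classes,)."""
    p = np.stack(stream_probs)                    # (4, num_classes)
    fused = p.prod(axis=0) if rule == "product" else p.mean(axis=0)
    return int(np.argmax(fused))                  # predicted class index

# e.g. pred = fuse_scores([p_mhi, p_front, p_side, p_top], rule="product")
```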
IV. EXPERIMENTAL RESULTS
Our proposed framework is evaluated on the publicly available UTD-MHAD dataset [18], which contains both RGB and depth data captured using a Kinect. It contains 27 actions, as shown in Fig. 4. The same experimental setting as in [18] is followed: data from subjects 1, 3, 5 and 7 are used for training, and data from subjects 2, 4, 6 and 8 for testing. The results are given in Table I, and the individual class accuracies are shown in Fig. 5.
Fig. 3. DMM and MHI based 4-stream deep CNN architecture for action recognition
Swipe-left Swipe-right Wave Clap Throw Arm-cross
Basketball-shoot Draw-X Draw-circle-CW Draw-circle-CCW Draw-triangle Bowling
Boxing Baseball-swing Tennis-swing Arm-curl Tennis-serve Push
Knock Catch Pickup-Throw Jog Walk Sit-to-stand
Stand-to-sit Lunge Squat
Fig. 4. Samples of UTD-MHAD dataset
Fig. 5. Class-specific accuracy on the UTD-MHAD dataset (using product-rule decision-level fusion)
TABLE I. COMPARISON OF RECOGNITION ACCURACY ON UTD-MHAD DATASET

Method                                            Accuracy (%)
C. Chen et al. [18]                               79.1
Bulbul et al. [19]                                88.4
Ours: Depth (avg. of front, side & top DMMs)      87.9
Ours: RGB (using only MHI)                        70.0
Ours: Depth + RGB (average rule)                  88.8
Ours: Depth + RGB (product rule)                  91.2
V. CONCLUSION
In this paper, we have presented a deep convolutional neural network based framework to classify human actions from RGB-D data. The experimental results on the UTD-MHAD dataset demonstrate that fusing different modalities can give better performance than using each modality individually. Our approach is also more robust and efficient than traditional hand-crafted feature extraction techniques, and state-of-the-art results can be achieved even on a small dataset by fine-tuning a pre-trained model such as VGG-16. In future work, we will incorporate other modalities such as the skeleton stream and address confusion between similar classes by applying Dempster-Shafer belief theory.
REFERENCES
[1] Aggarwal, Jake K., and Michael S. Ryoo. "Human activity analysis: A
review." ACM Computing Surveys (CSUR) 43.3 (2011): 16.
[2] Li, Wanqing, Zhengyou Zhang, and Zicheng Liu. "Action recognition
based on a bag of 3d points." Computer Vision and Pattern Recognition
Workshops (CVPRW), 2010 IEEE Computer Society Conference on.
IEEE, 2010.
[3] Xia, Lu, Chia-Chih Chen, and J. K. Aggarwal. "View invariant human
action recognition using histograms of 3d joints." Computer Vision and
Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer
Society Conference on. IEEE, 2012.
[4] Yang, Xiaodong, Chenyang Zhang, and YingLi Tian. "Recognizing actions using depth motion maps-based histograms of oriented gradients." Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012.
[5] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in
neural information processing systems. 2012.
[6] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[7] Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional
networks for action recognition in videos." Advances in Neural
Information Processing Systems. 2014.
[8] Wang, Pichao, et al. "ConvNets-Based Action Recognition from Depth
Maps through Virtual Cameras and Pseudocoloring." Proceedings of the
23rd Annual ACM Conference on Multimedia Conference. ACM, 2015.
[9] Eitel, Andreas, et al. "Multimodal deep learning for robust RGB-D
object recognition." Intelligent Robots and Systems (IROS), 2015
IEEE/RSJ International Conference on. IEEE, 2015.
[10] Chen, Chen, Kui Liu, and Nasser Kehtarnavaz. "Real-time human action
recognition based on depth motion maps." Journal of Real-Time Image
Processing (2013): 1-9.
[11] Bobick, Aaron F., and James W. Davis. "The recognition of human
movement using temporal templates." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 23.3 (2001): 257-267.
[12] Meng, Hongying, et al. "Motion history histograms for human action
recognition." Embedded Computer Vision. Springer London, 2009. 139-
162.
[13] Krizhevsky, Alex, and Geoffrey E. Hinton. "Using very deep
autoencoders for content-based image retrieval." ESANN. 2011.
[14] http://www.vlfeat.org/matconvnet/pretrained/
[15] Ji, Shuiwang, et al. "3D convolutional neural networks for human action
recognition." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 35.1 (2013): 221-231.
[16] Ahad, Md Atiqur Rahman, et al. "Motion history image: its variants and
applications." Machine Vision and Applications 23.2 (2012): 255-281.
[17] Vedaldi, Andrea, and Karel Lenc. "MatConvNet: Convolutional neural
networks for matlab." Proceedings of the 23rd Annual ACM Conference
on Multimedia Conference. ACM, 2015.
[18] Chen, Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. "UTD-MHAD: a
multimodal dataset for human action recognition utilizing a depth
camera and a wearable inertial sensor." Image Processing (ICIP), 2015
IEEE International Conference on. IEEE, 2015.
[19] Bulbul, Mohammad Farhad, Yunsheng Jiang, and Jinwen Ma. "DMMs-
Based Multiple Features Fusion for Human Action
Recognition." International Journal of Multimedia Data Engineering
and Management (IJMDEM) 6.4 (2015): 23-39.