Human Action Recognition Using RGB-D Sensor and
Deep Convolutional Neural Networks
Javed Imran
Department of Computer Science and Engineering
IIT Roorkee
Roorkee, India
javed.csit@gmail.com
Praveen Kumar
Department of Computer Science and Engineering
Visvesvaraya National Institute of Technology
Nagpur, India
praveen.kverma@gmail.com
Abstract— In this paper, we propose an approach to recognize human actions by fusing RGB and depth data. First, Motion History Images (MHIs) are generated from the RGB videos to represent the temporal information of each action. Then the original depth data is converted to 3D point clouds and rotated, and three Depth Motion Maps (DMMs) are generated over the entire depth sequence, corresponding to the front, side and top projection views. A four-stream deep convolutional neural network is trained, where the first stream classifies MHIs and the remaining three classify the front, side and top views generated from the depth data. The proposed method is evaluated on the publicly available UTD-MHAD dataset, which contains both RGB and depth videos. Experimental results show that combining the two modalities gives better recognition accuracy than using either modality individually.
I. INTRODUCTION
Human action recognition is one of the most important topics in computer vision. It has applications in security systems, human-computer interaction, video indexing and querying, content-based video analytics, web-video search and retrieval, biomechanics, monitoring and intelligent environments. Prior to the release of depth cameras, research on action recognition mainly focused on learning and recognizing actions from image sequences captured by traditional RGB video cameras [1]. With the introduction of low-cost depth sensors such as the Microsoft Kinect and ASUS Xtion, many researchers have turned to action recognition using depth information [2-4]. Depth cameras have several advantages over RGB cameras. First, their output is insensitive to changes in lighting conditions. Second, the 3D structure and shape information provided by depth maps makes it easier to deal with problems such as segmentation and detection.
Recently, deep convolutional neural networks (CNNs) have given state-of-the-art performance in image recognition, segmentation and classification tasks [5-6]. In this paper, we use a pretrained ImageNet model, because no RGB-D dataset large enough to train a deep CNN from scratch currently exists. This approach bears similarity to other multi-stream approaches [7-9]. As suggested in [8], we first rotate the 3D point clouds constructed from the original depth data to handle view invariance. These rotated depth frames are used to generate depth motion maps (DMMs) by accumulating motion energy in three projected views [10]. Motion History Images (MHIs) are generated from the RGB videos, where the intensity of each pixel is a function of the recency of motion in the sequence [11]. Four CNNs are trained separately, corresponding to the front view, side view, top view and MHI, and their results are fused to produce the final classification score.
The rest of this paper is organized as follows. Section II presents related work. Section III discusses implementation details, including MHI and DMM generation and the proposed 4-stream CNN architecture. Section IV describes the experimental results, and Section V concludes the paper.
II. RELATED WORK
Motion History Image based action recognition has been actively studied since Bobick and Davis proposed the Motion Energy Image (MEI) and Motion History Image (MHI) to recognize many types of aerobics exercises [11]. Although the MHI is inexpensive to compute, this template-matching approach is susceptible to noise and to variations in how different individuals perform the same action. In [12], Meng et al. combined the Motion History Image (MHI) and the Modified Motion History Image (MMHI) and used SVM_2K as a linear binary classifier. With the release of the Kinect camera in 2010, researchers shifted their focus to action recognition based on depth data. In [2], Li et al. construct an action graph to describe the pattern of an action; the graph consists of multiple nodes, each representing a set of salient postures characterized by a bag of 3D points. However, this sampling scheme is view dependent. In [3], view-invariant histograms of 3D joint locations (HOJ3D) are computed from depth sequences, reprojected using LDA, and clustered into posture visual words. Discrete hidden Markov models are then used to model the temporal evolution of these visual words. In [4], Depth
Motion Maps (DMM) are generated by projecting depth maps
onto three orthogonal planes. Histogram of Oriented Gradients (HOG)
features are then extracted from DMMs and classified using
linear SVM. In [10], normalized DMMs are generated by
absolute differencing between two consecutive depth maps
without thresholding, and an l2-regularized classifier is
employed for action recognition. In general, all the above-mentioned methods are based on hand-crafted features, which are either time consuming to compute or dataset dependent.
With the recent success of Convolutional Neural Networks
[5], deep neural architectures are widely used in the area of
image and video classification tasks [13, 7]. The availability of pretrained ImageNet models [14] further encourages researchers to apply them in the domain of RGB-D action and object recognition. In [15], Ji et al. used a 3-dimensional (3D) CNN model for action recognition, which extracts features from both the spatial and temporal dimensions by performing 3D convolutions. The first layer of their architecture was hardwired to encode prior knowledge about features, so it is not clear how the approach would perform on a new dataset. In [7], Simonyan and Zisserman proposed a two-stream CNN for action recognition in videos, exploiting the fact that a video can be decomposed into spatial and temporal components. Each stream is implemented as a deep ConvNet, and their softmax scores are combined by late fusion. In [9], two separate
CNNs are trained for RGB and depth images. A jet colormap is applied to the depth images to convert them into three-channel images, so as to make effective use of the ImageNet pretrained model. The last layers of both CNNs are concatenated into one fully connected layer, followed by a softmax classifier, for object recognition.
Our work bears the most similarity to [8], in which each depth map is converted to a 3D point cloud and rotated before three DMMs are generated for the front, side and top views. However, we also use RGB video as a second modality to generate MHIs. Four CNNs are then trained, and their scores are combined by late fusion to produce the final classification score.
III. MHI, DMM AND 4-STREAM CNN
In this section, we describe the generation of Motion History
Image from RGB videos and Depth Motion Maps from depth
data. We also discuss the input preprocessing, proposed CNN
architecture, network training and class score fusion.
A. Motion History Image
The concept of the Motion History Image was proposed by Bobick and Davis [11] in 2001. They first proposed the motion energy image (MEI), which captures where motion occurs in a video sequence in a single image. They then introduced the motion history image (MHI), which encodes the temporal information of the motion in the image plane: pixels where motion occurred more recently are brighter than pixels where motion occurred earlier.
The MHI $H(x, y, t)$ can be computed from an update function $D(x, y, t)$ [16]:

$$H(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\big(0,\, H(x, y, t-1) - \delta\big) & \text{otherwise} \end{cases} \qquad (1)$$

Here, $(x, y)$ is the pixel location, $t$ denotes time, $D(x, y, t)$ signals object presence (or motion) in the current video frame, the duration $\tau$ governs the temporal extent of the movement, and $\delta$ is the decay parameter.
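As an illustration, the update rule in Eq. (1) can be sketched in a few lines of Python/NumPy. This is a minimal sketch, not the authors' implementation; simple frame differencing with a fixed threshold is assumed as the update function D:

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=255.0, delta=16.0, thresh=30):
    """One MHI update step (Eq. 1): pixels with motion are set to tau,
    all other pixels decay by delta, floored at zero."""
    # D(x, y, t): simple frame differencing as the motion/update function
    d = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32)) > thresh
    return np.where(d, tau, np.maximum(0.0, mhi - delta))

def compute_mhi(gray_frames, tau=255.0, delta=16.0):
    """Accumulate the MHI over a grayscale video (sequence of 2D frames)."""
    mhi = np.zeros_like(gray_frames[0], dtype=np.float32)
    for prev, cur in zip(gray_frames[:-1], gray_frames[1:]):
        mhi = update_mhi(mhi, prev, cur, tau, delta)
    return mhi.astype(np.uint8)  # brighter pixels = more recent motion
```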
B. Depth Motion Maps
Yang et al. [4] proposed the concept of Depth Motion Maps (DMMs) to capture 3D structure and depth information. The depth images in the entire video sequence are projected onto three orthogonal planes, and the absolute differences between consecutive projected maps are accumulated to form three 2D depth motion maps. Before projection, the depth data is converted to 3D point clouds and rotated, as discussed in [8]. This handles the problem of view invariance and also provides a way to generate more
training data. Finally, the DMM is generated as follows [10]:

$$DMM_v = \sum_{i=2}^{N} \left| \mathrm{map}_v^{\,i} - \mathrm{map}_v^{\,i-1} \right| \qquad (2)$$

where $\mathrm{map}_v^{\,i}$ is the projected map of the $i$-th frame under projection view $v \in \{\text{front}, \text{side}, \text{top}\}$ and $N$ is the total number of frames in the sequence.
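Equation (2) can likewise be sketched in NumPy. The projection helper below is an assumption, not the authors' code: it builds the side and top views as binary occupancy maps by quantizing depth, as a rough stand-in for the orthogonal projections of [4], while the front view is the depth map itself:

```python
import numpy as np

def project_views(depth, bins=320):
    """Stand-in for the three orthogonal projections in [4]: front = the
    depth map itself; side and top are binary occupancy maps obtained by
    quantizing depth into `bins` planes."""
    rows, cols = depth.shape
    q = np.clip(depth.astype(np.int64) * (bins - 1) // max(int(depth.max()), 1),
                0, bins - 1)
    side = np.zeros((rows, bins), dtype=np.float32)
    top = np.zeros((bins, cols), dtype=np.float32)
    r, c = np.nonzero(depth)
    side[r, q[r, c]] = 1.0
    top[q[r, c], c] = 1.0
    return {"front": depth.astype(np.float32), "side": side, "top": top}

def compute_dmms(depth_frames):
    """Eq. (2): accumulate |map_v^i - map_v^(i-1)| over the whole sequence."""
    dmms = {}
    prev = project_views(depth_frames[0])
    for frame in depth_frames[1:]:
        cur = project_views(frame)
        for v in cur:
            dmms[v] = dmms.get(v, 0) + np.abs(cur[v] - prev[v])
        prev = cur
    return dmms  # three 2D depth motion maps: front, side and top
```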
C. Input preprocessing
The MHI and three DMMs (front view, side view and top
view) generated using the techniques discussed above are in
grayscale. So we colorize them into 3 channel RGB images so
as to fully utilize the power of CNN pre-trained on ImageNet.
For the MHIs, we apply five different colormaps: copper, hot,
pink, gray and bone available in Matlab R2015. The rotated
DMMs are colorized using the improved rainbow
pseudocoloring technique proposed in [9]. Fig. 1 and 2 shows
the result after coloring. The resulting images are then resized
to 224×224 so as to make them compatible with the pre-
trained ImageNet model.
(a) Original MHI
(b) MHIs obtained after applying different colormaps
Fig. 1. Examples of original and colored MHIs for Swipe-right action
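The colorization and resizing step can be sketched as follows; Matplotlib colormaps are assumed here as a stand-in for the Matlab R2015 colormaps named above:

```python
import numpy as np
from matplotlib import cm
from PIL import Image

def colorize_and_resize(gray, cmap_name="hot", size=(224, 224)):
    """Map a grayscale MHI/DMM to a three-channel RGB image via a colormap
    and resize it to the 224x224 input resolution of VGG-16."""
    g = gray.astype(np.float32)
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)   # normalize to [0, 1]
    rgb = (cm.get_cmap(cmap_name)(g)[..., :3] * 255).astype(np.uint8)
    return np.asarray(Image.fromarray(rgb).resize(size))
```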
D. Proposed CNN Architecture
In [6], Simonyan and Zisserman have shown that very high recognition accuracy can be achieved by using a very deep architecture with very small (3×3) receptive fields.
(a) Original front, side and top view DMM
(b) DMMs obtained after applying pseudocoloring
Fig. 2. Examples of DMM for Wave action
In addition, very deep models generalize well to other datasets. This motivated us to use their pre-trained VGG-16 model for training our network. VGG-16 consists of sixteen weight layers: thirteen convolutional layers and three fully connected layers. In our proposed architecture, four such networks are combined by fusing their softmax scores, as shown in Fig. 3. Two fusion schemes are analyzed: the average rule and the product rule.
E. Network Training & Class Score Fusion
Four CNNs are trained: one for the color-coded MHIs and the remaining three for the front, side and top view DMMs. A dropout layer with ratio 0.8 is added between the last two fully connected layers to avoid overfitting. The learning rate is set to $10^{-4}$, the weight decay to 0.0005 and the momentum to 0.9, with a batch size of 16. The entire network is trained using the MatConvNet toolbox [17] on a system with an NVIDIA Quadro K4200 GPU. During testing, the posterior probabilities produced by the softmax layers of the four CNNs are combined using the average and product rules.
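At test time the decision-level fusion is a one-liner per rule; a sketch, assuming each of the four streams outputs a vector of softmax posteriors over the action classes:

```python
import numpy as np

def fuse_scores(stream_probs, rule="product"):
    """Late fusion of softmax posteriors from the four streams
    (MHI, front DMM, side DMM, top DMM), each of shape (num_classes,)."""
    p = np.stack(stream_probs)                    # (4, num_classes)
    fused = p.prod(axis=0) if rule == "product" else p.mean(axis=0)
    return int(np.argmax(fused))                  # predicted class index

# e.g. pred = fuse_scores([p_mhi, p_front, p_side, p_top], rule="product")
```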
IV. EXPERIMENTAL RESULTS
Our proposed framework is evaluated on the publicly available UTD-MHAD dataset [18], which contains both RGB and depth data captured using a Kinect. It contains 27 actions, as shown in Fig. 4. The same experimental setting as in [18] is followed: data from subjects 1, 3, 5 and 7 are used for training, and data from subjects 2, 4, 6 and 8 for testing. The results are given in Table I, and the individual class accuracies are shown in Fig. 5.
Fig. 3. DMM and MHI based 4-stream deep CNN architecture for action recognition
Swipe-left Swipe-right Wave Clap Throw Arm-cross
Basketball-shoot Draw-X Draw-circle-CW Draw-circle-CCW Draw-triangle Bowling
Boxing Baseball-swing Tennis-swing Arm-curl Tennis-serve Push
Knock Catch Pickup-Throw Jog Walk Sit-to-stand
Stand-to-sit Lunge Squat
Fig. 4. Samples of UTD-MHAD dataset
Fig. 5. Class-specific accuracy on the UTD-MHAD dataset (using product-rule decision-level fusion)
TABLE I. COMPARISON OF RECOGNITION ACCURACY ON UTD-MHAD DATASET

Method                                            Accuracy (%)
C. Chen et al. [18]                               79.1
Bulbul et al. [19]                                88.4
Ours: Depth (avg. of front, side & top DMMs)      87.9
Ours: RGB (using only MHI)                        70.0
Ours: Depth + RGB (average rule)                  88.8
Ours: Depth + RGB (product rule)                  91.2
V. CONCLUSION
In this paper, we have presented a deep convolutional neural network based framework to classify human actions from RGB-D data. The experimental results on the UTD-MHAD dataset demonstrate that fusing different modalities can give better performance than using each modality individually. Our approach is also more robust and efficient than traditional hand-crafted feature extraction techniques, and state-of-the-art results can be achieved even on a small dataset by fine-tuning a pre-trained model such as VGG-16. In future work, we will incorporate other modalities such as the skeleton stream and address confusion between similar classes by applying Dempster-Shafer belief theory.
REFERENCES
[1] Aggarwal, Jake K., and Michael S. Ryoo. "Human activity analysis: A
review." ACM Computing Surveys (CSUR) 43.3 (2011): 16.
[2] Li, Wanqing, Zhengyou Zhang, and Zicheng Liu. "Action recognition
based on a bag of 3d points." Computer Vision and Pattern Recognition
Workshops (CVPRW), 2010 IEEE Computer Society Conference on.
IEEE, 2010.
[3] Xia, Lu, Chia-Chih Chen, and J. K. Aggarwal. "View invariant human
action recognition using histograms of 3d joints." Computer Vision and
Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer
Society Conference on. IEEE, 2012.
[4] Yang, Xiaodong, Chenyang Zhang, and YingLi Tian. "Recognizing actions using depth motion maps-based histograms of oriented gradients." Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012.
[5] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in
neural information processing systems. 2012.
[6] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[7] Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional
networks for action recognition in videos." Advances in Neural
Information Processing Systems. 2014.
[8] Wang, Pichao, et al. "ConvNets-Based Action Recognition from Depth
Maps through Virtual Cameras and Pseudocoloring." Proceedings of the
23rd Annual ACM Conference on Multimedia Conference. ACM, 2015.
[9] Eitel, Andreas, et al. "Multimodal deep learning for robust RGB-D
object recognition." Intelligent Robots and Systems (IROS), 2015
IEEE/RSJ International Conference on. IEEE, 2015.
[10] Chen, Chen, Kui Liu, and Nasser Kehtarnavaz. "Real-time human action
recognition based on depth motion maps." Journal of Real-Time Image
Processing (2013): 1-9.
[11] Bobick, Aaron F., and James W. Davis. "The recognition of human
movement using temporal templates." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 23.3 (2001): 257-267.
[12] Meng, Hongying, et al. "Motion history histograms for human action
recognition." Embedded Computer Vision. Springer London, 2009. 139-
162.
[13] Krizhevsky, Alex, and Geoffrey E. Hinton. "Using very deep
autoencoders for content-based image retrieval." ESANN. 2011.
[14] http://www.vlfeat.org/matconvnet/pretrained/
[15] Ji, Shuiwang, et al. "3D convolutional neural networks for human action
recognition." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 35.1 (2013): 221-231.
[16] Ahad, Md Atiqur Rahman, et al. "Motion history image: its variants and
applications." Machine Vision and Applications 23.2 (2012): 255-281.
[17] Vedaldi, Andrea, and Karel Lenc. "MatConvNet: Convolutional neural
networks for matlab." Proceedings of the 23rd Annual ACM Conference
on Multimedia Conference. ACM, 2015.
[18] Chen, Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. "UTD-MHAD: a
multimodal dataset for human action recognition utilizing a depth
camera and a wearable inertial sensor." Image Processing (ICIP), 2015
IEEE International Conference on. IEEE, 2015.
[19] Bulbul, Mohammad Farhad, Yunsheng Jiang, and Jinwen Ma. "DMMs-
Based Multiple Features Fusion for Human Action
Recognition." International Journal of Multimedia Data Engineering
and Management (IJMDEM) 6.4 (2015): 23-39.