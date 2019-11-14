
Human Action Recognition Using RGB-D Sensor and Deep Convolutional Neural Networks

Javed Imran, Department of Computer Science and Engineering, IIT Roorkee, Roorkee, India, javed.csit@gmail.com
Praveen Kumar, Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India, praveen.kverma@gmail.com

Abstract: In this paper, we propose an approach to recognize human actions by fusing RGB and depth data. First, Motion History Images (MHIs), which represent the temporal information about the action, are generated from RGB videos. Then the 3D point clouds constructed from the original depth data are rotated, and three Depth Motion Maps (DMMs) are generated over the entire depth sequence, corresponding to the front, side and top projection views. A 4-channel deep Convolutional Neural Network is trained, where the first channel classifies the MHIs and the remaining three classify the front, side and top views generated from the depth data, respectively. The proposed method is evaluated on the publicly available UTD-MHAD dataset, which contains both RGB and depth videos. Experimental results show that combining the two modalities gives better recognition accuracy than using each modality individually.

I. INTRODUCTION

Human action recognition is one of the most important topics in computer vision. It has applications in security systems, human-computer interaction, video indexing and querying, content-based video analytics, web-video search and retrieval, bio-mechanics, monitoring and intelligent environments. Prior to the release of depth cameras, research on action recognition mainly focused on learning and recognizing actions from image sequences captured by traditional RGB video cameras [1]. With the introduction of low-cost depth sensors like Microsoft Kinect and ASUS Xtion, many researchers have focused on action recognition using depth information [2-4]. Depth cameras have several advantages over RGB cameras.
For example, the outputs of depth cameras are insensitive to changes in lighting conditions. In addition, the 3D structure and shape information provided by depth maps make it easier to deal with problems like segmentation and detection.

Recently, deep Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in image recognition, segmentation and classification tasks [5-6]. In this paper, we use a model pretrained on ImageNet because no RGB-D dataset exists that is large enough to train a deep CNN from scratch. This approach bears similarity to other multi-stream approaches [7-9]. As suggested in [8], we first rotate the 3D point clouds constructed from the original depth data to handle view invariance. These rotated depth frames are used to generate depth motion maps (DMMs) by accumulating motion energy in three projected views [10]. Motion History Images (MHIs) are generated from RGB videos, where the intensity of each pixel is a function of the recency of motion in a sequence [11]. Four CNNs are trained separately, corresponding to the front view, side view, top view and MHI, and their results are fused to produce the final classification score.

The rest of this paper is organized as follows. Section II presents related work. Section III discusses implementation details, including MHI and DMM generation and the proposed 4-stream CNN architecture. Section IV describes the experimental results. Section V concludes the paper.

II. RELATED WORK

Motion History Image based action recognition has been actively studied since Bobick and Davis proposed the concepts of the Motion Energy Image (MEI) and Motion History Image (MHI) to recognize many types of aerobics exercises [11]. Although computing an MHI is inexpensive, this template matching approach is susceptible to noise and to variations in how different individuals perform the same action. In [12], Meng et al.
combined the Motion History Image (MHI) and the Modified Motion History Image (MMHI) and used SVM_2K as a linear binary classifier.

With the release of the Kinect camera in 2010, researchers shifted their focus to action recognition based on depth data. In [2], Li et al. construct an action graph to describe the pattern of the action. Their action graph consists of multiple nodes, where each node represents a set of salient postures characterized by a bag of 3D points. However, this sampling scheme is view dependent. In [3], view-invariant histograms of 3D joint locations (HOJ3D) are calculated from action depth sequences. They are reprojected using LDA, and clusters are formed from similar postures to obtain posture visual words. Discrete hidden Markov models are used to model the temporal evolution of these visual words. In [4], Depth Motion Maps (DMMs) are generated by projecting depth maps onto three orthogonal planes. Histogram of Oriented Gradients (HOG) features are then extracted from the DMMs and classified using a linear SVM.

2016 Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI), Sept. 21-24, 2016, Jaipur, India 978-1-5090-2029-4/16/$31.00 ©2016 IEEE 144

In [10], normalized DMMs are generated by absolute differencing between two consecutive depth maps without thresholding, and an l2-regularized classifier is
employed for action recognition. In general, all the above-mentioned methods are based on hand-crafted features, which are either time consuming or dataset dependent.

With the recent success of Convolutional Neural Networks [5], deep neural architectures are widely used for image and video classification tasks [13, 7]. The availability of pretrained ImageNet models [14] further enables researchers to apply them to RGB-D action and object recognition. In [15], Ji et al. use a 3-dimensional (3D) CNN model for action recognition. This model extracts features from both the spatial and temporal dimensions by performing 3D convolutions. The first layer in their architecture is hardwired to encode prior knowledge on features. Thus, it is not clear how this approach will perform when applied to a new dataset. In [7], Simonyan and Zisserman propose a two-stream CNN for action recognition in videos. They use the fact that a video can be decomposed into spatial and temporal components. Each stream is implemented using a deep ConvNet, and the softmax scores of the two streams are combined by late fusion. In [9], two separate CNNs are trained for RGB and depth images. A jet colormap is applied to the depth images to convert them into three-channel images so as to make effective use of an ImageNet pretrained model. The last layers of both CNNs are concatenated into one fully connected layer, followed by a softmax classifier, for object recognition.

Our work bears the most similarity to [8], in which each depth map is rotated in 3D point clouds before generating three DMMs for the front, side and top views. However, we also use RGB video as a second modality to generate MHIs. Four CNNs are then trained, and their scores are combined by late fusion to give the final classification score.

III. MHI, DMM AND 4-STREAM CNN

In this section, we describe the generation of Motion History Images from RGB videos and Depth Motion Maps from depth data.
We also discuss the input preprocessing, the proposed CNN architecture, network training and class score fusion.

A. Motion History Image

The concept of the Motion History Image was proposed by Bobick and Davis [11] in 2001. They first proposed the motion energy image (MEI), which captures the occurrence of motion in a video sequence in a single image. They then generated the motion history image (MHI), which gives the temporal information of the motion in the image plane: pixels are brighter where motion occurred more recently than where it occurred earlier. Adding the MHI and the MEI gives the MHI for a video sample. The MHI H(x, y, t) can be computed from an update function D(x, y, t) [16]:

$$ H(x, y, t) = \begin{cases} \tau, & \text{if } D(x, y, t) = 1 \\ \max\left(0,\; H(x, y, t-1) - \delta\right), & \text{otherwise} \end{cases} \tag{1} $$

Here, (x, y) is the pixel location, t denotes time, D(x, y, t) indicates object presence (or motion) in the current video image, the duration τ governs the temporal extent of the movement, and δ is the decay parameter.

B. Depth Motion Maps

Yang et al. [4] proposed the concept of Depth Motion Maps (DMMs) to capture 3D structure and depth information. The depth images in the entire depth video sequence are projected onto three orthogonal planes. Then the absolute differences between consecutive projected depth maps are calculated and accumulated to form three 2D depth motion maps. Before projection, the depth data is rotated in 3D point clouds as discussed in [8]. This is done to handle viewpoint variation and also provides a way to generate more training data. Finally, the DMM is generated as follows [10]:

$$ DMM_v = \sum_{i=1}^{N-1} \left| map_v^{i+1} - map_v^{i} \right| \tag{2} $$

where $map_v^{i}$ is the projected map of the ith frame under projection view $v \in \{front, side, top\}$ and N is the number of frames in the sequence.

C. Input preprocessing

The MHI and the three DMMs (front view, side view and top view) generated using the techniques discussed above are grayscale images. We therefore colorize them into 3-channel RGB images so as to fully utilize the power of a CNN pre-trained on ImageNet.
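As a reminder of what these grayscale templates contain before colorization, the following is a minimal pure-Python sketch of the MHI of Section III-A and the DMM of Section III-B. It assumes the common formulation in which a pixel is set to τ where motion is present and is otherwise decayed by δ (never below zero), and it takes the DMM to be the sum of absolute differences between consecutive projected maps. The function names, the frame-differencing threshold used as the update function D, and the default values of `tau`, `delta` and `threshold` are illustrative assumptions, not settings from the paper.

```python
def motion_mask(prev_frame, frame, threshold=30):
    """Assumed update function D: 1 where the frame difference exceeds a threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_prev, row_cur)]
        for row_prev, row_cur in zip(prev_frame, frame)
    ]


def update_mhi(prev_mhi, mask, tau=255, delta=16):
    """One MHI step: moving pixels are set to tau, the others decay by delta."""
    return [
        [tau if moving else max(0, h - delta) for h, moving in zip(h_row, m_row)]
        for h_row, m_row in zip(prev_mhi, mask)
    ]


def motion_history_image(frames, tau=255, delta=16, threshold=30):
    """Accumulate an MHI over a list of grayscale frames (lists of rows)."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, cur in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, motion_mask(prev, cur, threshold), tau, delta)
    return mhi


def depth_motion_map(projected_maps):
    """Sum of absolute differences between consecutive projected depth maps."""
    height, width = len(projected_maps[0]), len(projected_maps[0][0])
    dmm = [[0] * width for _ in range(height)]
    for prev, cur in zip(projected_maps, projected_maps[1:]):
        for y in range(height):
            for x in range(width):
                dmm[y][x] += abs(cur[y][x] - prev[y][x])
    return dmm


# Toy 1x3 sequence: a bright spot moving from left to right (made-up data).
frames = [[[200, 0, 0]], [[0, 200, 0]], [[0, 0, 200]]]
print(motion_history_image(frames))  # [[239, 255, 255]]
print(depth_motion_map(frames))      # [[200, 400, 200]]
```

In the toy MHI, the leftmost pixel moved earliest and has already decayed once, so it is darker than the two pixels that moved in the last step. One DMM would be computed per projection view (front, side, top), each from its own sequence of projected maps.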
For the MHIs, we apply five different colormaps available in Matlab R2015: copper, hot, pink, gray and bone. The rotated DMMs are colorized using the improved rainbow pseudocoloring technique proposed in [9]. Figs. 1 and 2 show the results after coloring. The resulting images are then resized to 224×224 to make them compatible with the pre-trained ImageNet model.

Fig. 1. Examples of original and colored MHIs for the Swipe-right action: (a) original MHI; (b) MHIs obtained after applying different colormaps.

D. Proposed CNN Architecture

In [6], Simonyan and Zisserman showed that very high recognition accuracy can be achieved by using a very deep architecture and filters with very small (3×3) receptive fields.
In addition, very deep models generalize well to other datasets. This motivated us to use their pre-trained VGG-16 model for training our network. VGG-16 consists of sixteen weight layers: thirteen convolutional layers and three fully connected layers. In our proposed architecture, four such networks are combined by fusing their softmax scores, as shown in Fig. 3. Two fusion schemes are analyzed: the average rule and the product rule.

Fig. 2. Examples of DMMs for the Wave action: (a) original front, side and top view DMMs; (b) DMMs obtained after applying pseudocoloring.

E. Network Training & Class Score Fusion

Four CNNs are trained: one for the color-coded MHIs and the remaining three for the front, side and top view DMMs, respectively. A dropout layer with ratio 0.8 is added between the last two fully connected layers to avoid overfitting. The learning rate is set to 10^-4, weight decay to 0.0005 and momentum to 0.9, with a batch size of 16. The entire network is trained using the MatConvNet [17] toolbox on a system with an NVIDIA Quadro K4200 GPU. During testing, the posterior probabilities generated by the softmax layers of the four CNNs are combined using the average and product rules.

IV. EXPERIMENTAL RESULTS

Our proposed framework is evaluated on the publicly available UTD-MHAD dataset [18], which contains both RGB and depth data captured using Kinect. It contains 27 actions, as shown in Fig. 4. We follow the experimental setting of [18]: data from subjects 1, 3, 5 and 7 are used for training, and data from subjects 2, 4, 6 and 8 are used for testing. The results are given in Table I, and the individual class accuracies are shown in Fig. 5.

Fig. 3. DMM and MHI based 4-stream deep CNN architecture for action recognition
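The score-level fusion described in Section III-E can be illustrated with a small sketch. It assumes each of the four streams outputs a probability vector over the same classes, and it renormalizes the product so that the fused scores sum to one (an added convenience that does not change the predicted class). The function names and the probability values are hypothetical and chosen only to show that the two rules can disagree.

```python
def fuse_average(stream_probs):
    """Average rule: mean of the per-stream class probabilities."""
    n_streams = len(stream_probs)
    return [sum(column) / n_streams for column in zip(*stream_probs)]


def fuse_product(stream_probs):
    """Product rule: per-class product across streams, renormalized to sum to 1."""
    fused = []
    for column in zip(*stream_probs):
        score = 1.0
        for p in column:
            score *= p
        fused.append(score)
    total = sum(fused)
    return [score / total for score in fused]


def predict(fused_scores):
    """Index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=lambda k: fused_scores[k])


# Made-up softmax outputs for three classes from the four streams.
streams = [
    [0.01, 0.60, 0.39],  # MHI stream
    [0.60, 0.30, 0.10],  # front-view DMM stream
    [0.60, 0.30, 0.10],  # side-view DMM stream
    [0.60, 0.30, 0.10],  # top-view DMM stream
]

print(predict(fuse_average(streams)))  # 0
print(predict(fuse_product(streams)))  # 1
```

In this toy case the average rule follows the three agreeing depth streams, while the product rule is pulled away from class 0 because one stream assigns it a probability close to zero. This sensitivity to a single confident veto is the main behavioral difference between the two rules.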
Fig. 4. Samples of the UTD-MHAD dataset. Actions shown: Swipe-left, Swipe-right, Wave, Clap, Throw, Arm-cross, Basketball-shoot, Draw X, Draw-circle-CC, Draw-circle-CCW, Draw-triangle, Bowling, Boxing, Baseball-swing, Tennis-swing, Arm-curl, Tennis-serve, Push, Knock, Catch, Pickup-Throw, Jog, Walk, Sit-to-stand, Stand-to-sit, Lunge, Squat.

Fig. 5. Class-specific accuracy for the UTD-MHAD dataset (using product rule based decision-level fusion).
TABLE I. COMPARISON OF RECOGNITION ACCURACY ON UTD-MHAD DATASET

Method                                           Accuracy (%)
C. Chen et al. [18]                              79.1
Bulbul et al. [19]                               88.4
Ours: Depth (avg. of front, side & top DMM)      87.9
Ours: RGB (using only MHI)                       70.0
Ours: Depth + RGB (Average Rule)                 88.8
Ours: Depth + RGB (Product Rule)                 91.2

V. CONCLUSION

In this paper, we have presented a deep convolutional neural network based framework to classify human actions from RGB-D data. The experimental results on the UTD-MHAD dataset demonstrate that fusing different modalities can give better performance than using each modality individually. Our approach also proves more robust and efficient than traditional feature extraction techniques based on hand-crafted features. State-of-the-art results can be achieved even on a small dataset by fine-tuning a pre-trained model like VGG-16. In the future, we will combine other modalities, such as the skeleton stream, and handle confusion between similar classes by applying Dempster-Shafer belief theory.

REFERENCES

[1] Aggarwal, Jake K., and Michael S. Ryoo. "Human activity analysis: A review." ACM Computing Surveys (CSUR) 43.3 (2011): 16.
[2] Li, Wanqing, Zhengyou Zhang, and Zicheng Liu. "Action recognition based on a bag of 3d points." Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010.
[3] Xia, Lu, Chia-Chih Chen, and J. K. Aggarwal. "View invariant human action recognition using histograms of 3d joints." Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012.
[4] Yang, Xiaodong, Chenyang Zhang, and YingLi Tian. "Recognizing actions using depth motion maps-based histograms of oriented gradients." Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012.
[5] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[6] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[7] Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.
[8] Wang, Pichao, et al. "ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring." Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015.
[9] Eitel, Andreas, et al. "Multimodal deep learning for robust RGB-D object recognition." Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015.
[10] Chen, Chen, Kui Liu, and Nasser Kehtarnavaz. "Real-time human action recognition based on depth motion maps." Journal of Real-Time Image Processing (2013): 1-9.
[11] Bobick, Aaron F., and James W. Davis. "The recognition of human movement using temporal templates." Pattern Analysis and Machine Intelligence, IEEE Transactions on 23.3 (2001): 257-267.
[12] Meng, Hongying, et al. "Motion history histograms for human action recognition." Embedded Computer Vision. Springer London, 2009. 139-162.
[13] Krizhevsky, Alex, and Geoffrey E. Hinton. "Using very deep autoencoders for content-based image retrieval." ESANN. 2011.
[14] http://www.vlfeat.org/matconvnet/pretrained/
[15] Ji, Shuiwang, et al. "3D convolutional neural networks for human action recognition." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.1 (2013): 221-231.
[16] Ahad, Md Atiqur Rahman, et al. "Motion history image: its variants and applications." Machine Vision and Applications 23.2 (2012): 255-281.
[17] Vedaldi, Andrea, and Karel Lenc. "MatConvNet: Convolutional neural networks for matlab." Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015.
[18] Chen, Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. "UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor." Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015.
[19] Bulbul, Mohammad Farhad, Yunsheng Jiang, and Jinwen Ma. "DMMs-Based Multiple Features Fusion for Human Action Recognition." International Journal of Multimedia Data Engineering and Management (IJMDEM) 6.4 (2015): 23-39.

