1. Daily Living Activities Recognition via Efficient
High and Low Level Cues Combination and
Fisher Kernel Representation
Negar Rostamzadeh (1)
Gloria Zen (1)
Ionut Mironica (2)
Jasper Uijlings (1)
Nicu Sebe (1)
1 DISI, University of Trento, Trento, Italy
2 LAPI, University Politehnica of Bucharest, Bucharest, Romania
3. Action Recognition in videos
Answer phone or dial phone?
Difficulties in fine-grained activities:
1. Activities differ only slightly in motion and appearance
2. The same task can be performed in different manners
4. Object-centric approaches - SoA
Object-centric approaches are based on tracking and trajectory analysis [6,16]
Advantages
- Provide semantic/high-level information about the scene
Limitations
- Handling occlusions during object interactions
- Broken and missed trajectories
- The curse of dimensionality
Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]
5. Non-object-centric approaches - SoA
Bag-of-words approaches relying on low-level features: HoG, HoF, STIP, foreground pixels
Advantages
- Robustness to noise & occlusions
- Computational efficiency
Limitations
1. Discard semantic & high-level information of the scene
2. Discard relationships among spatio-temporal local features
Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., VS 2008 [20]; Gehrig et al., Humanoids 2009 [21]; Mahbub et al., ICIEV 2012 [25]
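To make the pipeline these works share concrete, here is a minimal bag-of-words sketch (the codebook size, feature dimensionality, and data are illustrative placeholders, not values from any cited paper):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy bag-of-words pipeline over local spatio-temporal descriptors
# (e.g., HoG+HoF around STIPs; 162-D is one common choice).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 162))  # placeholder training features

# 1. Learn a visual codebook by clustering training descriptors.
codebook = KMeans(n_clusters=256, n_init=4, random_state=0).fit(train_descriptors)

def bow_histogram(video_descriptors: np.ndarray) -> np.ndarray:
    """Quantize each local descriptor to its nearest codeword and
    return an L1-normalized histogram: the video-level representation."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)

video_repr = bow_histogram(rng.normal(size=(300, 162)))  # one video -> 256-D histogram
```

Each video becomes a fixed-length histogram regardless of its duration, which is exactly why the relationships among the local features are lost.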
6. Enhanced descriptors - SoA
1. Relation between local features
Pair-wise relations [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]
2. Combining different local features
Such as local motion, appearance, and positions [14,24]
3. Enriching the combination of low-level features with high-level information
Detect and localize faces [7], STIP volumes [8,9]
Which body-part causes what motion?
Messing et al., ICCV 2009 [7]; Fathi et al., CVPR 2008 [8]; Zhang et al., ECCV 2012 [9]; Matikainen et al., ECCV 2010 [10]; Gaur et al., ICCV 2011 [11]; Savarese et al., WMVC 2008 [12]; Malgireddy et al., ICCV Workshops 2011 [14]; Kovashka et al., CVPR 2010 [18]; Shechtman et al., CVPR 2005 [24]
7. Approach at a glance
[Pipeline diagram] Input video → low-level cues + body-part detector → fused into an enriched descriptor → feature representation (accumulation over each video, or Fisher Kernel to model the temporal variation) → classifier → recognized activities
8. Enhanced pose estimator
Body-pose estimation: what is the problem with an off-the-shelf detector?
Our solution: employ an already-trained off-the-shelf detector, but provide it with some additional information from the new dataset
[Example frames from the BUFFY and ADL datasets]
9. Enhanced pose estimator
Body-pose estimation, built on Yang and Ramanan (CVPR 2011, PAMI 2012) [29]
1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010)
2. Model the body as a tree
3. Each possible body configuration has a score:
local score (HoG appearance of each part) + pair-wise score (deformation between connected parts)
The scores produced by the off-the-shelf detector give S_initial
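For reference, the scoring function of [29] has, in simplified form (dropping their mixture-of-types terms), the following shape; this is a sketch, not the paper's exact formulation:

```latex
S_{\text{initial}}(I, p) \;=\; \sum_{i \in V} w_i \cdot \phi_{\mathrm{HoG}}(I, p_i)
\;+\; \sum_{(i,j) \in E} w_{ij} \cdot \psi(p_i - p_j)
```

The first sum is the local appearance (HoG) score of each part i at location p_i; the second is the pair-wise spring-deformation score over the edges E of the tree, which is what lets the best configuration be found efficiently by dynamic programming.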
10. Enhanced pose estimator
Relative importance of the foreground and optical flow scores
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score
[Figure: pose-estimation accuracy as a function of the score weights]
11. Enhanced pose estimator
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score
[Qualitative comparison: SoA vs. our approach with optical flow; SoA vs. our approach with foreground]
12. Enhanced pose estimator used to enrich the action recognition approach
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score (weights tuned)
[Figure: weight tuning]
13. Approach at a glance (recap)
[Pipeline diagram as in slide 7: input video → low-level cues + body-part detector → enriched descriptor → accumulation or Fisher Kernel representation → classifier → recognized activities]
14. Fisher Kernel (FK) Theory
Fisher Kernel in the state of the art
1. Introduced by Jaakkola & Haussler, NIPS'99 [26], for protein homology detection
2. Web audio classification (Moreno, 2000)
3. Introduced in computer vision for image categorization by Perronnin et al., CVPR'07
Fisher Kernel in image categorization vs. video analysis
1. Models: spatial variation → temporal variation
2. Visual documents: small patches → frames of the video
3. Initial feature vectors: SIFT → our novel descriptors for action recognition
15. Fisher Kernel (FK) Theory
- Combines the benefits of generative and discriminative approaches
- Represents a signal as the gradient of the log-likelihood under a learned generative model of that signal
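Concretely, following [26]: given a generative model p(X | λ) with parameters λ, a signal X is mapped to the gradient of its log-likelihood, and the kernel compares these gradients under the Fisher information metric:

```latex
g_\lambda^X = \nabla_\lambda \log p(X \mid \lambda), \qquad
K(X, Y) = \left(g_\lambda^X\right)^{\!\top} F_\lambda^{-1}\, g_\lambda^Y, \qquad
F_\lambda = \mathbb{E}_X\!\left[\, g_\lambda^X \left(g_\lambda^X\right)^{\!\top} \right]
```

In practice the normalized gradient F_λ^{-1/2} g_λ^X (the Fisher vector) is fed directly to a linear classifier.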
16. Results on the ADL Rochester dataset
17. Conclusion
We proposed a novel descriptor that combines high-level semantic information with low-level cues.
We proposed an enhanced body-pose estimator.
We model temporal variation with the Fisher Kernel representation.
19. References
1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008, June). Learning
realistic human actions from movies. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.
2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. Computer Vision–ECCV 2008, 650–663.
3. Hospedales, T., Gong, S., & Xiang, T. (2009, September). A markov clustering
topic model for mining behaviour in video. In Computer Vision, 2009 IEEE 12th
International Conference on (pp. 1165–1172). IEEE.
4. Zen, G., & Ricci, E. (2011, June). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
5. Brendel, W., & Todorovic, S. (2011, November). Learning spatiotemporal
graphs of human activities. In Computer Vision (ICCV), 2011 IEEE International
Conference on (pp. 778–785). IEEE.
6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004, June). An algorithm for
multiple object trajectory tracking. In Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on (Vol. 1, pp. I–864). IEEE.
7. Messing, R., Pal, C., & Kautz, H. (2009, September). Activity recognition
using the velocity histories of tracked keypoints. In Computer Vision, 2009
IEEE 12th International Conference on (pp. 104–111). IEEE.
8. Fathi, A., & Mori, G. (2008, June). Action recognition by learning mid-level motion features. In Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.
9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio-temporal phrases for activity recognition. Computer Vision–ECCV 2012, 707–721.
10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. Computer Vision–ECCV 2010, 508–521.
11. Gaur, U., Zhu, Y., Song, B., & Roy-Chowdhury, A. (2011, November). A “string of feature graphs” model for recognition of complex activities in natural videos. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2595–2602). IEEE.
12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei-Fei, L. (2008, January). Spatial-temporal correlatons for unsupervised action classification.
In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop
on (pp. 1–8). IEEE.
13. Taralova, E., De la Torre, F., & Hebert, M. (2011, November). Source
constrained clustering. In Computer Vision (ICCV), 2011 IEEE International
Conference on (pp. 1927–1934). IEEE.
14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE.
15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007, June). Learning motion
categories using both semantic and structural information. In Computer
Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (pp. 1–6). IEEE.
16. Liu, J., Luo, J., & Shah, M. (2009, June). Recognizing realistic actions from
videos “in the wild”. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on (pp. 1996–2003). IEEE.
17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011, November). Probabilistic
group-level motion analysis and scenario recognition. In Computer Vision
(ICCV), 2011 IEEE International Conference on (pp. 747–754). IEEE.
18. Kovashka, A., & Grauman, K. (2010, June). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2046–2053). IEEE.
19. Gilbert, A., Illingworth, J., & Bowden, R. (2009, September). Fast realistic
multi-action recognition using mined dense spatio-temporal features.
In Computer Vision, 2009 IEEE 12th International Conference on (pp. 925–931). IEEE.
20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).
21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009, December).
Hmm-based human motion recognition with optical flow data.
In Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International
Conference on (pp. 425–430). IEEE.
22. Sadanand, S., & Corso, J. J. (2012, June). Action bank: A high-level
representation of activity in video. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on (pp. 1234–1241). IEEE.
23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas,
W., & Windridge, D. (2011, January). An evaluation of bags-of-words and
spatio-temporal shapes for action recognition. In Applications of
Computer Vision (WACV), 2011 IEEE Workshop on (pp. 344–351). IEEE.
24. Shechtman, E., & Irani, M. (2005, June). Space-time behavior based correlation. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on. IEEE.
25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012, May). Motion
clustering-based action recognition technique using optical flow.
In Informatics, Electronics & Vision (ICIEV), 2012 International Conference
on (pp. 919–924). IEEE.
26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in
discriminative classifiers. Advances in neural information processing
systems, 487–493.
Editor's Notes
Hello, my name is Negar Rostamzadeh. I am working under the supervision of Prof. Nicu Sebe at the University of Trento, Italy. This is joint work with my colleagues Gloria Zen, Ionut Mironica and Jasper Uijlings.
To start with, I will talk about the difficulty of action recognition in daily living scenarios.
I will first cover the related work, then present the steps of our approach, and then the action recognition and pose estimation results on the ADL dataset. Finally, I will draw some concluding remarks.
Human-centric action recognition is a challenging problem both in images and in videos. In some video scenarios it is possible to extract a few frames from the video and label the activity based on the information in those single frames, while in other scenarios it is very difficult to label the activity from just a few frames. As an example, take a look at these two images. In both, a person is looking at a cell phone, but without looking at the next frames we do not have enough cues to tell whether he is going to answer the phone or dial a number.
As you see, fine-grained activities are activities that differ only slightly from each other in terms of motion, appearance and the objects present in the scene. Moreover, the same action can be performed in very different manners.
Now let's see what has been done in the state of the art. One type of approach can be called object-centric. These approaches are based on defining objects of interest (here, say, humans), detecting them and tracking them across consecutive frames. They provide high-level information about the detected objects, but they are unable to handle occlusions, and the problem of broken trajectories may occur.
The other group of approaches is non-object-centric. Non-object-centric approaches are popular nowadays and are based on low-level cues in a bag-of-words framework. They are more robust to noise and occlusions and computationally more efficient, but they also have limitations. First of all, because they jump directly from low-level cues to high-level information, some semantic and high-level information is discarded. Moreover, the relationships among low-level cues are discarded when these cues are analyzed in a bag-of-words framework.
Well, some approaches have addressed these problems by enhancing the descriptors. Some include the relationships among local features in the descriptor, such as relations among pairs of local features, relations within a local space or time neighborhood, and, recently, space-time phrases. Others enrich descriptors with a combination of different local features, such as local motion, local appearance and positions. Finally, a third type of method enriches descriptors with a combination of local features and high-level information that comes from detectors. Our approach belongs to this group: we define the different parts of the body as our semantic classes, detect them, and find out which motion belongs to which body part. Which body part causes what motion?
In this work we enrich the descriptor with a combination of low- and high-level information. Local motion is obtained by optical flow and quantized into 8 directions. The detected body parts represent the different semantic classes. To fuse this information, we build a vector of 8 bins for each body part, collecting the motions that belong to that body part, and then concatenate the motion vectors of all semantic classes. We then represent the feature descriptors of each video in two ways: the first is a simple accumulation of these feature vectors over the video, and the second uses the Fisher Kernel representation in order to model the temporal variation. Finally, a classifier is applied to these video representations and the recognition task is complete. In the following I will also talk about our enhanced body-part detection method and about applying the Fisher Kernel approach to model the temporal variation.
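A minimal sketch of this fusion step, assuming per-frame body-part masks from the pose estimator and a dense optical flow field (all names are illustrative, and weighting each direction bin by flow magnitude is my assumption; the paper defines the exact binning):

```python
import numpy as np

N_BINS = 8  # optical flow quantized into 8 directions

def frame_descriptor(flow: np.ndarray, part_masks: list[np.ndarray]) -> np.ndarray:
    """Fuse low-level motion with high-level semantics: one 8-bin flow
    orientation histogram per detected body part, concatenated.

    flow:       H x W x 2 optical flow field (dx, dy) for one frame
    part_masks: one boolean H x W mask per body part (semantic class)
    """
    angles = np.arctan2(flow[..., 1], flow[..., 0])                     # [-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * N_BINS).astype(int) % N_BINS
    magnitude = np.linalg.norm(flow, axis=-1)

    histograms = []
    for mask in part_masks:
        hist = np.zeros(N_BINS)
        # Accumulate flow magnitude per direction, only inside this part.
        np.add.at(hist, bins[mask], magnitude[mask])
        histograms.append(hist / max(hist.sum(), 1e-8))
    return np.concatenate(histograms)  # length = n_parts * 8
```

Per-video representations are then obtained either by accumulating these frame descriptors or by the Fisher Kernel representation described below.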
In the case of body-pose estimation, a significant drop in accuracy has been observed when a detector is trained on one dataset and evaluated on a different one. A possible solution is to annotate body-pose ground truth for the new dataset and re-train the classifier. However, this procedure is very expensive and introduces a considerable delay every time a new dataset has to be analyzed. Instead of training another classifier on the new dataset, we propose to use the already-trained classifier, but provide it with some additional information from the new dataset.
We build our approach on top of the Yang and Ramanan approach, which is among the best approaches in the literature. I will first summarize their approach.
Their model is a pictorial structure: the body is modeled as an ideal template in a graph structure, where the single body parts are the nodes of the graph and the edges are represented by springs. Different configurations are then made by deforming the main template.
Yang and Ramanan model this graph as a tree to simplify the model, so that each node is connected by a spring only to its parent node.
Then a score is given to each possible body-configuration.
This score is made by summing up the single body-part scores and the pairwise scores. We call this the initial score.
Here I present our new score, with the foreground and optical flow scores. The FG score is the fraction of foreground pixels in a box inside the predicted bounding box; the size of this box is controlled by the parameter gamma. The OF score is computed similarly, as the fraction of optical-flow pixels in a box whose size is controlled by lambda. Beta and eta are the weights of the FG and OF scores, while alpha sets the relative importance of these two scores: it expresses which of the optical flow or foreground scores improves the detection rate more.
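Read literally, this description suggests a combined score of the following form (a sketch reconstructed from the narration; the exact algebra, and in particular how α couples the two weights, is an assumption here):

```latex
S_{\text{new}} \;=\; S_{\text{initial}} \;+\; \beta\, S_{\text{FG}}^{(\gamma)} \;+\; \eta\, S_{\text{OF}}^{(\lambda)},
\qquad \text{with, e.g., } \beta \propto \alpha,\; \eta \propto (1-\alpha)
```

where S_FG^(γ) is the fraction of foreground pixels in a box of relative size γ inside the predicted part box, and S_OF^(λ) is the analogous fraction of moving (optical-flow) pixels in a box of relative size λ.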
Here I present the case where alpha equals 0, which means only the foreground is considered. This figure shows the average body-pose estimation accuracy for different optical-flow box sizes and weights. A sample is shown in which the left hand is not detected well, while by including the optical-flow information and increasing the score of the parts that are moving, our approach detects the left hand correctly. Similarly, I present the results for different values of beta and gamma: in the figures you see the right hand is mistakenly detected, while by including the foreground information it is detected correctly.
This figure also presents the relative importance of the FG and OF scores and their effect on accuracy. As we show in the paper, OF increases the detection rate of the parts that move the most, while FG improves the accuracy of the other parts, such as the stomach.
Now that we have made our enriched descriptor, we apply the Fisher Kernel representation.
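A minimal sketch of such a video-level representation, using the common GMM-based Fisher vector with gradients taken with respect to the means (a standard construction; the paper's exact generative model and normalization may differ, and all dimensions below are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frames: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Gradient of the GMM log-likelihood w.r.t. the means: captures how
    one video's frame descriptors deviate from the learned generative model.

    frames: T x D matrix of per-frame descriptors for one video
    """
    T = frames.shape[0]
    q = gmm.predict_proba(frames)                 # T x K soft assignments
    grads = []
    for k in range(gmm.n_components):
        # Diagonal covariances: normalize deviations by per-dimension std.
        diff = (frames - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        # Weighted, normalized mean gradient for component k (D-dimensional).
        g_k = (q[:, [k]] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
        grads.append(g_k)
    return np.concatenate(grads)                  # length K * D

# Usage: fit the GMM on frame descriptors pooled over the training videos.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 64))               # e.g., 64-D enriched descriptors
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(120, 64)), gmm)  # one video -> 1024-D vector
```

Unlike plain accumulation, this encoding keeps first-order statistics of how the frames are distributed over the model's components, which is how the temporal variation enters the representation.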