1. Daily Living Activities Recognition via Efficient
High and Low Level Cues Combination and
Fisher Kernel Representation
Negar Rostamzadeh (1)
Gloria Zen (1)
Ionut Mironica (2)
Jasper Uijlings (1)
Nicu Sebe (1)
1 DISI, University of Trento, Trento, Italy
2 LAPI, University Politehnica of Bucharest, Bucharest, Romania
3. Action Recognition in videos
Answer phone or dial phone?
Difficulties in fine-grained activities:
1. Activities differ only slightly in motion and appearance
2. The same task can be performed in different manners
4. Object-centric approaches - SoA
Object-centric approaches are based on tracking and trajectory analysis [6,16]
Advantages
- Provide semantic/high-level information about the scene
Limitations
- Handling occlusions during object interactions
- Broken and missed trajectories
- The curse of dimensionality
Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]
5. Non-object-centric approaches - SoA
Bag-of-words approaches relying on low-level features: HoG, HoF, STIP, foreground pixels
Advantages
- Robustness to noise & occlusions
- Computational efficiency
Limitations
1. Discard semantic & high-level information of the scene
2. Discard relationships among spatio-temporal local features
Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., VS 2008 [20]; Gehrig et al., Humanoids 2009 [21]; Mahbub et al., ICIEV 2012 [25]
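To make the pipeline these works share concrete, here is a minimal bag-of-words sketch (the codebook size, feature dimensionality, and data are illustrative placeholders, not values from any cited paper):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy bag-of-words pipeline over local spatio-temporal descriptors
# (e.g., HoG+HoF around STIPs; 162-D is one common choice).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 162))  # placeholder training features

# 1. Learn a visual codebook by clustering training descriptors.
codebook = KMeans(n_clusters=256, n_init=4, random_state=0).fit(train_descriptors)

def bow_histogram(video_descriptors: np.ndarray) -> np.ndarray:
    """Quantize each local descriptor to its nearest codeword and
    return an L1-normalized histogram: the video-level representation."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)

video_repr = bow_histogram(rng.normal(size=(300, 162)))  # one video -> 256-D histogram
```

Each video becomes a fixed-length histogram regardless of its duration, which is exactly why the relationships among the local features are lost.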
6. Enhanced descriptors - SoA
1. Relation between local features
Pair-wise relations [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]
2. Combining different local features
Such as local motion, appearance, and positions [14,24]
3. Enriching the combination of low-level features with high-level information
Detect and localize faces [7], STIP volumes [8,9]
Which body-part causes what motion?
Messing et al., ICCV 2009 [7]; Fathi et al., CVPR 2008 [8]; Zhang et al., ECCV 2012 [9]; Matikainen et al., ECCV 2010 [10]; Gaur et al., ICCV 2011 [11]; Savarese et al., WMVC 2008 [12]; Malgireddy et al., ICCV Workshops 2011 [14]; Kovashka et al., CVPR 2010 [18]; Shechtman et al., CVPR 2005 [24]
7. Approach at a glance
[Pipeline diagram] Input video → low-level cues + body-part detector → fused into an enriched descriptor → feature representation (accumulation over each video, or Fisher Kernel to model the temporal variation) → classifier → recognized activities
8. Enhanced pose estimator
Body-pose estimation: what is the problem with an off-the-shelf detector?
Our solution: employ an already-trained off-the-shelf detector, but provide it with some additional information from the new dataset
[Example frames from the BUFFY and ADL datasets]
9. Enhanced pose estimator
Body-pose estimation, built on Yang and Ramanan (CVPR 2011, PAMI 2012) [29]
1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010)
2. Model the body as a tree
3. Each possible body configuration has a score:
local score (HoG appearance of each part) + pair-wise score (deformation between connected parts)
The scores produced by the off-the-shelf detector give S_initial
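For reference, the scoring function of [29] has, in simplified form (dropping their mixture-of-types terms), the following shape; this is a sketch, not the paper's exact formulation:

```latex
S_{\text{initial}}(I, p) \;=\; \sum_{i \in V} w_i \cdot \phi_{\mathrm{HoG}}(I, p_i)
\;+\; \sum_{(i,j) \in E} w_{ij} \cdot \psi(p_i - p_j)
```

The first sum is the local appearance (HoG) score of each part i at location p_i; the second is the pair-wise spring-deformation score over the edges E of the tree, which is what lets the best configuration be found efficiently by dynamic programming.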
10. Enhanced pose estimator
Relative importance of the foreground and optical flow scores
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score
[Figure: pose-estimation accuracy as a function of the score weights]
11. Enhanced pose estimator
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score
[Qualitative comparison: SoA vs. our approach with optical flow; SoA vs. our approach with foreground]
12. Enhanced pose estimator used to enrich the action recognition approach
New Score = S_initial + weighted Foreground Score + weighted Optical Flow Score (weights tuned)
[Figure: weight tuning]
13. Approach at a glance (recap)
[Pipeline diagram as in slide 7: input video → low-level cues + body-part detector → enriched descriptor → accumulation or Fisher Kernel representation → classifier → recognized activities]
14. Fisher Kernel (FK) Theory
Fisher Kernel in the state of the art
1. Introduced by Jaakkola & Haussler, NIPS'99 [26], for protein homology detection
2. Web audio classification (Moreno, 2000)
3. Introduced in computer vision for image categorization by Perronnin et al., CVPR'07
Fisher Kernel in image categorization vs. video analysis
1. Models: spatial variation → temporal variation
2. Visual documents: small patches → frames of the video
3. Initial feature vectors: SIFT → our novel descriptors for action recognition
15. Fisher Kernel (FK) Theory
- Combines the benefits of generative and discriminative approaches
- Represents a signal as the gradient of the log-likelihood under a learned generative model of that signal
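Concretely, following [26]: given a generative model p(X | λ) with parameters λ, a signal X is mapped to the gradient of its log-likelihood, and the kernel compares these gradients under the Fisher information metric:

```latex
g_\lambda^X = \nabla_\lambda \log p(X \mid \lambda), \qquad
K(X, Y) = \left(g_\lambda^X\right)^{\!\top} F_\lambda^{-1}\, g_\lambda^Y, \qquad
F_\lambda = \mathbb{E}_X\!\left[\, g_\lambda^X \left(g_\lambda^X\right)^{\!\top} \right]
```

In practice the normalized gradient F_λ^{-1/2} g_λ^X (the Fisher vector) is fed directly to a linear classifier.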
16. Results on the ADL Rochester dataset
17. Conclusion
We proposed a novel descriptor that combines high-level semantic information with low-level cues.
We proposed an enhanced body-pose estimator.
We model temporal variation with the Fisher Kernel representation.
19. References
1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008, June). Learning
realistic human actions from movies. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.
2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. Computer Vision–ECCV 2008, 650–663.
3. Hospedales, T., Gong, S., & Xiang, T. (2009, September). A markov clustering
topic model for mining behaviour in video. In Computer Vision, 2009 IEEE 12th
International Conference on (pp. 1165–1172). IEEE.
4. Zen, G., & Ricci, E. (2011, June). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
5. Brendel, W., & Todorovic, S. (2011, November). Learning spatiotemporal
graphs of human activities. In Computer Vision (ICCV), 2011 IEEE International
Conference on (pp. 778–785). IEEE.
6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004, June). An algorithm for
multiple object trajectory tracking. In Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on (Vol. 1, pp. I–864). IEEE.
7. Messing, R., Pal, C., & Kautz, H. (2009, September). Activity recognition
using the velocity histories of tracked keypoints. In Computer Vision, 2009
IEEE 12th International Conference on (pp. 104–111). IEEE.
8. Fathi, A., & Mori, G. (2008, June). Action recognition by learning mid-level motion features. In Computer Vision and Pattern Recognition, 2008.
CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.
9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio-temporal phrases for activity recognition. Computer Vision–ECCV 2012, 707–721.
10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. Computer Vision–ECCV 2010, 508–521.
11. Gaur, U., Zhu, Y., Song, B., & Roy-Chowdhury, A. (2011, November). A “string of feature graphs” model for recognition of complex activities in natural videos. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2595–2602). IEEE.
12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei-Fei, L. (2008, January). Spatial-temporal correlatons for unsupervised action classification.
In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop
on (pp. 1–8). IEEE.
13. Taralova, E., De la Torre, F., & Hebert, M. (2011, November). Source
constrained clustering. In Computer Vision (ICCV), 2011 IEEE International
Conference on (pp. 1927–1934). IEEE.
14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE.
15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007, June). Learning motion
categories using both semantic and structural information. In Computer
Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (pp. 1–6). IEEE.
16. Liu, J., Luo, J., & Shah, M. (2009, June). Recognizing realistic actions from
videos “in the wild”. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on (pp. 1996–2003). IEEE.
17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011, November). Probabilistic
group-level motion analysis and scenario recognition. In Computer Vision
(ICCV), 2011 IEEE International Conference on (pp. 747–754). IEEE.
18. Kovashka, A., & Grauman, K. (2010, June). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2046–2053). IEEE.
19. Gilbert, A., Illingworth, J., & Bowden, R. (2009, September). Fast realistic
multi-action recognition using mined dense spatio-temporal features.
In Computer Vision, 2009 IEEE 12th International Conference on (pp. 925–931). IEEE.
20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).
21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009, December).
Hmm-based human motion recognition with optical flow data.
In Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International
Conference on (pp. 425–430). IEEE.
22. Sadanand, S., & Corso, J. J. (2012, June). Action bank: A high-level
representation of activity in video. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on (pp. 1234–1241). IEEE.
23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas,
W., & Windridge, D. (2011, January). An evaluation of bags-of-words and
spatio-temporal shapes for action recognition. In Applications of
Computer Vision (WACV), 2011 IEEE Workshop on (pp. 344–351). IEEE.
24. Shechtman, E., & Irani, M. (2005, June). Space-time behavior based correlation. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on. IEEE.
25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012, May). Motion
clustering-based action recognition technique using optical flow.
In Informatics, Electronics & Vision (ICIEV), 2012 International Conference
on (pp. 919–924). IEEE.
26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in
discriminative classifiers. Advances in neural information processing
systems, 487–493.
Editor's Notes
Hello, my name is Negar Rostamzadeh. I am working under the supervision of Prof. Nicu Sebe at the University of Trento, Italy. This is joint work with my colleagues Gloria Zen, Ionut Mironica and Jasper Uijlings.
To start with, I will talk about the difficulty of action recognition in daily living scenarios.
I will first cover the related work, then present the steps of our approach, and then the action recognition and pose estimation results on the ADL dataset. Finally, I will draw some concluding remarks.
Human-centric action recognition is a challenging problem both in images and in videos. In some video scenarios it is possible to extract a few frames from the video and label the activity based on the information in those single frames, while in other scenarios it is very difficult to label the activity from just a few frames. As an example, take a look at these two images. In both, a person is looking at a cell phone, but without looking at the next frames we do not have enough cues to tell whether he is going to answer the phone or dial a number.
As you see, fine-grained activities are activities that differ only slightly from each other in terms of motion, appearance and the objects present in the scene. Moreover, the same action can be performed in very different manners.
Now let's see what has been done in the state of the art. One type of approach can be called object-centric. These approaches are based on defining objects of interest (here, say, humans), detecting them and tracking them across consecutive frames. They provide high-level information about the detected objects, but they are unable to handle occlusions, and the problem of broken trajectories may occur.
The other group of approaches is non-object-centric. Non-object-centric approaches are popular nowadays and are based on low-level cues in a bag-of-words framework. They are more robust to noise and occlusions and computationally more efficient, but they also have limitations. First of all, because they jump directly from low-level cues to high-level information, some semantic and high-level information is discarded. Moreover, the relationships among low-level cues are discarded when these cues are analyzed in a bag-of-words framework.
Well, some approaches have addressed these problems by enhancing the descriptors. Some include the relationships among local features in the descriptor, such as relations among pairs of local features, relations within a local space or time neighborhood, and, recently, space-time phrases. Others enrich descriptors with a combination of different local features, such as local motion, local appearance and positions. Finally, a third type of method enriches descriptors with a combination of local features and high-level information that comes from detectors. Our approach belongs to this group: we define the different parts of the body as our semantic classes, detect them, and find out which motion belongs to which body part. Which body part causes what motion?
In this work we enrich the descriptor with a combination of low- and high-level information. Local motion is obtained by optical flow and quantized into 8 directions. The detected body parts represent the different semantic classes. To fuse this information, we build a vector of 8 bins for each body part, collecting the motions that belong to that body part, and then concatenate the motion vectors of all semantic classes. We then represent the feature descriptors of each video in two ways: the first is a simple accumulation of these feature vectors over the video, and the second uses the Fisher Kernel representation in order to model the temporal variation. Finally, a classifier is applied to these video representations and the recognition task is complete. In the following I will also talk about our enhanced body-part detection method and about applying the Fisher Kernel approach to model the temporal variation.
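A minimal sketch of this fusion step, assuming per-frame body-part masks from the pose estimator and a dense optical flow field (all names are illustrative, and weighting each direction bin by flow magnitude is my assumption; the paper defines the exact binning):

```python
import numpy as np

N_BINS = 8  # optical flow quantized into 8 directions

def frame_descriptor(flow: np.ndarray, part_masks: list[np.ndarray]) -> np.ndarray:
    """Fuse low-level motion with high-level semantics: one 8-bin flow
    orientation histogram per detected body part, concatenated.

    flow:       H x W x 2 optical flow field (dx, dy) for one frame
    part_masks: one boolean H x W mask per body part (semantic class)
    """
    angles = np.arctan2(flow[..., 1], flow[..., 0])                     # [-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * N_BINS).astype(int) % N_BINS
    magnitude = np.linalg.norm(flow, axis=-1)

    histograms = []
    for mask in part_masks:
        hist = np.zeros(N_BINS)
        # Accumulate flow magnitude per direction, only inside this part.
        np.add.at(hist, bins[mask], magnitude[mask])
        histograms.append(hist / max(hist.sum(), 1e-8))
    return np.concatenate(histograms)  # length = n_parts * 8
```

Per-video representations are then obtained either by accumulating these frame descriptors or by the Fisher Kernel representation described below.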
In the case of body-pose estimation, a significant drop in accuracy has been observed when a detector is trained on one dataset and evaluated on a different one. A possible solution is to annotate body-pose ground truth for the new dataset and re-train the classifier. However, this procedure is very expensive and introduces a considerable delay every time a new dataset has to be analyzed. Instead of training another classifier on the new dataset, we propose to use the already-trained classifier, but provide it with some additional information from the new dataset.
We build our approach on top of the Yang and Ramanan approach, which is among the best approaches in the literature. I will first summarize their approach.
Their model is a pictorial structure: the body is modeled as an ideal template in a graph structure, where the single body parts are the nodes of the graph and the edges are represented by springs. Different configurations are then made by deforming the main template.
Yang and Ramanan model this graph as a tree to simplify the model, so that each node is connected by a spring only to its parent node.
Then a score is given to each possible body-configuration.
This score is made by summing up the single body-part scores and the pairwise scores. We call this the initial score.
Here I present our new score, with the foreground and optical flow scores. The FG score is the fraction of foreground pixels in a box inside the predicted bounding box; the size of this box is controlled by the parameter gamma. The OF score is computed similarly, as the fraction of optical-flow pixels in a box whose size is controlled by lambda. Beta and eta are the weights of the FG and OF scores, while alpha sets the relative importance of these two scores: it expresses which of the optical flow or foreground scores improves the detection rate more.
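Read literally, this description suggests a combined score of the following form (a sketch reconstructed from the narration; the exact algebra, and in particular how α couples the two weights, is an assumption here):

```latex
S_{\text{new}} \;=\; S_{\text{initial}} \;+\; \beta\, S_{\text{FG}}^{(\gamma)} \;+\; \eta\, S_{\text{OF}}^{(\lambda)},
\qquad \text{with, e.g., } \beta \propto \alpha,\; \eta \propto (1-\alpha)
```

where S_FG^(γ) is the fraction of foreground pixels in a box of relative size γ inside the predicted part box, and S_OF^(λ) is the analogous fraction of moving (optical-flow) pixels in a box of relative size λ.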
Here I present the case where alpha equals 0, which means only the foreground is considered. This figure shows the average body-pose estimation accuracy for different optical-flow box sizes and weights. A sample is shown in which the left hand is not detected well, while by including the optical-flow information and increasing the score of the parts that are moving, our approach detects the left hand correctly. Similarly, I present the results for different values of beta and gamma: in the figures you see the right hand is mistakenly detected, while by including the foreground information it is detected correctly.
This figure also presents the relative importance of the FG and OF scores and their effect on accuracy. As we show in the paper, OF increases the detection rate of the parts that move the most, while FG improves the accuracy of the other parts, such as the stomach.
Now that we have made our enriched descriptor, we apply the Fisher Kernel representation.
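A minimal sketch of such a video-level representation, using the common GMM-based Fisher vector with gradients taken with respect to the means (a standard construction; the paper's exact generative model and normalization may differ, and all dimensions below are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frames: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Gradient of the GMM log-likelihood w.r.t. the means: captures how
    one video's frame descriptors deviate from the learned generative model.

    frames: T x D matrix of per-frame descriptors for one video
    """
    T = frames.shape[0]
    q = gmm.predict_proba(frames)                 # T x K soft assignments
    grads = []
    for k in range(gmm.n_components):
        # Diagonal covariances: normalize deviations by per-dimension std.
        diff = (frames - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        # Weighted, normalized mean gradient for component k (D-dimensional).
        g_k = (q[:, [k]] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
        grads.append(g_k)
    return np.concatenate(grads)                  # length K * D

# Usage: fit the GMM on frame descriptors pooled over the training videos.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 64))               # e.g., 64-D enriched descriptors
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(120, 64)), gmm)  # one video -> 1024-D vector
```

Unlike plain accumulation, this encoding keeps first-order statistics of how the frames are distributed over the model's components, which is how the temporal variation enters the representation.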