Daily Living Activities Recognition via Efficient 
High and Low Level Cues Combination and 
Fisher Kernel Representation 
Negar Rostamzadeh1 
Gloria Zen1 
Ionut Mironica2 
Jasper Uijlings1 
Nicu Sebe1 
1 DISI, University of Trento, Trento, Italy 
2 LAPI, University Politehnica of Bucharest, Bucharest, Romania
Outline 
• Daily Living Action Recognition 
• State-of-the-art 
• Our approach 
• Results 
• Conclusion 
Action Recognition in Videos
Answer phone or dial phone?
Difficulties in fine-grained activities:
1. Activities differ only slightly in motion and appearance
2. The same task can be performed in different manners
Object-centric approaches - SoA
Object-centric approaches are based on tracking and trajectory analysis [6,16]
Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]
Advantages
• Provide semantic/high-level information about the scene
Limitations
• Handling occlusions during object interactions
• Broken and missed trajectories
• The curse of dimensionality
Non-object-centric approaches - SoA
Bag-of-words approaches relying on low-level features such as HoG, HoF, STIP, and foreground pixels
Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., VS 2008 [20]; Gehrig et al., Humanoids 2009 [21]; Mahbub et al., ICIEV 2012 [25]
Advantages
• Robust to noise and occlusions
• Computationally efficient
Limitations
1. Discard semantic and high-level information about the scene
2. Discard the relationships among spatio-temporal local features
Enhanced descriptors - SoA
1. Relations between local features: pair-wise relations [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]
2. Combinations of different local features, such as local motion, appearance, and positions [14,24]
3. Combinations of low-level features enriched with high-level information: detected and localized faces [7], STIP volumes [8,9]
Our question: which body part causes what motion?
Messing et al., ICCV 2009 [7]; Fathi & Mori, CVPR 2008 [8]; Zhang et al., ECCV 2012 [9]; Matikainen et al., ECCV 2010 [10]; Gaur et al., ICCV 2011 [11]; Savarese et al., WMVC 2008 [12]; Malgireddy et al., ICCVW 2011 [14]; Kovashka & Grauman, CVPR 2010 [18]; Shechtman & Irani [24]
Approach at a glance
Input video → low-level cues + body-part detector → information fused to produce an enriched descriptor → feature representation (accumulation over each video, or a Fisher kernel to model the temporal variation) → classifier → recognized activities
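A minimal sketch of the fusion step, assuming (per the speaker notes at the end of this deck) local motion quantized into 8 optical-flow directions and one binary mask per detected body part; names and shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def fused_descriptor(flow_dirs, part_masks, n_bins=8):
    """Concatenate one 8-bin motion histogram per body part, answering
    'which body part causes what motion'."""
    hists = [np.bincount(flow_dirs[mask], minlength=n_bins)
             for mask in part_masks]            # one motion histogram per part
    return np.concatenate(hists).astype(float)

# Toy usage: random flow directions on a 60x80 frame, two body-part masks.
dirs = np.random.randint(0, 8, size=(60, 80))
masks = [np.zeros((60, 80), bool), np.zeros((60, 80), bool)]
masks[0][5:25, 30:50] = True                    # e.g. head region
masks[1][25:55, 20:60] = True                   # e.g. torso region
desc = fused_descriptor(dirs, masks)            # length = 2 parts x 8 bins
```

Accumulating these per-frame descriptors over a video gives the first representation; the Fisher kernel encoding of slide 15 gives the second.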
Enhanced pose estimator
Body-pose estimation: what is the problem with an off-the-shelf detector?
A detector trained on one dataset (e.g. BUFFY) drops in accuracy when evaluated on a different one (e.g. ADL).
Our solution: employ the already-trained off-the-shelf detector, but provide it with additional information from the new dataset.
Enhanced pose estimator
Body-pose estimation, built on Yang and Ramanan (PAMI 2012, CVPR 2011) [29]:
1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010)
2. Model the body as a tree
3. Each possible body configuration has a score: the sum of local appearance scores (HoG) and pair-wise deformation scores
The score obtained with the off-the-shelf detector is S_initial.
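As a toy illustration of this scoring, here is a minimal max-sum dynamic program over a chain of parts with synthetic scores; the real model scores HoG appearance and spring deformations, so the sizes and random scores below are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, n_locs = 4, 50       # toy chain: part i's parent is part i - 1
local = rng.normal(size=(n_parts, n_locs))          # local appearance scores
pair = rng.normal(size=(n_parts, n_locs, n_locs))   # pair[i][p, c]: deformation
                                                    # score, parent loc p, child loc c

# Max-sum dynamic programming from the last part up to the root (part 0):
msg = np.zeros(n_locs)
for i in range(n_parts - 1, 0, -1):
    # best subtree score rooted at part i, for each parent location p
    msg = np.max(local[i][None, :] + pair[i] + msg[None, :], axis=1)
S_initial = float(np.max(local[0] + msg))  # score of the best configuration
```

Because the graph is a tree, this pass is exact and linear in the number of parts, which is what makes scoring every body configuration tractable.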
Enhanced pose estimator
New score (schematically): S_new = S_initial + β·S_FG + η·S_OF
where S_FG and S_OF are the foreground and optical-flow scores, β and η are their weights, and a parameter α sets the relative importance of the two cues.
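A hedged sketch of the augmented score: S_FG and S_OF are computed here as ratios of active pixels in boxes scaled (by gamma and lambda, following the speaker notes) around the predicted part box. The exact weighting form is an assumption reconstructed from the slide text, not the paper's verbatim formula:

```python
import numpy as np

def ratio_score(mask, box, scale):
    """Fraction of active pixels (foreground, or moving per optical flow)
    inside a box scaled by `scale` around the predicted box center."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    w2, h2 = w * scale, h * scale
    sub = mask[max(0, int(cy - h2 / 2)):int(cy + h2 / 2),
               max(0, int(cx - w2 / 2)):int(cx + w2 / 2)]
    return float(sub.mean()) if sub.size else 0.0

def new_score(s_initial, fg_mask, of_mask, box,
              beta=1.0, eta=1.0, gamma=1.0, lam=1.0):
    # beta/eta weight the two cues; the relative-importance parameter alpha
    # from the slides is folded into beta/eta in this simplified form.
    s_fg = ratio_score(fg_mask, box, gamma)   # foreground score
    s_of = ratio_score(of_mask, box, lam)     # optical-flow score
    return s_initial + beta * s_fg + eta * s_of
```

Intuitively, the extra terms raise the score of part hypotheses that sit on foreground or moving pixels, correcting the off-the-shelf detector without retraining it.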
Enhanced pose estimator
[Figure: qualitative pose-estimation results; SoA vs. our approach with the optical-flow cue, and SoA vs. our approach with the foreground cue]
Enhanced pose estimator used to enrich the action recognition approach
[Figure: tuning the weights of the foreground and optical-flow terms in the new score]
Approach at a glance (recap)
Input video → low-level cues + body-part detector → information fused to produce an enriched descriptor → feature representation (accumulation over each video, or a Fisher kernel to model the temporal variation) → classifier → recognized activities
Fisher Kernel (FK) Theory
Fisher Kernel in the state of the art:
1. Introduced by Jaakkola & Haussler (NIPS 1999 [26]) for protein detection
2. Applied to web audio classification (Moreno, 2000)
3. Introduced in computer vision for image categorization by Perronnin et al. (CVPR 2007)
Fisher Kernel in image categorization vs. video analysis:
1. Models: spatial variation vs. temporal variation
2. Visual documents: small image patches vs. frames of the video
3. Initial feature vectors: SIFT vs. our novel descriptors for action recognition
Fisher Kernel (FK) Theory
- Combines the benefits of generative and discriminative approaches
- Represents a signal as the gradient of the probability density function of a generative model learned for that signal
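A minimal Fisher-vector sketch in the spirit of Perronnin's formulation, keeping only the gradients with respect to the GMM means; the rows of X stand in for the per-frame descriptors of one video. This is an illustrative simplification under those assumptions, not the paper's exact encoding:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Normalized gradient of the GMM log-likelihood w.r.t. the means."""
    T = X.shape[0]
    gamma = gmm.predict_proba(X)                      # T x K soft assignments
    diff = (X[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)
    G = (gamma[:, :, None] * diff).sum(axis=0)        # K x D gradient block
    G /= (T * np.sqrt(gmm.weights_))[:, None]
    return G.ravel()                                  # fixed-length encoding

# Usage: fit a diagonal GMM on training frame descriptors, then encode each
# video as one fixed-length vector regardless of its number of frames.
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(np.random.randn(1000, 80))
fv = fisher_vector(np.random.randn(120, 80), gmm)     # one video -> one vector
```

The encoding length is fixed (K × D) regardless of the video's duration, which is what lets a standard classifier consume videos of different lengths while still capturing their temporal variation.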
Results on the ADL Rochester dataset 
Conclusion
• We proposed a novel descriptor that combines high-level semantic information and low-level cues.
• We proposed an enhanced body-pose estimator.
• We model the temporal variation with a Fisher kernel representation.
Thank you!
References 
1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008, June). Learning 
realistic human actions from movies. In Computer Vision and Pattern 
Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.
2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale– 
invariant spatio–temporal interest point detector. Computer Vision–ECCV 
2008, 650–663. 
3. Hospedales, T., Gong, S., & Xiang, T. (2009, September). A Markov clustering topic model for mining behaviour in video. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 1165–1172). IEEE.
4. Zen, G., & Ricci, E. (2011, June). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
5. Brendel, W., & Todorovic, S. (2011, November). Learning spatiotemporal 
graphs of human activities. In Computer Vision (ICCV), 2011 IEEE International 
Conference on (pp. 778–785). IEEE.
References 
6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004, June). An algorithm for 
multiple object trajectory tracking. In Computer Vision and Pattern 
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on (Vol. 1, pp. I–864). IEEE.
7. Messing, R., Pal, C., & Kautz, H. (2009, September). Activity recognition 
using the velocity histories of tracked keypoints. In Computer Vision, 2009 
IEEE 12th International Conference on (pp. 104–111). IEEE. 
8. Fathi, A., and Mori, G. (2008, June). Action recognition by learning mid– 
level motion features. In Computer Vision and Pattern Recognition, 2008. 
CVPR 2008. IEEE Conference on (pp. 1–8). IEEE. 
9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio–Temporal 
phrases for activity recognition. Computer Vision–ECCV 2012, 707–721. 
10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. Computer Vision–ECCV 2010, 508–521.
References 
11. Gaur, U., Zhu, Y., Song, B., & Roy–Chowdhury, A. (2011, November). A 
“string of feature graphs” model for recognition of complex activities in 
natural videos. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2595–2602). IEEE.
12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei–Fei, L. (2008, January). 
Spatial–Temporal correlatons for unsupervised action classification. 
In Motion and video Computing, 2008. WMVC 2008. IEEE Workshop 
on (pp. 1–8). IEEE. 
13. Taralova, E., De la Torre, F., & Hebert, M. (2011, November). Source 
constrained clustering. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1927–1934). IEEE.
14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE.
15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007, June). Learning motion 
categories using both semantic and structural information. In Computer 
Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (pp. 1- 
6). IEEE.
References 
16. Liu, J., Luo, J., & Shah, M. (2009, June). Recognizing realistic actions from 
videos “in the wild”. In Computer Vision and Pattern Recognition, 2009. 
CVPR 2009. IEEE Conference on (pp. 1996-2003). IEEE. 
17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011, November). Probabilistic 
group-level motion analysis and scenario recognition. In Computer Vision 
(ICCV), 2011 IEEE International Conference on (pp. 747-754). IEEE. 
18. Kovashka, A., & Grauman, K. (2010, June). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 2046–2053). IEEE.
19. Gilbert, A., Illingworth, J., & Bowden, R. (2009, September). Fast realistic 
multi-action recognition using mined dense spatio-temporal features. 
In Computer Vision, 2009 IEEE 12th International Conference on (pp. 925- 
931). IEEE. 
20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).
References 
21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009, December). 
Hmm-based human motion recognition with optical flow data. 
In Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International 
Conference on (pp. 425-430). IEEE. 
22. Sadanand, S., & Corso, J. J. (2012, June). Action bank: A high-level 
representation of activity in video. In Computer Vision and Pattern 
Recognition (CVPR), 2012 IEEE Conference on (pp. 1234–1241). IEEE.
23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, 
W., & Windridge, D. (2011, January). An evaluation of bags-of-words and 
spatio-temporal shapes for action recognition. In Applications of 
Computer Vision (WACV), 2011 IEEE Workshop on (pp. 344-351). IEEE. 
24. Shechtman, E., & Irani, M. (2005). Space-time behavior based correlation. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on. IEEE.
25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012, May). Motion 
clustering-based action recognition technique using optical flow. 
In Informatics, Electronics & Vision (ICIEV), 2012 International Conference 
on (pp. 919-924). IEEE.
References 
26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in 
discriminative classifiers. Advances in neural information processing 
systems, 487-493.

Editor's Notes

  1. Hello, my name is Negar Rostamzadeh. I am working under the supervision of Prof. Nicu Sebe at the University of Trento, Italy. This is joint work with my colleagues Gloria Zen, Ionut Mironica and Jasper Uijlings.
  2. To start, I will talk about the difficulty of action recognition in daily living scenarios. I will first cover the related work, then present the steps of our approach and the action recognition and pose estimation results on the ADL dataset. Finally, I will draw some concluding remarks.
  3. Human-centric action recognition is a challenging problem both in images and in videos. In some video scenarios it is possible to extract a few frames and label the activity based on the information in those single frames, while in others it is very difficult to label the activity from just a few frames. As an example, take a look at these two images. In both, a person is looking at a cell phone, but without looking at the following frames we do not have enough cues to recognize whether he is going to answer the phone or dial a number. As you can see, fine-grained activities differ from each other only slightly in motion, appearance, and the objects present in the scene. Moreover, the same action can be performed in very different manners.
  4. Now let's see what has been done in the state of the art. One family of methods can be called object-centric approaches. They are based on defining objects of interest (here, say, humans) and detecting and tracking them across consecutive frames. These approaches provide high-level information about the detected objects, but they are unable to handle occlusions, and the broken-trajectories problem may occur.
  5. The other group of methods are non-object-centric approaches. These are popular nowadays and are based on low-level cues in a bag-of-words framework. They are more robust to noise and occlusions and computationally more efficient, but they also have limitations. First, by jumping directly from low-level cues to high-level decisions, semantic and high-level information is discarded. Moreover, the relationships among low-level cues are discarded when these cues are analyzed in a bag-of-words framework.
  6. Some approaches addressed these problems by enhancing the descriptors. Some incorporate the relationships among local features into the descriptor, such as relations among pairs of local features, relations within a local space or time neighborhood, and, recently, space-time phrases. Others enrich descriptors with a combination of different local features, such as local motion, local appearance, and positions. Finally, a third type of method enriches descriptors with a combination of local features and high-level information coming from detectors. Our approach belongs to this group: we define the different parts of the body as our semantic classes, detect them, and determine which motion belongs to which body part. Which body part causes what motion?
  7. In this work we enrich the descriptor with a combination of low- and high-level information. Local motion is obtained by optical flow and quantized into 8 directions. Detected body parts represent the different semantic classes. To fuse this information, we build a vector of 8 bins for each body part and collect the motions belonging to that body part; we then concatenate the motion vectors of all semantic classes. We represent the feature descriptors of each video in two ways: the first is a simple accumulation of the feature vectors over the video, and the second uses the Fisher kernel representation to model the temporal variation. A classifier then labels the videos, completing the recognition task. In the following section I will also discuss our enhanced body-part detection method and the use of the Fisher kernel to model the temporal variation.
  8. In the case of body-pose estimation, a significant drop in accuracy is observed when a detector is trained on one dataset and evaluated on a different one. A possible solution is to annotate the body-pose ground truth for the new dataset and re-train the classifier. However, this procedure is very expensive and introduces a considerable delay every time a new dataset has to be analyzed. Instead of training another classifier on the new dataset, we propose to use the already-trained classifier while providing some additional information from the new dataset.
  9. We build our approach on top of the Yang and Ramanan method, which is among the best in the literature. I will first summarize their approach. Their model is a pictorial structure: the body is modeled as an ideal template in a graph, where the single body parts are the nodes and the edges are represented by springs. Different configurations are produced by deforming the main template. Yang and Ramanan model this graph as a tree to simplify inference, so each node is connected by a spring only to its parent node. A score is then assigned to each possible body configuration by summing the single body-part scores and the pairwise scores. We call this the initial score, S_initial.
  10. Here I present our new score, which adds foreground and optical-flow terms. The FG score is the ratio of foreground pixels inside a box within the predicted bounding box; the size of this box is controlled by the parameter gamma. The OF score is computed similarly, as the ratio of optical-flow pixels in a box whose size is controlled by lambda. Beta and eta are the weights of the FG and OF scores, while alpha represents their relative importance, i.e., which of the optical-flow and foreground scores improves the detection rate more.
  11. Here I present the case where alpha equals 0, meaning only the foreground is considered. This figure shows the average body-pose estimation accuracy for different optical-flow box sizes and weights. In the sample shown, the left hand is not detected well, while by including the optical-flow information and increasing the score of the moving parts, our approach detects the left hand correctly. Similarly, I show the results for different values of beta and gamma: in the figures the right hand is mistakenly detected, while by including foreground information it is detected correctly.
  12. This figure presents the relative importance of the FG and OF scores and its effect on the accuracy. As shown in the paper, OF increases the detection rate of the parts that move the most, and FG increases the accuracy of the other parts, such as the stomach.
  13. Now that we have built our enriched descriptor, we apply the Fisher kernel representation.