ViPr Reading Group
Meeting# 02
AI&IC Lab, FCI
MMU Cyberjaya
02/05/2016
My Introduction
Saimunur Rahman
Graduate Research Assistant (JS)
Centre of Visual Computing
Multimedia University, Cyberjaya Campus
Facebook: fb.me/saimunur.rahman
Web: http://saimunur.github.io
Today’s Agenda
• Talk on “Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors” by L. Wang, Y. Qiao, and X. Tang, published at CVPR 2015.
Let's begin with some vision-based
action recognition basics
Action Recognition
Machine interpretation of human actions in video
Inherent Complexity
A single activity can be performed in many ways
Video Source: YouTube
Many degrees of freedom
Large capability set: 206 bones and 230 joints in total!!
Video Source: YouTube
View-specific
The same activity can look different from different viewpoints
Video Source: UCF-YouTube (Liu et al. 2009)
Subject dependency
The same activity can be performed differently by different people
Video Source: UCF-YouTube (Liu et al. 2009)
Why is it even relevant?
Video source: YouTube
Research Trend in Action Recognition
• Hand-crafted method
‒ Holistic or global method
‒ Localized method
• Learned-feature method
‒ Deep learning
• Fusion
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors
Limin Wang¹,², Yu Qiao², Xiaoou Tang¹,²
¹The Chinese University of Hong Kong
²Shenzhen Institutes of Advanced Technology
CVPR 2015 Poster
Total citations: 40 (18 in 2016)
Main Idea
• Utilize deep architectures to learn conv. feature maps
• Apply trajectory-based pooling to aggregate conv. features into effective descriptors
• Aims to combine the benefits of both hand-crafted and deep-learned features.
Motivations
• Hand-crafted methods lack discriminative capacity
• Current deep learning methods do not differentiate between the spatial and temporal domains
‒ They treat the temporal dimension as feature channels when an image-trained ConvNet is used to model videos
Contributions
• Modified two-stream CNN model [Simonyan and Zisserman, NIPS 2014] trained on UCF-101 [Soomro et al., CoRR 2012]
• Two normalization methods for CNN feature maps
• Thorough evaluation of the later convolutional layers (conv3, conv4, conv5)
• Multi-scale extension
• Multi-scale extension
Overview
Trajectory extraction
• Uses improved dense trajectories (iDT) [Wang et al., ICCV 2013]
• Camera motion removal
‒ Compute optical flow
‒ Estimate a homography using RANSAC [Fischler & Bolles, 1981]
‒ Match points between two frames using SURF descriptors and optical flow (OF)
‒ Re-compute the optical flow after warping (warped flow)
• Trajectory estimation
‒ Track points as in dense trajectories [Wang et al., 2011]
‒ Track points only at the original spatial scale (results are 2-3% lower than with multi-scale tracking) [Wang et al., 2011]
Image reproduced from Wang et al. 2013
[Figure: input video → trajectory detection]
Input video source: YouTube
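The point-tracking step above can be sketched as follows. This is a minimal pure-Python sketch, not the iDT implementation: it assumes the (camera-compensated) flow fields are given as nested lists `flow[y][x] = (dx, dy)`, samples the flow at the rounded point position, and omits iDT's median filtering of the flow and its pruning of static or erratic tracks. The name `track_point` and the default of 15 steps are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical layout: one flow field per frame pair, flow[y][x] = (dx, dy).
FlowField = List[List[Tuple[float, float]]]


def track_point(x: float, y: float, flows: List[FlowField],
                max_steps: int = 15) -> List[Tuple[float, float, int]]:
    """Follow one point through consecutive flow fields.

    Returns the trajectory as (x, y, t) tuples, starting at t = 0.
    Simplification: the flow is sampled at the rounded position, with no
    median filtering and no pruning of static or erratic tracks.
    """
    trajectory = [(x, y, 0)]
    height, width = len(flows[0]), len(flows[0][0])
    for t, flow in enumerate(flows[:max_steps]):
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < width and 0 <= yi < height):
            break  # the point has left the frame, so the track ends here
        dx, dy = flow[yi][xi]
        x, y = x + dx, y + dy
        trajectory.append((x, y, t + 1))
    return trajectory


if __name__ == "__main__":
    # Uniform rightward motion of one pixel per frame on a 5x5 grid.
    uniform = [[(1.0, 0.0)] * 5 for _ in range(5)]
    print(track_point(0.0, 0.0, [uniform] * 3))
```

With uniform motion the point simply advances one pixel per frame until it leaves the frame or the step budget runs out; real iDT tracks are also resampled densely every few frames.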
Feature map extraction
Two-stream network [Simonyan and Zisserman, NIPS 2014] built on the CNN-M-2048 model [Chatfield et al., BMVC 2014]
Proposed network model for both the spatial and temporal streams
Feature map extraction (2)
• Spatial net: applied frame by frame
• Temporal net: applied to a stacked optical flow volume (one frame is replicated)
• Trajectory mapping:
‒ Zero-pad every convolution and pooling layer by $\lfloor k/2 \rfloor$, where $k$ is the kernel size
‒ Trajectory point mapping: $(x, y, t) \rightarrow (r \times x, r \times y, t)$, where $r$ is the ratio of the feature-map size to the input-image size
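The trajectory mapping can be sketched as below. This is a minimal sketch under stated assumptions: the ratio r is taken as 1 over the product of the layer strides (the ⌊k/2⌋ zero-padding is what makes each stride-s layer shrink the map by about a factor of s), rounding is half-up, and the strides in the example (2, 2, 2, 2, giving r = 1/16) are hypothetical values for illustration, not quoted from the paper. Both helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def feature_map_ratio(strides: Sequence[int]) -> float:
    """Ratio r of feature-map size to input size for a stack of layers.

    Assumes every conv/pooling layer is zero-padded by floor(k/2), so a
    stride-s layer shrinks the map by (approximately) a factor of s.
    """
    ratio = 1.0
    for s in strides:
        ratio /= s
    return ratio


def map_point(x: float, y: float, t: int, ratio: float) -> Tuple[int, int, int]:
    """Map (x, y, t) in the input frame to (r*x, r*y, t), rounded half-up."""
    return math.floor(ratio * x + 0.5), math.floor(ratio * y + 0.5), t


if __name__ == "__main__":
    r = feature_map_ratio([2, 2, 2, 2])  # hypothetical strides -> r = 1/16
    print(r, map_point(100, 40, 3, r))   # 0.0625 (6, 3, 3)
```

Only the spatial coordinates are scaled; the frame index t is unchanged because each frame (or flow stack) produces its own feature map.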
Trajectory-pooled deep-convolutional descriptor (TDD)
• Local trajectory-aligned descriptor computed in a 3D volume around the trajectory.
• The volume size is $N \times N \times P$, where $N$ is the spatial size and $P$ is the trajectory length.
• Feature normalization (so that every channel lies in the same range and contributes equally)
‒ Spatiotemporal normalization: $\tilde{C}_{st}(x, y, t, n) = C(x, y, t, n) / \max_{x,y,t} C(x, y, t, n)$
‒ Channel normalization: $\tilde{C}_{ch}(x, y, t, n) = C(x, y, t, n) / \max_{n} C(x, y, t, n)$
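• TDD is computed by sum-pooling the normalized feature maps over the trajectory: $D(T_k, \tilde{C}_m) = \sum_{p=1}^{P} \tilde{C}_m\left(\overline{r_m \times x_p^k}, \overline{r_m \times y_p^k}, t_p^k\right)$, where $(x_p^k, y_p^k, t_p^k)$ is the $p$-th point of trajectory $T_k$, $r_m$ is the feature-map ratio of layer $m$, and $\overline{(\cdot)}$ denotes rounding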
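The two normalizations and the sum-pooling can be sketched in pure Python as below. Assumptions (not from the slides): a feature map is stored as nested lists indexed `c[t][y][x][n]`, activations are non-negative (they come from ReLU or max pooling), an all-zero maximum is left unscaled to avoid division by zero, and trajectory points that map outside the feature map are clamped to its border. All function names are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

# Hypothetical layout: c[t][y][x][n] for frame t, row y, column x, channel n.
FeatureMap = List[List[List[List[float]]]]


def _safe(m: float) -> float:
    return m if m > 0 else 1.0  # leave all-zero channels/positions unchanged


def spatiotemporal_normalize(c: FeatureMap) -> FeatureMap:
    """Divide each channel n by its maximum over all (x, y, t)."""
    T, H, W, N = len(c), len(c[0]), len(c[0][0]), len(c[0][0][0])
    maxima = [max(c[t][y][x][n] for t, y, x in product(range(T), range(H), range(W)))
              for n in range(N)]
    return [[[[c[t][y][x][n] / _safe(maxima[n]) for n in range(N)]
              for x in range(W)] for y in range(H)] for t in range(T)]


def channel_normalize(c: FeatureMap) -> FeatureMap:
    """Divide each position's channel vector by its maximum over channels n."""
    return [[[[v / _safe(max(vec)) for v in vec] for vec in row]
             for row in frame] for frame in c]


def tdd(c: FeatureMap, trajectory: Sequence[Tuple[float, float, int]],
        ratio: float) -> List[float]:
    """Sum-pool the (normalized) map along one trajectory: one value per channel."""
    H, W, N = len(c[0]), len(c[0][0]), len(c[0][0][0])
    descriptor = [0.0] * N
    for x, y, t in trajectory:
        # Rounded feature-map position, clamped to the map borders (assumption).
        fx = min(max(math.floor(ratio * x + 0.5), 0), W - 1)
        fy = min(max(math.floor(ratio * y + 0.5), 0), H - 1)
        for n in range(N):
            descriptor[n] += c[t][fy][fx][n]
    return descriptor
```

Following the slides, one would normalize the maps of a layer first and then call `tdd` with that layer's ratio r_m for every trajectory; the per-trajectory descriptors are what the Fisher vector encoding later aggregates.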
Multi-scale TDD
1. Multi-scale pyramid representations of video frames and optical flow fields.
2. Pyramid representations are fed into the two-stream ConvNets to obtain multi-scale feature maps
3. Calculate multi-scale TDD with $(x, y, t) \rightarrow (r_m \times s \times x, r_m \times s \times y, t)$, where $s$ is the scale of the features and $s \in \{1/2, 1/\sqrt{2}, 1, \sqrt{2}, 2\}$
[Figure: spatial net pyramid and temporal net pyramid]
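The multi-scale mapping can be sketched as below, assuming the five scales listed above and the same half-up rounding as in the single-scale case. `multiscale_positions` is a hypothetical helper, and r_m = 1/16 in the usage line is an illustrative value, not one taken from the slides.

```python
import math
from typing import Dict, Sequence, Tuple

# Pyramid scales s from the slide: 1/2, 1/sqrt(2), 1, sqrt(2), 2.
SCALES: Tuple[float, ...] = (0.5, 1 / math.sqrt(2), 1.0, math.sqrt(2), 2.0)


def multiscale_positions(x: float, y: float, t: int, ratio: float,
                         scales: Sequence[float] = SCALES
                         ) -> Dict[float, Tuple[int, int, int]]:
    """Map (x, y, t) to (r_m*s*x, r_m*s*y, t) on the feature map of each scale s."""
    return {s: (math.floor(ratio * s * x + 0.5), math.floor(ratio * s * y + 0.5), t)
            for s in scales}


if __name__ == "__main__":
    for s, pos in multiscale_positions(64, 32, 0, 1 / 16).items():
        print(f"s = {s:.3f}: {pos}")
```

Each scale has its own feature map, so the same trajectory is pooled once per scale, and each scale yields its own set of descriptors.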
Datasets
• HMDB51 [Kuehne et al., ICCV 2011]
• 6,766 video clips from 51 action categories
• Three splits for evaluation; each split has 70% training and 30% testing samples
51 action classes
Datasets
• UCF-101
• 13,320 video clips from 101 action categories
• THUMOS13 challenge evaluation scheme with three training/testing splits
101 action classes
Implementation - ConvNet Training
• Spatial Net
1. UCF-101 first split → resize frames to 256x256 → random 224x224 crop → random horizontal flip
2. Initialize the network with the publicly available pre-trained model from Chatfield et al. (BMVC 2014)
3. Fine-tune the model parameters on the UCF101 dataset (full dataset)
• Temporal Net
1. 3D flow volume → resize to 256x256x.. → random 224x224x20 crop → random horizontal flip → selection of 10 frames (balancing performance and efficiency)
2. Train the temporal net on UCF101 from scratch
3. High dropout ratio for FC6 and FC7 to improve the generalization capacity of the trained model
(the training dataset is relatively small!!)
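The random crop and flip can be sketched as below for a list of HxW channels (RGB for the spatial net, or 2 x 10 stacked flow components for the temporal net). Assumptions: channels are nested lists, the indices of the x-displacement channels are passed in explicitly, and, because mirroring a frame reverses horizontal motion, flipped x-flow values are negated. The function name is hypothetical, and the slides do not say exactly how the flip was applied to the flow.

```python
import random
from typing import Collection, List, Optional, Tuple

Channel = List[List[float]]


def random_crop_flip(channels: List[Channel], crop: int = 224,
                     x_flow_channels: Collection[int] = (),
                     rng: Optional[random.Random] = None
                     ) -> Tuple[List[Channel], bool]:
    """Apply one random crop (and, with probability 0.5, one horizontal flip) to all channels."""
    if rng is None:
        rng = random.Random()
    height, width = len(channels[0]), len(channels[0][0])
    top = rng.randint(0, height - crop)  # same crop offset for every channel
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5
    out = []
    for index, channel in enumerate(channels):
        patch = [row[left:left + crop] for row in channel[top:top + crop]]
        if flip:
            patch = [row[::-1] for row in patch]
            if index in x_flow_channels:
                # Mirroring reverses horizontal motion, so dx changes sign.
                patch = [[-v for v in row] for row in patch]
        out.append(patch)
    return out, flip
```

Applying the same offsets and the same flip decision to every channel keeps the stacked flow volume spatially consistent across its 10 frames.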
Implementation – Feature Encoding
• Fisher vector (FV) encoding [Sanchez et al., IJCV 2013]
• GMM with K = 256 clusters
• PCA reduces the descriptor dimension D; the FV then has 2KD dimensions, where D is the feature (vector) dimension!!
• Linear SVM as the classifier (C = 100)
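The Fisher vector encoding can be sketched in pure Python as below, following the improved FV of Sanchez et al. (gradients with respect to the GMM means and standard deviations, followed by power and L2 normalization). Assumptions: the diagonal GMM (weights, means, per-dimension standard deviations) is already fitted, e.g. with K = 256 as above; the descriptors are already PCA-reduced to D dimensions; the function name is hypothetical. It is meant to show where the 2KD length comes from, not to reproduce the authors' implementation.

```python
import math
from typing import List, Sequence


def fisher_vector(descriptors: Sequence[Sequence[float]],
                  weights: Sequence[float],
                  means: Sequence[Sequence[float]],
                  sigmas: Sequence[Sequence[float]]) -> List[float]:
    """Improved Fisher vector of a descriptor set under a diagonal GMM.

    Returns 2*K*D values: for each Gaussian k, D mean gradients then D
    standard-deviation gradients, after power and L2 normalization.
    """
    K, D = len(means), len(means[0])
    log_norm = -0.5 * D * math.log(2 * math.pi)
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Log of w_k * N(x | mu_k, diag(sigma_k^2)) for every Gaussian k.
        log_p = []
        for k in range(K):
            z2 = sum(((x[d] - means[k][d]) / sigmas[k][d]) ** 2 for d in range(D))
            log_det = sum(math.log(s) for s in sigmas[k])
            log_p.append(math.log(weights[k]) + log_norm - log_det - 0.5 * z2)
        # Soft assignments (posteriors), computed stably via log-sum-exp.
        top = max(log_p)
        post = [math.exp(v - top) for v in log_p]
        total = sum(post)
        for k in range(K):
            gamma = post[k] / total
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                g_mu[k][d] += gamma * z
                g_sigma[k][d] += gamma * (z * z - 1.0)
    n = len(descriptors)
    fv: List[float] = []
    for k in range(K):
        fv.extend(g / (n * math.sqrt(weights[k])) for g in g_mu[k])
        fv.extend(g / (n * math.sqrt(2.0 * weights[k])) for g in g_sigma[k])
    # Power normalization (signed square root), then L2 normalization.
    fv = [math.copysign(math.sqrt(abs(v)), v) for v in fv]
    norm = math.sqrt(sum(v * v for v in fv)) or 1.0
    return [v / norm for v in fv]
```

With K = 256, every D-dimensional descriptor type yields a 512·D-dimensional FV, so the PCA dimension D directly controls the size of the linear SVM input.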
Experimental Results
• Shape is important!! See iDT vs. HOF+MBH
• Motion features perform better in the two-stream ConvNet
‒ See temporal net
• Earlier conv. layers are better for both nets
• Spatial conv4+conv5 is slightly better on UCF-101
• Temporal conv4+conv5 is better on HMDB51
• Combining with iDT further boosts TDD
• 63.2% → 65.9% (HMDB51)
• 90.3% → 91.5% (UCF-101)
Additional Exploration Experiments
Performance with PCA-reduced dimensions.
Comparison of different normalization methods.
Performance on HMDB51
ConvNet Layer performance
• Conv1 and Conv2 are outputs of the max pooling layers after the convolution operations
• Conv3, Conv4 and Conv5 are outputs of ReLU activations
• Observation: earlier layers perform better than later ones, e.g., conv3 in the temporal ConvNet
Comparison with state-of-the-art
A one-stream CNN achieves a similar result on UCF-101: 91.1% (Ma et al., arXiv 2016)
Conclusions
• A way of exploiting 2D CNN models for action recognition
• Raw image values and optical flow are used for model training
• Normalizing feature maps increases performance
• Single-trajectory features are good enough to achieve competitive performance
• Later conv layers offer more discriminative features
• Hand-crafted features can further boost performance
Some important information about TDD
• Spatial (pre-trained and fine-tuned) and Temporal model are available online
• Dense optical flow and trajectory code is also available online
• A ready-to-go main script (MATLAB) for Linux is also available online
• For CNN the Caffe toolbox (Python) was used!!
Thank You

clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
 

Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)

  • 1. ViPr Reading Group Meeting# 02 AI&IC Lab, FCI MMU Cyberjaya 02/05/2016
  • 2. My Introduction Saimunur Rahman Graduate Research Assistant (JS) Centre of Visual Computing Multimedia University, Cyberjaya Campus Facebook: fb.me/saimunur.rahman Web: http://saimunur.github.io
  • 3. Today’s Agenda • Talk on “Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors” by L. Wang, Y. Qiao, and X. Tang, published at CVPR 2015. Let's begin with some vision-based action recognition basics
  • 4. Action Recognition Machine interpretation of human actions in video
  • 5. Inherent Complexity A single activity can be performed in many ways Video Source: YouTube
  • 6. Many degrees of freedom A large capability set: 206 bones and 230 joints in total! Video Source: YouTube
  • 7. View specific The same activity can look different from different viewpoints Video Source: UCF-YouTube (Liu et al. 2009)
  • 8. Subject dependency The same activity can be performed differently by different people Video Source: UCF-YouTube (Liu et al. 2009)
  • 9. Why is it even relevant? Video source: YouTube
  • 10. Research Trend in Action Recognition • Hand-crafted method ‒ Holistic or global method ‒ Localized method • Unsupervised method ‒ Deep learning • Fusion
  • 11. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors • Limin Wang (1,2), Yu Qiao (2), Xiaoou Tang (1,2) • (1) The Chinese University of Hong Kong, (2) Shenzhen Institutes of Advanced Technology • CVPR 2015 Poster • Total citations: 40 (18 in 2016)
  • 12. Main Idea • Utilize deep architectures to learn convolutional feature maps • Apply trajectory-based pooling to aggregate convolutional features into effective descriptors • Aim to combine the benefits of both hand-crafted and deep-learned features
  • 13. Motivations • Hand-crafted methods lack discriminative capacity • Current deep learning methods do not differentiate between the spatial and temporal domains ‒ They treat the temporal dimension as feature channels when an image-trained ConvNet is used to model videos
  • 14. Contributions • Modified two-stream CNN model [Simonyan and Zisserman, NIPS 2014] trained on UCF-101 [Soomro et al., CoRR 2012] • Two normalization methods for CNN feature maps • Thorough evaluation of the later convolutional layers (conv3, conv4, conv5) • Multi-scale extension
  • 16. Trajectory extraction • Uses improved dense trajectories (iDT) [Wang et al., ICCV 13] • Camera motion removal ‒ Compute optical flow ‒ Estimate a homography with RANSAC [Fischler & Bolles, 1981] ‒ Use SURF and optical flow (OF) matches to find correspondences between two frames ‒ Re-compute the optical flow on the warped frame (warped flow) • Trajectory estimation ‒ Trajectories are tracked as in dense trajectories [Wang et al. 11] ‒ Points are tracked only at the original spatial scale (results are 2-3% lower than multi-scale tracking) [Wang et al. 11] [Figure: input video → trajectory detection; image reproduced from Wang et al. 2013; input video source: YouTube]
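The trajectory-estimation step can be sketched for a single point as below. This is a minimal NumPy illustration of the median-filtered tracking rule of dense trajectories [Wang et al. 11], P(t+1) = P(t) + (M * ω) evaluated at the rounded P(t), where M is a 3x3 median filter and ω the dense flow field. It assumes the flow fields are already computed (for iDT, already camera-motion compensated), and the function names are hypothetical. The real trajectory code tracks all densely sampled points at once and also prunes static or erratic trajectories, which is omitted here.

```python
import numpy as np

def median_filter3x3(channel):
    """3x3 median filter with edge replication (a plain-NumPy stand-in for a library filter)."""
    h, w = channel.shape
    padded = np.pad(channel, 1, mode="edge")
    # Stack the 9 shifted views of the image and take the per-pixel median.
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(shifted, axis=0), axis=0)

def track_point(x, y, flows):
    """Track one point through a sequence of dense flow fields, each of shape (H, W, 2).

    Applies P_{t+1} = P_t + (M * omega)(round(P_t)), where M is a 3x3 median
    filter and omega = (u, v) is the flow field. Returns the list of (x, y) positions.
    """
    points = [(float(x), float(y))]
    for flow in flows:
        h, w = flow.shape[:2]
        u = median_filter3x3(flow[..., 0])
        v = median_filter3x3(flow[..., 1])
        # Sample the filtered flow at the rounded current position.
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        x, y = x + float(u[yi, xi]), y + float(v[yi, xi])
        points.append((x, y))
        if not (0 <= x < w and 0 <= y < h):
            break  # the point left the frame, so the trajectory ends here
    return points

# Usage: a constant flow of (+1, +0.5) pixels per frame over four frames.
flows = [np.tile(np.array([1.0, 0.5]), (10, 10, 1)) for _ in range(4)]
trajectory = track_point(2, 2, flows)
print(trajectory[-1])  # (6.0, 4.0)
```

The median filter is what makes the tracking robust to isolated flow errors: a single outlier pixel in the 3x3 neighbourhood does not move the point.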
  • 17. Feature map extraction • Based on the two-stream network [Simonyan and Zisserman, NIPS 2014] • Uses the CNN-M-2048 model [Chatfield et al., BMVC 2014] [Figure: proposed network model of both the spatial and temporal streams]
  • 18. Feature map extraction (2) • Spatial net: applied frame by frame • Temporal net: applied to stacked optical-flow volumes (one frame is replicated) • Trajectory mapping: ‒ Zero-padding of $\lfloor k/2 \rfloor$, where $k$ is the kernel size of each convolution and pooling layer ‒ Trajectory point mapping: $(x, y, t) \rightarrow (r \times x, r \times y, t)$, where $r$ is the ratio of the feature-map size to the input-image size
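The trajectory point mapping above can be sketched as follows. Padding every convolution and pooling layer by ⌊k/2⌋ roughly keeps feature-map cells aligned with input pixels, which is why a single scale factor r is a reasonable approximation. In this minimal sketch the rounding to the nearest cell, the clipping to the map bounds, the function name, and the example values (r = 1/16, a 14x14 map) are all assumptions, not values taken from the slide.

```python
import numpy as np

def map_to_feature_map(points, r, map_h, map_w):
    """Map trajectory points (x, y, t) in input pixels to feature-map cells.

    Implements (x, y, t) -> (r*x, r*y, t), then rounds to the nearest cell
    and clips to the map bounds (the clipping is an assumption of this sketch).
    """
    pts = np.asarray(points, dtype=float)
    xs = np.clip(np.rint(r * pts[:, 0]), 0, map_w - 1).astype(int)
    ys = np.clip(np.rint(r * pts[:, 1]), 0, map_h - 1).astype(int)
    ts = pts[:, 2].astype(int)  # the temporal index is unchanged
    return np.column_stack([xs, ys, ts])

# Usage with a hypothetical ratio r = 1/16 and a 14x14 feature map (224 / 16 = 14).
cells = map_to_feature_map([(0, 0, 0), (32, 48, 1), (100, 210, 2), (400, 400, 3)], 1 / 16, 14, 14)
print(cells.tolist())  # [[0, 0, 0], [2, 3, 1], [6, 13, 2], [13, 13, 3]]
```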
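  • 19. Trajectory-pooled descriptor (TDD) • A local trajectory-aligned descriptor is computed in a 3D volume around the trajectory. • The volume size is $N \times N \times P$, where $N$ is the spatial size and $P$ is the trajectory length. • Feature normalization (ensures that all values lie in the same range and that each channel contributes equally) ‒ Spatiotemporal normalization: $\tilde{C}_{st}(x, y, t, n) = C(x, y, t, n) / \max_{x, y, t} C(x, y, t, n)$ ‒ Channel normalization: $\tilde{C}_{ch}(x, y, t, n) = C(x, y, t, n) / \max_{n} C(x, y, t, n)$ • The TDD is computed by sum-pooling a normalized feature map over the trajectory: $D(T_k, \tilde{C}_m) = \sum_{p=1}^{P} \tilde{C}_m\left(\overline{r_m \times x_p^k}, \overline{r_m \times y_p^k}, t_p^k\right)$, where $(x_p^k, y_p^k, t_p^k)$ is the $p$-th point of trajectory $T_k$, $r_m$ is the map-size ratio of the $m$-th feature map $\tilde{C}_m$, and $\overline{\cdot}$ denotes rounding to the nearest integer.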
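A minimal NumPy sketch of the two normalizations and the sum-pooling formula for D follows. It assumes one feature map stored as a dense array indexed C[x, y, t, n] (mirroring the slide's notation) and a trajectory whose points are already mapped and rounded to map coordinates, as in the mapping step of slide 18. The function names and the EPS guard are assumptions. The sketch pools only at the trajectory points, exactly as the formula for D is written, rather than over the full N × N × P volume.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for all-zero (post-ReLU) channels

def spatiotemporal_normalize(C):
    """C_st(x, y, t, n) = C(x, y, t, n) / max over (x, y, t) of channel n."""
    return C / (C.max(axis=(0, 1, 2), keepdims=True) + EPS)

def channel_normalize(C):
    """C_ch(x, y, t, n) = C(x, y, t, n) / max over n at position (x, y, t)."""
    return C / (C.max(axis=3, keepdims=True) + EPS)

def tdd(C_norm, cells):
    """Sum-pool one normalized feature map along one trajectory.

    C_norm is indexed as C_norm[x, y, t, n]; cells holds one (x, y, t) row per
    trajectory point, already in map coordinates. Returns an N-dim descriptor.
    """
    cells = np.asarray(cells, dtype=int)
    # Advanced indexing picks an (P, N) block: one channel vector per point.
    return C_norm[cells[:, 0], cells[:, 1], cells[:, 2]].sum(axis=0)

# Usage: channel 1 is uniformly twice as strong as channel 0.
C = np.ones((4, 4, 3, 2))
C[..., 1] = 2.0
traj = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
print(tdd(spatiotemporal_normalize(C), traj))  # ≈ [3, 3]
print(tdd(channel_normalize(C), traj))         # ≈ [1.5, 3]
```

The usage shows the difference between the two schemes: spatiotemporal normalization equalizes the two channels, while channel normalization preserves their relative strength at each position.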
  • 20. Multi-scale TDD 1. Build multi-scale pyramid representations of the video frames and optical flow fields. 2. Feed the pyramid representations into the two-stream ConvNets to obtain multi-scale feature maps. 3. Calculate the multi-scale TDD with the mapping $(x, y, t) \rightarrow (r_m \times s \times x, r_m \times s \times y, t)$, where $s$ is the pyramid scale and $s \in \{1/2, 1/\sqrt{2}, 1, \sqrt{2}, 2\}$ [Figures: spatial net pyramid; temporal net pyramid]
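Step 3 can be illustrated with a small sketch that maps each trajectory point into every pyramid level using the five scales listed above. The ratio r_m = 1/16, the rounding, and the function name are assumptions for illustration; bounds handling is omitted for brevity.

```python
import math
import numpy as np

# Pyramid scales from the slide: 1/2, 1/sqrt(2), 1, sqrt(2), 2.
SCALES = (0.5, 1 / math.sqrt(2), 1.0, math.sqrt(2), 2.0)

def multiscale_cells(points, r_m, scales=SCALES):
    """Map each trajectory point (x, y, t) into every pyramid level.

    Implements (x, y, t) -> (r_m * s * x, r_m * s * y, t) for each scale s,
    rounding to the nearest feature-map cell. Returns {s: (P, 3) int array}.
    """
    pts = np.asarray(points, dtype=float)
    out = {}
    for s in scales:
        xy = np.rint(r_m * s * pts[:, :2]).astype(int)
        out[s] = np.column_stack([xy, pts[:, 2].astype(int)])
    return out

# Usage: one point at pixel (32, 64) in frame 0, hypothetical r_m = 1/16.
cells = multiscale_cells([(32, 64, 0)], r_m=1 / 16)
for s, c in cells.items():
    print(f"s = {s:.3f}: {c.tolist()}")
```

The trajectory coordinates are the same at every level; only the scale factor changes, so one set of trajectories yields five pooled descriptors per feature map.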
  • 21. Datasets • HMDB51 [Kuehne et al., ICCV 2011] • 6,766 video clips from 51 action categories • 3 splits for evaluation; each split has 70% training and 30% testing samples [Figure: 51 action classes]
  • 22. Datasets • UCF-101 • 13,320 video clips from 101 action categories • THUMOS13 challenge evaluation scheme with three training/testing splits [Figure: 101 action classes]
  • 23. Implementation - ConvNet Training • Spatial Net 1. UCF-101 first split → resize frames to 256x256 → random 224x224 crop → random horizontal flip 2. Initialize the network with the publicly available pre-trained model from Chatfield et al. (BMVC 2014) 3. Fine-tune the model parameters on the UCF101 dataset (full dataset) • Temporal Net 1. 3D volume → resize to 256x256x.. → random 224x224x20 crop → random horizontal flip → selection of 10 frames (to balance performance and efficiency) 2. Train the temporal net on UCF101 from scratch 3. Use a high dropout ratio for FC6 and FC7 to improve the generalization capacity of the trained model (the training dataset is relatively small!)
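The crop-and-flip augmentation in both pipelines can be sketched as below. This is a minimal NumPy illustration, not the authors' Caffe data layer. It assumes inputs are already resized to 256x256 and reads the temporal input "224x224x20" as 10 frames × 2 flow components, interleaved as (u_1, v_1, u_2, v_2, ...); that layout, the sign flip of the horizontal flow components after mirroring, and the function name are all assumptions of this sketch.

```python
import numpy as np

def random_crop_flip(volume, crop=224, rng=None, is_flow=False):
    """Random crop to (crop, crop, C) plus a random horizontal flip (p = 0.5).

    volume: (H, W, C) array, assumed already resized to 256x256 (an RGB frame,
    or a stacked flow volume of 10 frames x 2 components = 20 channels).
    For flow, channels are assumed interleaved as (u_1, v_1, u_2, v_2, ...);
    after a flip the horizontal components u change sign (an assumption of
    this sketch, since mirroring an image reverses horizontal motion).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = volume.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    out = volume[top:top + crop, left:left + crop].copy()
    if rng.random() < 0.5:
        out = out[:, ::-1].copy()   # mirror along the width axis
        if is_flow:
            out[..., 0::2] *= -1    # horizontal flow components flip sign
    return out

rng = np.random.default_rng(0)
frame = rng.random((256, 256, 3))                          # spatial-net input
flow = np.tile(np.array([1.0, 0.0]), (256, 256, 10))       # u = 1, v = 0 for 10 frames
print(random_crop_flip(frame, rng=rng).shape)              # (224, 224, 3)
print(random_crop_flip(flow, rng=rng, is_flow=True).shape) # (224, 224, 20)
```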
  • 24. Implementation – Feature Encoding • Fisher vector (FV) encoding [Sanchez et al., IJCV 2013] • GMM with $K = 256$ clusters • PCA reduces the descriptor dimension $D$; the FV has $2KD$ dimensions, where $D$ is the feature (vector) dimension! • Linear SVM as the classifier ($C = 100$)
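A minimal sketch of Fisher vector encoding under a diagonal-covariance GMM follows, showing where the 2KD dimension comes from: one D-dimensional gradient with respect to the means and one with respect to the standard deviations for each of the K components. It applies the power (signed square-root) and L2 normalization of the improved FV described by Sanchez et al.; the slide does not state which normalization TDD used on top of the FV, and the function name is hypothetical.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Fisher vector of descriptors X (T, D) under a diagonal GMM with K components.

    weights: (K,) mixture weights; means: (K, D); sigmas: (K, D) standard deviations.
    Returns a 2*K*D vector: gradients w.r.t. the means and the standard deviations,
    followed by power (signed square-root) and L2 normalization.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    T = X.shape[0]
    # Standardized differences, shape (T, K, D).
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # Log of w_k * N(x_t; mu_k, sigma_k^2), shape (T, K).
    log_p = -0.5 * np.sum(diff ** 2 + np.log(2 * np.pi * sigmas[None] ** 2), axis=2)
    log_wp = log_p + np.log(weights)[None, :]
    # Posterior (soft assignment) of each descriptor to each component.
    log_wp -= log_wp.max(axis=1, keepdims=True)
    gamma = np.exp(log_wp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # First- and second-order statistics, each of shape (K, D).
    g_mu = np.einsum("tk,tkd->kd", gamma, diff) / (T * np.sqrt(weights)[:, None])
    g_sigma = np.einsum("tk,tkd->kd", gamma, diff ** 2 - 1) / (T * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))   # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv     # L2 normalization

# Toy usage with random, untrained GMM parameters (K = 4, D = 3).
rng = np.random.default_rng(0)
K, D = 4, 3
fv = fisher_vector(rng.normal(size=(50, D)), np.full(K, 1 / K),
                   rng.normal(size=(K, D)), np.ones((K, D)))
print(fv.shape)  # (24,)
```

In practice the GMM would be fitted on training descriptors first; with scikit-learn's `GaussianMixture(n_components=256, covariance_type="diag")`, the fitted `weights_`, `means_` and `np.sqrt(covariances_)` supply the three parameters. With a hypothetical PCA output of D = 64, the FV would have 2 × 256 × 64 = 32,768 dimensions per descriptor type.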
  • 25. Experimental Results • Shape is important! See iDT vs. HOF+MBH • Motion features perform better in the two-stream ConvNet ‒ See Temporal Net • Earlier conv layers are better for both nets • Spatial conv4+5 is slightly better on UCF-101 • Temporal conv4+5 is better on HMDB51 • iDT can further boost TDD • 63.2% → 65.9% (HMDB51) • 90.3% → 91.5% (UCF-101)
  • 26. Additional Exploration Experiments [Figures: performance with PCA-reduced dimension; comparison of different normalization methods; performance on HMDB51]
  • 27. ConvNet layer performance • Conv1 and Conv2 are outputs of the max-pooling layers after the convolution operations • Conv3, Conv4 and Conv5 are outputs of the ReLU activations • Observation: earlier layers can perform better than later ones, e.g. conv3 in the temporal ConvNet
  • 28. Comparison with state-of-the-art • A one-stream CNN achieves a similar result on UCF-101: 91.1% (Ma et al., arXiv 2016)
  • 29. Conclusions • Exploits 2D CNN models for action recognition • Uses raw image values and optical flow for model training • Normalization of feature maps increases performance • Single-trajectory features are good enough to achieve competitive performance • Late conv layers offer more discriminative features • Hand-crafted features can further boost performance
  • 30. Some important information about TDD • The spatial (pre-trained and fine-tuned) and temporal models are available online • Dense optical flow and trajectory code is also available online • A ready-to-go main script (MATLAB) for Linux is also available online • The Caffe toolbox (Python) was used for the CNNs!