Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)

[course site]
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Video Analysis
Day 2 Lecture 2
#DLUPC

Acknowledgments
Víctor Campos Alberto Montes

Outline
1. Recognition
2. Optical Flow
3. Object Tracking

Recognition
Demo: Clarifai
MIT Technology Review : “A start-up’s Neural Network Can Understand Video” (3/2/2015)

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with
convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
Recognition

(Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra
Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project]
Activity Recognition: Datasets

Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015

Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." ICCV 2015

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L.
Large-scale video classification with convolutional neural networks. CVPR 2014
Slides extracted from ReadCV seminar by Víctor Campos
Recognition: DeepVideo

Recognition: DeepVideo: Demo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video
classification with convolutional neural networks. CVPR 2014

Recognition: DeepVideo: Architectures

Recognition: DeepVideo: Features

Recognition: DeepVideo: Multiscale
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks.
CVPR 2014

Recognition: DeepVideo: Results
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks.
CVPR 2014

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
Activity Recognition: Frames + LSTM

Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
Activity Recognition: Frames + Optical Flow + LSTM

Recognition
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." ICCV 2015

Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in
videos." NIPS 2014.
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for Optical flow (hand-crafted features)
Fusion after the softmax layer
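Late fusion can be illustrated with a minimal sketch. It assumes each stream outputs one logit per class and that the two softmax outputs are combined by a weighted average; the function names, the example logits and the `flow_weight` parameter are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of the class posteriors of the two streams.

    flow_weight is an illustrative assumption, not a value from the paper.
    """
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_probs, flow_probs)]

# Made-up logits for three classes: the RGB stream prefers class 0,
# the optical flow stream prefers class 1 more strongly.
fused = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the streams only meet after their classifiers, each one can be trained independently before the scores are combined.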

Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Recognition: Two stream
Two CNNs in parallel:
● One for RGB images
● One for Optical flow (hand-crafted features)
Fusion at a convolutional layer
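One possible way to fuse at a convolutional layer is sketched below, assuming the two streams produce feature maps of the same spatial size that are concatenated along the channel axis and mixed by a 1 × 1 convolution. This is only one of several fusion functions that could be used; the function name, the weights and the tiny feature maps are hypothetical.

```python
def conv_fusion(rgb_maps, flow_maps, weights, bias):
    """Concatenate channels of both streams and apply a 1x1 convolution.

    rgb_maps, flow_maps: lists of channels, each channel a [H][W] grid.
    weights: [C_out][C_rgb + C_flow], bias: [C_out] (illustrative values).
    """
    stacked = rgb_maps + flow_maps  # channel-wise concatenation
    height, width = len(stacked[0]), len(stacked[0][0])
    fused = []
    for channel_weights, b in zip(weights, bias):
        fused.append([
            [b + sum(w * ch[y][x] for w, ch in zip(channel_weights, stacked))
             for x in range(width)]
            for y in range(height)
        ])
    return fused

rgb_maps = [[[1.0, 2.0], [3.0, 4.0]]]        # one appearance channel
flow_maps = [[[10.0, 20.0], [30.0, 40.0]]]   # one motion channel
fused_maps = conv_fusion(rgb_maps, flow_maps, weights=[[0.5, 0.5]], bias=[0.0])
```

Unlike fusion after the softmax, each output position here mixes appearance and motion responses from the same spatial location.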

Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015

Recognition: C3D
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International
Conference on Computer Vision, pp. 4489-4497. 2015

Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Demo

Recognition: C3D: Temporal dimension
3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets
Temporal depth
2D ConvNets

A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best
performing architectures for 3D ConvNets
Recognition: C3D: Temporal dimension
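A minimal sketch of a single 3 × 3 × 3 kernel sliding over a clip, assuming a single-channel input stored as nested lists indexed [time][row][column], no padding and stride 1. As is usual in deep learning, the operation is a cross-correlation; the function name and toy data are hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Slide a 3D kernel over a [T][H][W] volume (no padding, stride 1)."""
    t, h, w = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(t - kt + 1):
        plane = []
        for y in range(h - kh + 1):
            row = []
            for x in range(w - kw + 1):
                acc = 0.0
                for dz in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy clip of 4 frames of 4x4 pixels, all ones, and an all-ones 3x3x3 kernel.
clip = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
response = conv3d_valid(clip, kernel)
```

The output keeps a temporal axis (two time steps here), whereas a 2D convolution applied to the stacked frames would collapse the temporal dimension after the first layer.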

Recognition: C3D: Architecture
Feature vector

Recognition: C3D: Feature vector
16-frame clip, 16-frame clip, 16-frame clip, 16-frame clip, ...
Average: 4096-dim video descriptor
L2 norm: 4096-dim video descriptor
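The pooling of clip features into a video descriptor can be sketched as follows, assuming one feature vector per 16-frame clip is already available. The function name is hypothetical, and 2-dimensional vectors stand in for the 4096-dimensional ones to keep the example readable.

```python
import math

def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalise the result."""
    num_clips = len(clip_features)
    dim = len(clip_features[0])
    mean = [sum(clip[i] for clip in clip_features) / num_clips
            for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Guard against an all-zero average to avoid dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean

# Two toy clip descriptors whose mean is [3.0, 4.0], which has L2 norm 5.0.
descriptor = video_descriptor([[3.0, 0.0], [3.0, 8.0]])
```

After normalisation the descriptor has unit length, so videos with different numbers of clips become directly comparable.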

BSc
thesis
http://activity-net.org/
Temporal Activity Detection

Videos
Activity Classification
Longboarding

Videos
Activity Temporal Localization
Longboarding

Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
(Slidecast and Slides by Alberto Montes)
Recognition: Localization

(Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in
untrimmed videos via multi-stage cnns." CVPR 2016.

(Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in
untrimmed videos via multi-stage cnns." CVPR 2016.

Neural Network
Activity

Activity
CNN + RNN

Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
3D Convolutions over sets of 16 frames...


mAP = 0.5938 mAP = 0.5492 mAP = 0.5635
Deeper networks overfit.

Ground Truth:
Playing water polo
Prediction:
0.765 Playing water polo
0.202 Swimming
0.007 Springboard diving

Ground Truth:
Hopscotch
Prediction:
0.848 Running a marathon
0.023 Triple jump
0.022 Javelin throw

Montes, A., Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal
Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st
NIPS Workshop on Large Scale Computer Vision Systems 2016 (best poster award)


Outline
1. Recognition
2. Optical Flow
3. Object Tracking

Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).

Optical Flow: FlowNet
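Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015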

End to end supervised learning of optical flow.

Optical Flow: FlowNet (contracting)
Option A: Stack both input images together and feed them through a generic
network.

Option B: Create two separate, yet identical processing streams for the two images
and combine them at a later stage.

Correlation layer:
Compares patches from the two feature maps to combine: the data from one stream is convolved with the data from the other stream, instead of with a learned filter.
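A correlation layer can be sketched as follows, under simplifying assumptions: patches are reduced to a single position (so each comparison is a dot product across channels), displacements are limited to `max_disp` pixels in each direction, and positions outside the map contribute zero. The function name and the toy feature maps are hypothetical.

```python
def correlation(f1, f2, max_disp=1):
    """Dot product between f1 at (y, x) and f2 at (y + dy, x + dx).

    f1, f2: feature maps as [C][H][W] nested lists.
    Returns one [H][W] plane per displacement, ordered by dy and then dx.
    """
    channels, height, width = len(f1), len(f1[0]), len(f1[0][0])
    planes = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            plane = [[0.0] * width for _ in range(height)]
            for y in range(height):
                for x in range(width):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < height and 0 <= x2 < width:
                        plane[y][x] = sum(f1[c][y][x] * f2[c][y2][x2]
                                          for c in range(channels))
            planes.append(plane)
    return planes

# A single active feature that moves one pixel to the right between frames.
first = [[[0.0, 1.0, 0.0]]]
second = [[[0.0, 0.0, 1.0]]]
corr = correlation(first, second, max_disp=1)
```

With `max_disp=1` there are nine displacement planes; the plane for (dy=0, dx=+1) is at index 5 and is the one that responds at the position of the moving feature.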

Optical Flow: FlowNet (expanding)
Upconvolutional layers: Unpooling of feature maps + convolution.
Upconvolved feature maps are concatenated with the corresponding feature maps from the contracting part.
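The unpooling and concatenation steps of one expanding stage can be sketched as below. The sketch assumes nearest-neighbour 2× unpooling and omits the convolution that follows it; function names and toy data are hypothetical.

```python
def unpool2x(feature_map):
    """Nearest-neighbour 2x upsampling of a single [H][W] feature map."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def expand_step(decoder_maps, encoder_maps):
    """Upsample decoder channels and stack the matching encoder channels."""
    upsampled = [unpool2x(m) for m in decoder_maps]
    same_size = (len(upsampled[0]) == len(encoder_maps[0])
                 and len(upsampled[0][0]) == len(encoder_maps[0][0]))
    if not same_size:
        raise ValueError("encoder and upsampled decoder maps differ in size")
    return upsampled + encoder_maps  # channel-wise concatenation

coarse = [[[5.0]]]                                  # 1 channel, 1x1
skip = [[[1.0, 2.0], [3.0, 4.0]],                   # 2 channels, 2x2
        [[0.0, 0.0], [0.0, 0.0]]]
merged = expand_step(coarse, skip)
```

The concatenation gives the next layer both the coarse, high-level information and the finer detail kept in the contracting part.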

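Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015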
Since existing ground truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is
generated and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in
brightness, contrast, gamma and color).
ConvNets trained on this unrealistic data generalize well to existing datasets such as Sintel and KITTI.
Data augmentation
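The photometric part of the augmentation can be sketched as below, assuming grayscale images with intensities in [0, 1]. The parameter ranges are illustrative guesses rather than the values used by the authors, and all names are hypothetical. Geometric transformations are left out because they would also require transforming the ground-truth flow field.

```python
import random

def clamp01(value):
    """Keep an intensity inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))

def sample_params(rng):
    """Draw one set of photometric parameters (ranges are assumptions)."""
    return {
        "brightness": rng.uniform(-0.1, 0.1),
        "contrast": rng.uniform(0.8, 1.2),
        "gamma": rng.uniform(0.7, 1.5),
        "noise_sigma": rng.uniform(0.0, 0.04),
    }

def augment(image, params, rng):
    """Apply contrast, brightness, gamma and additive Gaussian noise."""
    out = []
    for row in image:
        new_row = []
        for v in row:
            v = clamp01((v - 0.5) * params["contrast"] + 0.5 + params["brightness"])
            v = v ** params["gamma"]
            v = clamp01(v + rng.gauss(0.0, params["noise_sigma"]))
            new_row.append(v)
        out.append(new_row)
    return out

rng = random.Random(0)
params = sample_params(rng)  # the same parameters are shared by both frames
frame_a = [[0.2, 0.5], [0.7, 0.9]]
frame_b = [[0.25, 0.45], [0.65, 0.95]]
aug_a = augment(frame_a, params, rng)
aug_b = augment(frame_b, params, rng)
```

Sharing one set of parameters between the two frames of a pair keeps their appearance consistent, so the ground-truth flow remains valid.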

Outline
1. Recognition
2. Optical Flow
3. Object Tracking

Object tracking: Deep but not CNN
Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking."
NIPS 2013.
Offline learning: Robust and generic features are learned by training a stacked denoising autoencoder on
auxiliary images.
Online learning: Encoder part of the autoencoder + classification neural network

Object tracking: MDNet
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)

Object tracking: MDNet: Architecture
Domain-specific layers are used during training for each sequence, but are replaced by a single one at test
time.

Object tracking: MDNet: Online update
MDNet is updated online at test time with hard negative mining, that is, selecting the negative samples with the highest positive score.
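Hard negative mining can be sketched as a top-k selection, assuming each negative sample already has a positive-class score from the current network. The function name, the sample labels and the scores are hypothetical.

```python
def hard_negatives(negative_samples, positive_scores, k):
    """Return the k negatives that the network scores highest as positive."""
    order = sorted(range(len(negative_samples)),
                   key=lambda i: positive_scores[i],
                   reverse=True)
    return [negative_samples[i] for i in order[:k]]

# Toy example: four background boxes with their (made-up) positive scores.
boxes = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.4, 0.7]
selected = hard_negatives(boxes, scores, k=2)
```

Training on these confusing negatives concentrates the online update on distractors that are most likely to be mistaken for the target.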

Object tracking: FCNT
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]

Object tracking: FCNT
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
Focus on conv4-3 and conv5-3 of VGG-16 network pre-trained for ImageNet image classification.
conv4-3 conv5-3

Object tracking: FCNT: Specialization
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking
sequence.

Object tracking: FCNT: Localization
Although trained for image classification, feature maps in conv5-3 enable object localization…
...but they are not discriminative enough to distinguish different objects of the same category.

Object tracking: Localization
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015.
Other works have shown how feature maps in convolutional layers allow object localization.

Object tracking: FCNT: Localization
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3

Object tracking: FCNT: Architecture
SNet=Specific Network (online update)
GNet=General Network (fixed)

Object tracking: FCNT: Results

Object tracking: ROLO
Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural
Networks for Visual Object Tracking." IEEE International Symposium on Circuits and Systems, 2017

Object tracking: ROLO
Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural
Networks for Visual Object Tracking." arXiv preprint arXiv:1607.05781 (2016)

Object tracking: CFNet
Valmadre, Jack, Luca Bertinetto, João F. Henriques, Andrea Vedaldi, and Philip HS Torr. "End-to-end
representation learning for Correlation Filter based tracking." CVPR 2017
