The document provides an overview of deep learning techniques for video analysis: recognition, optical flow, and object tracking. For recognition, it covers convolutional neural networks such as DeepVideo, which classify a video from its frames; two-stream networks, which add optical flow as a second input; and 3D CNNs such as C3D, which learn spatiotemporal features directly. For optical flow, it summarizes FlowNet, a CNN trained end-to-end to predict optical flow. For object tracking, it covers deep learning methods such as MDNet, which uses a domain-specific layer for each training sequence and replaces these with a single layer at test time.
5. Recognition
Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
6. Activity Recognition: Datasets
(Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project]
7. Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
8. Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
9. Recognition: DeepVideo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
Slides extracted from ReadCV seminar by Victor Campos
10. Recognition: DeepVideo: Demo
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
11. Recognition: DeepVideo: Architectures
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
12. Recognition: DeepVideo: Features
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
13. Recognition: DeepVideo: Multiscale
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
14. Recognition: DeepVideo: Results
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014
15. Activity Recognition: Frames + LSTM
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
16. Activity Recognition: Frames + Optical Flow + LSTM
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
17. Recognition
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
18. Recognition: Two stream
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.
Two CNNs in parallel:
● One for RGB images
● One for Optical flow (hand-crafted features)
Fusion after the softmax layer
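A minimal sketch of fusion after the softmax layer, assuming each stream outputs one logit per class and the two probability vectors are combined by a weighted average. Averaging is only one possible fusion rule; the names `late_fusion` and `flow_weight` and the toy logits are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Combine the two streams after their softmax layers (weighted average)."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


# Toy example with three classes: the RGB stream prefers class 0,
# the optical flow stream prefers class 1 more strongly.
scores = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.2])
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Because fusion happens on class scores, the two networks can be trained independently; only their outputs need to agree on the class ordering.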
19. Recognition: Two stream
Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Two CNNs in parallel:
● One for RGB images
● One for Optical flow (hand-crafted features)
Fusion at a convolutional layer
20. Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
21. Recognition: C3D
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
22. Recognition: C3D: Demo
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
23. Recognition: C3D: Temporal dimension
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.
[Figure labels: temporal depth; 2D ConvNets]
24. Recognition: C3D: Temporal dimension
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets.
25. Recognition: C3D: Architecture
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
[Figure label: feature vector]
26. Recognition: C3D: Feature vector
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
[Figure: 16-frame clips → average → 4096-dim video descriptor → L2 norm → 4096-dim video descriptor]
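As a rough sketch of the pooling on this slide, assume each 16-frame clip has already been mapped to a feature vector by the network. The per-clip vectors are averaged and the result is L2-normalized. The function name `video_descriptor` and the toy 3-dimensional features are illustrative; the descriptors on the slide are 4096-dimensional.

```python
import math


def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the mean."""
    num_clips = len(clip_features)
    dim = len(clip_features[0])
    mean = [sum(clip[i] for clip in clip_features) / num_clips for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    if norm == 0.0:
        return mean  # all-zero features: nothing to normalize
    return [v / norm for v in mean]


# Two toy clips with 3-dim features; their mean is [3, 0, 4], whose L2 norm is 5.
clips = [[1.0, 0.0, 2.0], [5.0, 0.0, 6.0]]
descriptor = video_descriptor(clips)
```

Averaging gives a fixed-length descriptor regardless of how many clips the video contains.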
30. Recognition: Localization
(Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
31. Recognition: Localization
(Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
32. Recognition: Localization
(Slidecast and Slides by Alberto Montes) Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016.
35. Temporal Activity Detection
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015
3D convolutions over sets of 16 frames...
43. Temporal Activity Detection
BSc thesis
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016 (best poster award)
44. Temporal Activity Detection
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016 (best poster award)
46. Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
47. Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015
48. Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
End-to-end supervised learning of optical flow.
49. Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
Option A: Stack both input images together and feed them through a generic network.
50. Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
51. Optical Flow: FlowNet (contracting)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
Correlation layer: combines the two streams by convolving patches from one feature map with patches from the other, instead of with a learned filter, so the layer has no trainable weights.
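The correlation idea can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes 1 × 1 patches (a dot product between feature vectors), zero padding at the borders, and a small displacement range `max_disp`. The function name and parameters are hypothetical.

```python
import numpy as np


def correlation(f1, f2, max_disp=1):
    """Correlate two feature maps of shape (channels, height, width).

    Returns an array of shape ((2 * max_disp + 1) ** 2, height, width):
    one output channel per displacement (dy, dx), holding the dot product
    between f1 at (y, x) and f2 at (y + dy, x + dx).
    """
    _, height, width = f1.shape
    d = max_disp
    # Zero-pad f2 spatially so that shifted windows stay inside the array.
    padded = np.pad(f2, ((0, 0), (d, d), (d, d)), mode="constant")
    out = np.zeros(((2 * d + 1) ** 2, height, width), dtype=f1.dtype)
    index = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[:, d + dy:d + dy + height, d + dx:d + dx + width]
            out[index] = (f1 * shifted).sum(axis=0)
            index += 1
    return out


# Toy example: a single active feature moves one pixel to the right.
f1 = np.zeros((1, 5, 5))
f1[0, 2, 2] = 1.0
f2 = np.zeros((1, 5, 5))
f2[0, 2, 3] = 1.0
corr = correlation(f1, f2, max_disp=1)
best = int(corr[:, 2, 2].argmax())  # displacement channel with the strongest match
```

In this toy case the strongest response at position (2, 2) falls in the channel for displacement (dy = 0, dx = +1), which matches the motion between the two inputs.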
52. Optical Flow: FlowNet (expanding)
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
Upconvolutional layers: unpooling of feature maps + convolution.
Upconvolved feature maps are concatenated with the corresponding feature maps from the contracting part.
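One expanding step can be sketched as below. The sketch assumes nearest-neighbour 2× unpooling and a 1 × 1 convolution for brevity; the kernel size, the upsampling factor, and all function names are illustrative assumptions rather than details taken from the slide.

```python
import numpy as np


def unpool2x(x):
    """Nearest-neighbour 2x unpooling of a (channels, height, width) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def conv1x1(x, weights):
    """1x1 convolution: mixes channels with a (out_channels, in_channels) matrix."""
    return np.einsum("oc,chw->ohw", weights, x)


def expand_step(coarse, skip, weights):
    """Upconvolve the coarse map, then concatenate the skip map along channels."""
    upconvolved = conv1x1(unpool2x(coarse), weights)
    return np.concatenate([upconvolved, skip], axis=0)


# Toy shapes: 4 coarse channels at 3x3, 2 skip channels at 6x6.
coarse = np.ones((4, 3, 3))
skip = np.zeros((2, 6, 6))
weights = np.full((3, 4), 0.25)
out = expand_step(coarse, skip, weights)
```

The concatenation lets the expanding part reuse fine spatial detail from the contracting part alongside the coarser, upsampled features.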
53. Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. ICCV 2015
Since existing ground truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated and augmented (translation, rotation, and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma, and color).
ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.
[Figure label: data augmentation]
54. Optical Flow: FlowNet
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
56. Object tracking: Deep but not CNN
Wang, Naiyan, and Dit-Yan Yeung. "Learning a deep compact image representation for visual tracking." NIPS 2013.
Offline learning: Robust and generic features are learned by training a stacked denoising autoencoder on auxiliary images.
Online learning: Encoder part of the autoencoder + classification neural network.
57. Object tracking: MDNet
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
58. Object tracking: MDNet: Architecture
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
During training, each sequence has its own domain-specific layer; at test time these are replaced by a single new layer.
59. Object tracking: MDNet: Online update
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
MDNet is updated online at test time with hard negative mining, that is, selecting the negative samples with the highest positive scores.
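The selection step of hard negative mining can be sketched as follows, assuming the current network has already assigned a positive-class score to every negative candidate. The function name `mine_hard_negatives`, the parameter `num_keep`, and the scores are hypothetical.

```python
def mine_hard_negatives(negative_scores, num_keep):
    """Return indices of the negatives with the highest positive scores."""
    ranked = sorted(
        range(len(negative_scores)),
        key=lambda i: negative_scores[i],
        reverse=True,
    )
    return ranked[:num_keep]


# Positive-class scores of five negative (background) candidates.
scores = [0.05, 0.80, 0.10, 0.65, 0.30]
hard = mine_hard_negatives(scores, num_keep=2)
```

Here candidates 1 and 3 are selected: they are the negatives the tracker is most likely to confuse with the target, so training on them is more informative than training on easy background samples.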
61. Object tracking: FCNT
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
Focus on conv4-3 and conv5-3 of VGG-16 network pre-trained for ImageNet image classification.
conv4-3 conv5-3
62. Object tracking: FCNT: Specialization
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking
sequence.
63. Object tracking: FCNT: Localization
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
Although trained for image classification, feature maps in conv5-3 enable object localization…
...but they are not discriminative enough to distinguish different objects of the same category.
64. Object tracking: Localization
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015.
Other works have shown how feature maps in convolutional layers allow object localization.
65. Object tracking: FCNT: Localization
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3
66. Object tracking: FCNT: Architecture
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
SNet=Specific Network (online update)
GNet=General Network (fixed)
67. Object tracking: FCNT: Results
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
68. Object tracking: ROLO
Ning, Guanghan, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, and Haohong Wang. "Spatially Supervised Recurrent Convolutional Neural
Networks for Visual Object Tracking." IEEE International Symposium on Circuits and Systems, 2017
70. Object tracking: CFNet
Valmadre, Jack, Luca Bertinetto, João F. Henriques, Andrea Vedaldi, and Philip HS Torr. "End-to-end
representation learning for Correlation Filter based tracking." CVPR 2017