These slides summarize the main trends in deep neural networks for video encoding, including single-frame models, spatiotemporal convolutions, long-term sequence modeling with RNNs, and their combination with optical flow.
Neural Architectures for Video Encoding
1. [http://pagines.uab.cat/mcv/]
Module 6 - Day 2 - Lecture 1
Neural Architectures for
Video Encoding
27th February 2020
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
3. Video lectures
Víctor Campos
Deep Learning for Computer Vision 2018
UPC TelecomBCN
Xavier Giro-i-Nieto
Deep Learning for Computer Vision 2019
UPC TelecomBCN
4.
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
5. Motivation
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
6. What is a video?
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image. We can understand
videos as sequences of images (a.k.a. frames)
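A minimal sketch of this view of a video as a 3D signal, assuming a grayscale clip stored as a NumPy array indexed as `video[t, y, x]` (the array sizes and the axis order are illustrative assumptions, not something the slides prescribe):

```python
import numpy as np

# Hypothetical grayscale clip: 8 frames of 4 x 6 pixels, indexed as video[t, y, x].
T, H, W = 8, 4, 6
video = np.arange(T * H * W, dtype=np.float32).reshape(T, H, W)

# Fixing the temporal coordinate t yields a single image (frame).
frame = video[3]

# Fixing a pixel (y, x) instead yields a 1D temporal signal.
pixel_over_time = video[:, 2, 5]

print(frame.shape, pixel_over_time.shape)
```

Indexing along the first axis walks through the frames in temporal order, which is the "sequence of images" view used in the rest of the slides.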
7. How do we work with images?
Convolutional Neural Networks (CNNs) provide state-of-the-art performance on still image analysis tasks.
8. How do we work with videos?
How can we extend CNNs to image sequences?
9. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
11. Single frame models
Figure: one CNN per frame (CNN, CNN, CNN, ...) followed by a combination method.
Combination is commonly implemented as a small NN on top of a temporal pooling operation (e.g. max, average).
Problem: pooling is not aware of the temporal order!
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015.
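The order problem can be checked numerically. The sketch below assumes the per-frame CNN outputs are already available as a small random matrix (a stand-in, not real features) and shows that average and max pooling give the same descriptor when the frames are played backwards:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-frame CNN features: 5 frames, one 4-dim feature vector each.
frame_features = rng.normal(size=(5, 4))

def temporal_pool(features, mode="average"):
    """Collapse the time axis (axis 0) into one video-level descriptor."""
    if mode == "average":
        return features.mean(axis=0)
    if mode == "max":
        return features.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

reversed_features = frame_features[::-1]  # same frames, reversed temporal order
same_avg = np.allclose(temporal_pool(frame_features),
                       temporal_pool(reversed_features))
same_max = np.allclose(temporal_pool(frame_features, "max"),
                       temporal_pool(reversed_features, "max"))
print(same_avg, same_max)
```

Any classifier placed on top of such a descriptor cannot tell a video from its time-reversed version.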
12. Single frame models
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
15. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow (two-stream CNN)
16. 3D CNN (C3D)
We can add an extra dimension to standard 2D CNN filters:
● A grayscale video clip may be represented by a tensor of size H x W x L (height x width x number of frames).
● A 3D conv filter for the 1st conv layer may be of size k x k x d (spatial size k, temporal depth d).
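As an illustration of these shapes, here is a naive NumPy sketch of a single 3D filter sliding over a grayscale clip. It assumes no padding and stride 1, and the function name `conv3d_valid` is made up for this example; real frameworks use much faster implementations.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of an H x W x L clip with a k x k x d filter."""
    H, W, L = clip.shape
    k_h, k_w, d = kernel.shape
    out = np.zeros((H - k_h + 1, W - k_w + 1, L - d + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            for t in range(out.shape[2]):
                # The filter covers a spatial window AND d consecutive frames.
                out[y, x, t] = np.sum(clip[y:y + k_h, x:x + k_w, t:t + d] * kernel)
    return out

clip = np.ones((8, 8, 16))        # grayscale clip, H x W x L
kernel = np.ones((3, 3, 3)) / 27  # k x k x d averaging filter
features = conv3d_valid(clip, kernel)
print(features.shape)
```

Unlike a 2D filter, the output keeps a temporal axis, so later layers can continue to model motion.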
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.
17. 3D CNN (C3D) + temporal pooling
The video needs to be split into chunks (also known as clips) with a number of frames that fits the receptive field of the C3D (e.g. 16 frames).
Figure: 16-frame clips → average pooling → 4096-dim video descriptor → L2 norm → 4096-dim video descriptor.
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.
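The clip-level pipeline can be sketched as follows. This is a simplified illustration: the clips are assumed to be non-overlapping, a trailing partial clip is dropped, and the C3D network is replaced by random 4096-dim vectors, so only the splitting, averaging and L2 normalization steps are real.

```python
import numpy as np

def split_into_clips(num_frames, clip_len=16):
    """(start, end) frame indices of non-overlapping clips; a trailing partial clip is dropped."""
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]

def video_descriptor(clip_descriptors):
    """Average the clip descriptors over time and L2-normalize the result."""
    mean = np.mean(clip_descriptors, axis=0)
    return mean / np.linalg.norm(mean)

clips = split_into_clips(num_frames=50)

# Stand-in for the C3D network: one random 4096-dim vector per clip.
rng = np.random.default_rng(0)
clip_descriptors = rng.normal(size=(len(clips), 4096))

descriptor = video_descriptor(clip_descriptors)
print(clips, descriptor.shape)
```

The resulting descriptor has a fixed size regardless of the video length, which is what makes it usable by a standard classifier.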
18. 3D CNN (C3D)
Much larger datasets (e.g. Kinetics) allow training much deeper 3D CNNs.
#3D ResNet Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?" CVPR 2018. [code]
20. Inflated 3D weights (I3D)
We can adapt 2D convolution (C2D) filters to 3D convolution (C3D) filters by replicating them d times along the temporal axis and scaling their values by 1/d.
Figure: inflation of a k x k C2D filter into a k x k x d I3D filter.
Filter inflation allows reusing C2D pre-trained filters (e.g. from ImageNet) to initialize C3D filters.
#I3D Carreira, J., & Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. CVPR 2017. [code]
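A small NumPy sketch of filter inflation, assuming a single-channel k x k filter (the helper name `inflate` is made up for this example). It also checks why the 1/d scaling is used: on a static clip made of one repeated image, the inflated filter gives the same response as the original 2D filter.

```python
import numpy as np

def inflate(filter_2d, d):
    """Replicate a k x k filter d times along a new temporal axis and scale by 1/d."""
    return np.repeat(filter_2d[:, :, None], d, axis=2) / d

rng = np.random.default_rng(0)
filter_2d = rng.normal(size=(3, 3))
filter_3d = inflate(filter_2d, d=5)  # shape (3, 3, 5)

# A static "video": the same 3 x 3 image patch repeated over 5 frames.
patch = rng.normal(size=(3, 3))
static_clip = np.repeat(patch[:, :, None], 5, axis=2)

response_2d = np.sum(patch * filter_2d)
response_3d = np.sum(static_clip * filter_3d)
print(np.isclose(response_2d, response_3d))
```

Without the 1/d factor the 3D response would be d times larger, so activations inherited from the 2D network would no longer be on the scale that the pre-trained layers expect.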
21. Residual Learning
Residual learning: reformulate the layers as learning a residual function F(x) = H(x) - x with respect to the identity, instead of learning the unreferenced function H(x). The block output is then F(x) + x.
Figure: Plain Net vs. Residual Net.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR 2016. [slides]
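The difference between the two blocks can be sketched with two tiny fully connected layers. This is a toy illustration with made-up function names, not the convolutional block of the cited paper. With all-zero weights the residual branch outputs F(x) = 0, so the residual block passes its input through while the plain block loses it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """Plain net: the two layers must learn the full mapping H(x)."""
    return relu(w2 @ relu(w1 @ x))

def residual_block(x, w1, w2):
    """Residual net: the two layers learn F(x); the shortcut adds x back."""
    return relu(w2 @ relu(w1 @ x) + x)

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) = 0: the residual block reduces to the identity
# (for non-negative x), while the plain block outputs zeros.
out_residual = residual_block(x, zeros, zeros)
out_plain = plain_block(x, zeros, zeros)
print(out_residual, out_plain)
```

This is one intuition for why very deep residual networks are easier to optimize: a block that has nothing useful to add can fall back to the identity.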
22. 3D CNN + Residual (Res3D)
Figure: 2D ResNet vs. Res3D.
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
24. 3D CNN + Residual (Res3D)
The gain in performance comes at the cost of much more computation and a larger memory footprint.
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
25. Mixed Convolutions (MC & rMC)
Mixed Convolutions (MC)
● Early layers: 3D convolutions
● Top layers: 2D convolutions
Reversed Mixed Convolutions (rMC)
● Early layers: 2D convolutions
● Top layers: 3D convolutions
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
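The two layouts can be written down as a per-group plan of convolution types. The helper below is hypothetical: the number of layer groups and the position of the switch are free choices made for illustration, not values taken from the slides.

```python
def mixed_conv_plan(num_groups, switch, reversed_order=False):
    """Convolution type for each layer group.

    `switch` is the (1-based) index of the first group that uses the second
    convolution type. MC goes 3D -> 2D, rMC goes 2D -> 3D.
    """
    first, second = ("2D", "3D") if reversed_order else ("3D", "2D")
    return [first if g < switch else second for g in range(1, num_groups + 1)]

mc_plan = mixed_conv_plan(num_groups=5, switch=3)
rmc_plan = mixed_conv_plan(num_groups=5, switch=3, reversed_order=True)
print(mc_plan)
print(rmc_plan)
```

Moving the switch earlier or later trades temporal modeling capacity against the extra cost of 3D convolutions.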
26. Mixed Convolutions (MC & rMC)
MC ResNets:
● 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity
● match the performance of 3D ResNets, which have 3 times as many parameters
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
27. C(2+1)D
Splits the computation into a spatial 2D convolution followed by a temporal 1D convolution.
Figure (input with a single feature channel): C3D vs. C(2+1)D, annotated with the number of 2D filters.
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
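The number of 2D filters is the free parameter of this factorization. The sketch below assumes it is chosen so that the (2+1)D block has about the same number of weights as the t x d x d 3D convolution it replaces, which leads to M = floor(t·d²·N_in·N_out / (d²·N_in + t·N_out)). The function names and the example sizes are illustrative, and biases are ignored.

```python
def params_3d(n_in, n_out, t, d):
    """Weights of n_out 3D filters of size n_in x t x d x d."""
    return n_in * t * d * d * n_out

def mid_channels(n_in, n_out, t, d):
    """Number M of 2D filters that keeps the parameter count close to the 3D case."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_2plus1d(n_in, n_out, t, d):
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * d * d * m  # m spatial filters of size n_in x 1 x d x d
    temporal = m * t * n_out    # n_out temporal filters of size m x t x 1 x 1
    return spatial + temporal

n_in, n_out, t, d = 64, 64, 3, 3
print(mid_channels(n_in, n_out, t, d))
print(params_3d(n_in, n_out, t, d), params_2plus1d(n_in, n_out, t, d))
```

With a matched parameter budget, the (2+1)D block still contains an extra nonlinearity between the spatial and the temporal convolution.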
28. C(2+1)D + Residual
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
29. (2+1)D CNN + Residual
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
30. Separable 2D (group) convolutions
Depth-wise (in channels) separable convolutions.
#Xception Chollet, François. "Xception: Deep learning with depthwise separable convolutions." CVPR 2017.
Figure: Sik-Ho Tsang
31. Res3D + Separable (group) convolutions
Channel interactions and spatiotemporal interactions are separated.
Figure: group convolutions with 1 group, 2 groups, and # groups = # input channels.
#CSN D. Tran, H. Wang, L. Torresani and M. Feiszli. Video Classification with Channel-Separated Convolutional Networks. ICCV 2019. [Caffe2]
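A back-of-the-envelope count shows what grouping buys. The sketch assumes 64 input and output channels, 3 x 3 x 3 filters and no biases. The last case pairs a 1 x 1 x 1 convolution (channel interactions only) with a depth-wise 3 x 3 x 3 convolution (spatiotemporal interactions only), which is one way to realize the separation described above.

```python
def conv3d_params(c_in, c_out, k, groups=1):
    """Weights of a k x k x k convolution with `groups` channel groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k ** 3

c = 64
standard = conv3d_params(c, c, k=3, groups=1)    # every filter sees all channels
two_groups = conv3d_params(c, c, k=3, groups=2)  # every filter sees half the channels
depthwise = conv3d_params(c, c, k=3, groups=c)   # every filter sees one channel
pointwise = conv3d_params(c, c, k=1)             # 1 x 1 x 1: channel mixing only
channel_separated = pointwise + depthwise

print(standard, two_groups, depthwise, channel_separated)
```

In this setting the separated pair needs roughly 19 times fewer weights than the standard 3D convolution.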
32. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
33. Recurrent Neural Network (RNN)
The hidden layers and the output depend on previous states of the hidden layers.
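A minimal sketch of this recurrence, assuming a vanilla (Elman) RNN with random, untrained weights; the sizes and names are illustrative. Because each hidden state is computed from the previous one, feeding the same inputs in reverse order leads to a different final state.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, w_hy):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}), y_t = W_hy h_t."""
    h = np.zeros(w_hh.shape[0])
    outputs = []
    for x in inputs:  # strictly sequential: step t needs h from step t-1
        h = np.tanh(w_xh @ x + w_hh @ h)
        outputs.append(w_hy @ h)
    return np.array(outputs), h

rng = np.random.default_rng(0)
w_xh = rng.normal(size=(8, 4))
w_hh = rng.normal(size=(8, 8))
w_hy = rng.normal(size=(2, 8))
inputs = rng.normal(size=(5, 4))  # hypothetical sequence of 5 input vectors

outputs, h_forward = rnn_forward(inputs, w_xh, w_hh, w_hy)
_, h_reversed = rnn_forward(inputs[::-1], w_xh, w_hh, w_hy)
order_matters = not np.allclose(h_forward, h_reversed)
print(outputs.shape, order_matters)
```

The loop over time steps is also what prevents parallelizing the computation along the sequence.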
34. 2D CNN + RNN
Figure: one CNN per frame (CNN, CNN, CNN, ...) feeding a chain of RNN steps (RNN, RNN, RNN, ...).
Recurrent Neural Networks are well suited for processing sequences.
Problem: RNNs are sequential, so their computation cannot be parallelized across time steps.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015. [code]
Videolectures on RNNs:
● DLSL 2017, “RNN (I)” and “RNN (II)”
● DLAI 2018, “RNN”
35. 2D CNN + RNN vs Spatio-temporal Convs
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
36. 2D CNN + RNN
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015. [code]
37. 3D CNN
Limitations:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
38. 3D CNN + RNN
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award).
39. 2D CNN + RNN: skip redundancy
Figure: one CNN per frame (CNN, CNN, CNN, ...) feeding a chain of RNN steps (RNN, RNN, RNN, ...).
● After processing a frame, let the RNN decide how many future frames can be skipped.
● In skipped frames, simply copy the output and state from the previous time step.
● There is no ground truth for which frames can be skipped. The RNN learns it by itself during training!
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
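The skipping mechanism can be sketched in plain Python. This is a simplified illustration that assumes the update-gate recurrence described in the cited Skip RNN paper: an update probability is binarized at every step, reset to a newly emitted increment after an update, and accumulated while skipping. The recurrent cell and the learned increment are replaced by hypothetical stand-ins (`cell`, `delta_fn`), so nothing is trained here.

```python
def skip_rnn(inputs, cell, delta_fn, state):
    """Run an RNN that may skip state updates.

    cell(state, x)  -> new state        (hypothetical stand-in for an LSTM/GRU step)
    delta_fn(state) -> float in (0, 1)  (hypothetical stand-in for the learned increment)
    """
    u_tilde = 1.0  # start by processing the first input
    states, used = [], []
    for x in inputs:
        update = u_tilde >= 0.5  # binarize the update probability
        if update:
            state = cell(state, x)
        # Otherwise the previous state is simply copied.
        delta = delta_fn(state)
        if update:
            u_tilde = delta  # restart from the new increment
        else:
            u_tilde = u_tilde + min(delta, 1.0 - u_tilde)  # accumulate while skipping
        states.append(state)
        used.append(update)
    return states, used

states, used = skip_rnn(
    inputs=[1.0] * 7,
    cell=lambda s, x: 0.5 * s + x,
    delta_fn=lambda s: 0.2,  # constant increment: update every third step
    state=0.0,
)
print(used)
print(states)
```

With a constant increment of 0.2 the network updates on one input out of three; in the trained model the increment depends on the state, so the number of skipped frames adapts to the content.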
40. 2D CNN + RNN
Epochs for Skip LSTM (λ = 10^-4):
● 11 epochs: ~30% acc
● 51 epochs: ~50% acc
● 101 epochs: ~70% acc
● 400 epochs: ~95% acc
Figure legend: Used / Unused.
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
41. 2D CNN + RNN: skip redundancy
Figure legend: Used / Unused.
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
42. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
43.
Ilg, Eddy, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017. [code]
44. Optical flow
Popular implementations for optical flow:
● PyFlow
● FlowNet2
● Improved Dense Trajectories - IDT
● Lucas-Kanade (OpenCV)
● Farneback 2003 (OpenCV)
● Middlebury
● Horn and Schunck
● Tikhonov regularized and vectorized
● DeepFlow (2013)
● (...)
#Coarse2Fine Brox, Thomas, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. "High accuracy optical flow estimation based on a theory for warping." ECCV 2004.
45. Optical flow
Deep neural networks have also been trained to predict optical flow:
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015.
46. Two-stream 2D CNNs
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a CNN.
Figure: fusion of the appearance and motion streams.
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.
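A sketch of the two ingredients, under stated assumptions: the optical flow is taken as precomputed (zeros are used as a placeholder), the flow fields of L = 10 frame pairs are stacked along the channel axis, and the two streams are fused late by averaging their softmax scores. Averaging is only one possible fusion rule, and the class scores below are made up.

```python
import numpy as np

def stack_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) network input."""
    return np.concatenate(flows, axis=2)

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def late_fusion(spatial_scores, temporal_scores):
    """Average the class posteriors of the appearance and motion streams."""
    return 0.5 * (softmax(spatial_scores) + softmax(temporal_scores))

# Placeholder for precomputed optical flow: L = 10 fields of 224 x 224 (dx, dy) vectors.
flows = [np.zeros((224, 224, 2)) for _ in range(10)]
motion_input = stack_flow(flows)

# Made-up class scores from each stream for a 3-class problem.
fused = late_fusion(np.array([2.0, 1.0, 0.1]), np.array([0.5, 2.5, 0.3]))
print(motion_input.shape, fused.argmax())
```

In this toy case the appearance stream alone would pick class 0, while the fused prediction follows the more confident motion stream and picks class 1.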
47. Two-stream 2D CNNs
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a CNN.
Figure: fusion of the appearance and motion streams.
Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
48. Two-stream 2D CNNs + Residual
Problem: Single frame models do not take into account motion in videos.
Solution: Extract optical flow for a stack of frames and feed its features into the appearance stream.
Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition." NIPS 2016. [code]
49. Two-stream + Inflated 3D
#I3D Carreira, J., & Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. CVPR 2017. [code]
50. Two-stream 3D CNNs
#I3D Carreira, J., & Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. CVPR 2017. [code]
51. Two-stream 2D CNNs + RNN
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015.
52. One stream RGB + Optical Flow
Leveraging the motion condition to modulate RGB features improves detection accuracy.
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
53. One stream RGB + Optical Flow
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
54. One stream RGB + Optical Flow
2in1 has:
● half the computation and parameters of a two-stream equivalent
● better action detection accuracy (but worse action classification accuracy)
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
55. Two-stream (feature visualization)
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "What have we learned from deep representations for action recognition?" CVPR 2018.
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks for Video Recognition." IJCV 2019.
56. Replacing C3D with C2D
#S3D Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. ECCV 2018.
57. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
58. Dual slow and fast frame rates
● Slow pathway: 2 fps, high capacity model
● Fast pathway: 16 fps, low capacity model
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "SlowFast networks for video recognition." ICCV 2019. [blog] [code]
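The two frame rates on this slide translate into two different temporal strides over the same video. The sketch below assumes a hypothetical 2-second clip recorded at 32 fps; only the 2 fps and 16 fps targets come from the slide, and `sample_indices` is a made-up helper.

```python
def sample_indices(num_frames, video_fps, target_fps):
    """Indices of the frames kept when subsampling a video to `target_fps`."""
    stride = video_fps // target_fps
    return list(range(0, num_frames, stride))

video_fps, num_frames = 32, 64  # hypothetical 2-second clip

# Few frames for the high capacity model, many frames for the low capacity one.
slow = sample_indices(num_frames, video_fps, target_fps=2)
fast = sample_indices(num_frames, video_fps, target_fps=16)

print(slow)
print(len(fast), len(fast) // len(slow))
```

The low frame rate input sees 8 times fewer frames, which is what leaves room for a higher capacity model on that side while the other model stays lightweight.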
59. Dual slow and fast frame rates
Figure: comparison with the state of the art on Kinetics-400.
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "SlowFast networks for video recognition." ICCV 2019. [blog] [code]
60. Dual slow and fast frame rates
#AVSlowFast Xiao, Fanyi, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. "Audiovisual SlowFast Networks for Video Recognition." arXiv preprint arXiv:2001.08740 (2020).
61. Deep Video Architectures
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow (two-stream CNN)
5. Dual slow and fast frame rates
6. Capsule Networks