Neural Architectures for Video Encoding

[http://pagines.uab.cat/mcv/]
Module 6 - Day 2 - Lecture 1
Neural Architectures for
Video Encoding
27th February 2020
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center

Acknowledgements
2
Víctor Campos Alberto MontesAmaia Salvador Santiago Pascual

Video lectures
3
Víctor Campos
Deep Learning for Computer Vision 2018
UPC TelecomBCN
Xavier Giro-i-Nieto
Deep Learning for Computer Vision 2019
UPC TelecomBCN

4
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale
video classiﬁcation with convolutional neural networks. CVPR 2014

Motivation
5
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classiﬁcation
with convolutional neural networks. CVPR 2014.

What is a video?
6
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image. We can understand
videos as sequences of images (a.k.a. frames)

Convolutional Neural Networks (CNN) provide state of the
art performance on still image analysis tasks
How do we work with images?
7

How can we extend CNNs to image sequences?
How do we work with videos ?
8
f ŷ

9
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures

10
3. CNN + RNN
6. Capsule Networks

Single frame models
11
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a temporal pooling
operation (e.g. max, average).
Problem: pooling is not aware
of the temporal order!
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
ŷ

12
Single frame models

13
Single frame models

14
Single frame models

15
3. CNN + RNN
4. RGB + Optical Flow Two-stream CNN

3D CNN (C3D)
16
We can add an extra dimension to standard 2D CNN filters:
● A grayscale video clip may be represented by a tensor of size H x W x L.
● A 3D conv filter for the 1st conv layer may be of size k x k x d
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features
with 3D convolutional networks" ICCV 2015
time

17
The video needs to be split into chunks (also known as clips) with a number of
frames that fits the receptive field of the C3D (eg. 16 frames).
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features
with 3D convolutional networks" ICCV 2015
16-frame clip
16-frame clip
16-frame clip
...
Average
pooling
4096-dimvideodescriptor
4096-dimvideodescriptor
L2 norm
3D CNN (C3D) + temporal pooling

18
3D CNN (C3D)
#3D Resnet Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3d cnns retrace the history of 2d
cnns and imagenet?." CVPR 2018. [code]
Much larger datasets (eg. Kinetics) allow training much deeper C3D CNNs.

19
Limitation:
● How can we use pre-trained 2D networks to initialize C3D training ?
3D CNN

20
Inflated 3D weights (I3D)
We can adapt C2D filters to C3D by replicating them along the temporal axis
(and scaling its values by 1/d).
inflation
#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
k x k
C2D filter I3D filter
k x k x d
Filter inflation allows reusing C2D pre-trained filters (eg. from ImageNet) to
initialize C3D filters.

21
Residual Learning
Residual learning: reformulate the layers as learning residual functions with
respect to the identity F(x), instead of learning unreferenced functions H(x)
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR 2016. [slides]
Plain Net Residual Net

22
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture
search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
3D CNN + Residual (Res3D)
2D ResNet Res3D

23

24
Gain in performance comes with the cost of much more computation and larger
memory footprint.

25
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer
look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
Mixed Convolutions (MC & rMC)
Mixed Convolutions (MC)
● Early layers: 3D convolutions
● Top layers: 2D convolutions
Reversed Mixed Convolutions (rMC)
● Early layers: 2D convolutions
● Top layers: 3D convolutions

26
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer
look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
MC ResNets:
● 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity
● match the performance of 3D ResNets, which have 3 times as many parameters.
Mixed Convolutions (MC & rMC)

27
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A
closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
C(2+1)D
Splits the computation into a spatial 2D convolution followed by a temporal 1D
convolution:. In the case of an input of a single feature channel:
C3D C(2+1)D
Number of 2D ﬁlters

28
C(2+1D) + Residual

29
2+1D CNN + Residual

30
Separable 2D (group) convolutions
Depth-wise (in channels) separable convolutions.
#XCeption Chollet, François. "Xception: Deep learning with depthwise separable convolutions." CVPR 2017.
Figure: Sik-Ho Tsang

31
Res3D + Separable (group) convolutions
#CSN D. Tran, H. Wang, L. Torresani and M. Feiszli. Video Classiﬁcation with Channel-Separated
Convolutional Networks. ICCV 2019. [Caﬀe2]
Channel interactions and spatiotemporal interactions are separated.
1 group 2 groups
# groups =
# input channels

32
3. CNN + RNN
6. Capsule Networks

33
The hidden layers and the
output depend from previous
states of the hidden layers
Recurrent Neural Network (RNN)

2D CNN + RNN
34
CNN CNN CNN...
Recurrent Neural Networks
are well suited for processing
sequences.
Problem: RNNs are sequential
and cannot be parallelized.
RNN RNN RNN...
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Videolectures on RNNs:
DLSL 2017, “RNN (I)”
“RNN (II)”
DLAI 2018, “RNN”

35
2D CNN + RNN vs Spatio-temporal Convs

36
Jeﬀrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
2D CNN + RNN

37
Limitations:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
3D CNN

38
3D CNN + RNN
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed
Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)

39
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip
RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
2D CNN + RNN: skip redundancy
CNN CNN CNN...
RNN RNN RNN... After processing a frame, let
the RNN decide how many
future frames can be skipped
In skipped frames, simply
copy the output and state
from the previous time step
There is no ground truth for
which frames can be skipped.
The RNN learns it by itself
during training!

40
2D CNN + RNN
Epochs for Skip LSTM (λ = 10-4
)
~30% acc ~50% acc ~70% acc ~95% acc
11 epochs 51 epochs 101 epochs 400 epochs
Used
Unused

41
Used
Unused
2D CNN + RNN: skip redundancy

42
2. CNN + RNN
3. 3D convolutions
6. Capsule Networks

43
Ilg, Eddy, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. "Flownet 2.0: Evolution of
optical ﬂow estimation with deep networks." CVPR 2017. [code]

Optical ﬂow
44
Popular implementations for optical flow:
#Coarse2Fine Brox, Thomas, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. "High accuracy optical ﬂow estimation
based on a theory for warping." ECCV 2004.
● PyFlow
● FlowNet2
● Improved Dense Trajectories - IDT
● Lucas-Kanade (OpenCV)
● Farneback 2003 (OpenCV)
● Middlebury
● Horn and schunk
● Tikhonov regularized and vectorized
● DeepFlow (2013)
● (...)

Optical ﬂow
45
Deep Neural Networks have actually also been trained to predict optical flow:
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., FlowNet:
Learning Optical Flow With Convolutional Networks. ICCV 2015

Two-streams 2D CNNs
46
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in
videos." NIPS 2014.
Fusion

Two-streams 2D CNNs
47Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action
recognition." CVPR 2016. [code]
Fusion
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.

Two-streams 2D CNNs + Residual
48
Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition."
NIPS 2016. [code]
Solution: Extract optical flow for a stack of frames and feed it features in the
appearance stream.

Two-streams + Inﬂated 3D
49#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]

Two-streams 3D CNNs
50#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]

Two-streams 2D CNNs + RNN
51
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici.
"Beyond short snippets: Deep networks for video classiﬁcation." CVPR 2015

One stream RGB + Optical Flow
52
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
Leveraging the motion condition to modulate RGB features improves
detection accuracy.

53

54
2in1 has:
● half the computation and parameters of a two-stream equivalent
● better action detection accuracy (but worse for action classification).

Two-streams (feature visualization)
55
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "What have we learned from deep
representations for action recognition?." CVPR 2018.
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks
for Video Recognition." IJCV 2019.

Replacing C3D for C2D
56#S3D Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K.. Rethinking spatiotemporal feature learning: Speed-accuracy
trade-oﬀs in video classiﬁcation. ECCV 2018

57
2. CNN + RNN
3. 3D convolutions
6. Capsule Networks

58
Dual slow and fast frame rates
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video
recognition." ICCV 2019. [blog] [code]
2 fps
16 fps
High capacity model
Low capacity model

59
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video recognition."
ICCV 2019. [blog] [code]
Comparison with the state of the art on Kinetics-400

60
#AVSlowFast Xiao, Fanyi, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. "Audiovisual
SlowFast Networks for Video Recognition." arXiv preprint arXiv:2001.08740 (2020).

61
2. CNN + RNN
3. 3D convolutions
4. RGB + Optical Flow Two-stream CNN
6. Capsule Networks

62
Capsule Networks
#Videocapsulenet Duarte, Kevin, Yogesh Rawat, and Mubarak Shah. "Videocapsulenet: A simpliﬁed network for action
detection." NeurIPS 2018. [code]

63
2. CNN + RNN
3. 3D convolutions
6. Capsule Networks

64
Implementations
PyTorch
● Facebook FAIR
○ SlowFast
○ SlowOnly
○ C2D
○ I3D
○ Non-local Network
● Torchvision
○ ResNet 3D 18
○ ResNet MC 18
○ ResNet(2+1)D

65
Learn more
Rohit Ghosh, “Deep Learning for Videos: A 2018 Guide for Action recognition” (2018)

66
Related lectures [index]
Object tracking
[slides] [video]
Video Object segmentation
[slides] [video]
Self-supervised Learning from
Video Sequences [slides] [video]
Self-supervised Audiovisual
Learning [slides] [video]

Neural Architectures for Video Encoding

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Neural Architectures for Video Encoding

Similar to Neural Architectures for Video Encoding (20)

More from Universitat Politècnica de Catalunya

More from Universitat Politècnica de Catalunya (20)

Recently uploaded

Recently uploaded (20)

Neural Architectures for Video Encoding