@DocXavi
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]
Module 6
Deep Learning for Video:
Architectures
26th February 2019
Our team
Xavier Giro-i-Nieto
• Web: https://imatge.upc.edu/web/people/xavier-giro
Associate Professor at Universitat Politecnica de Catalunya (UPC)
IDEAI Center for
Intelligent Data Science
& Artificial Intelligence
Our team & Our Deep Research
DL online courses by UPC TelecomBCN:
You can audit this one.
● MSc course [2017] [2018]
● BSc course [2018] [2019]
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 4th edition (2019)
● 1st edition (2017)
● 2nd edition (2018)
● 3rd edition - NLP (2019)
Acknowledgements
5
Víctor Campos, Alberto Montes, Amaia Salvador, Santiago Pascual
Video lecture
6
Deep Learning for Computer Vision (UPC 2018)
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
7
Densely linked slides
Outline
8
1. Architectures
2. Tips and tricks
9
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
Motivation
10
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with
convolutional neural networks. CVPR 2014.
What is a video?
11
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image, so we can understand videos as sequences of images (a.k.a. frames)
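To make this concrete, here is a minimal Python sketch of the "video as a 3D signal" view (shapes are illustrative, not from the slides):

```python
import torch

# A video as a 3D signal (plus a channel axis): T frames of H x W pixels.
video = torch.randn(300, 240, 320, 3)   # (T, H, W, C)

# Fixing t yields a single image (frame).
frame = video[42]                        # (H, W, C)
```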
Convolutional Neural Networks (CNNs) provide state-of-the-art performance on image analysis tasks.
How do we work with images?
12
How can we extend CNNs to image sequences?
How do we work with videos?
13
[Diagram: a model f maps a video to a prediction ŷ]
14
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Deep Video Architectures
15
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Deep Video Architectures
Single frame models
16
[Diagram: a shared 2D CNN processes each frame; a combination method merges the per-frame outputs into a prediction ŷ]
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Problem: pooling is not aware
of the temporal order!
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
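As a hedged illustration of this pattern (not the exact model from the cited papers), a minimal PyTorch sketch with an assumed ResNet-18 backbone and order-agnostic average pooling over time:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SingleFrameModel(nn.Module):
    """Run a 2D CNN on every frame, then pool over time (order-agnostic)."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any 2D CNN works
        backbone.fc = nn.Identity()               # keep the 512-dim features
        self.backbone = backbone
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                          # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        feats = self.backbone(x.view(b * t, c, h, w))  # (B*T, 512)
        feats = feats.view(b, t, -1).mean(dim=1)       # average pooling over time
        return self.classifier(feats)
```

Note how the `mean` over the time axis discards frame order entirely, which is exactly the problem pointed out above.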
17
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
Single frame models
18
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
Single frame models
19
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
Single frame models
20
João Carreira, Viorica Pătrăucean, Laurent Mazare, Andrew Zisserman, Simon Osindero, “Massively Parallel Video
Networks” ECCV 2018 [video]
Single frame models: Temporal redundancy
21
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Deep Video Architectures
22 Slide credit: Santi Pascual
If we have a sequence of samples...
predict sample x[t+1] knowing previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}
Limitation of Feed Forward NN (as CNNs)
23 Slide credit: Santi Pascual
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Diagram: a feed-forward network predicts x[t+1] from the window x[t-L], …, x[t-1], x[t]]
Limitation of Feed Forward NN (as CNNs)
24 Slide credit: Santi Pascual
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Diagram: the window slides one step; x[t+2] is predicted from x[t-L+1], …, x[t], x[t+1]]
Limitation of Feed Forward NN (as CNNs)
25 Slide credit: Santi Pascual
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Diagram: the window keeps sliding; x[t+3] is predicted from x[t-L+2], …, x[t+1], x[t+2]]
Limitation of Feed Forward NN (as CNNs)
[Diagram: ever-longer windows x1, …, xL; x1, …, x2L; x1, …, x3L feeding the network]
Problems with the feed-forward + static-window approach:
● What happens if we increase L? → Fast growth of the number of parameters!
● Decisions are independent between time steps!
○ The network does not care about what happened at previous time steps; only the present window matters → a poor fit for sequential data
● Cumbersome padding when there are not enough samples to fill a window of size L
○ Cannot work with variable sequence lengths
Slide credit: Santi Pascual
Limitation of Feed Forward NN (as CNNs)
27
The hidden layers and the output depend on previous states of the hidden layers.
Recurrent Neural Network (RNN)
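A minimal sketch of the vanilla RNN recurrence this slide refers to (notation and shapes assumed for illustration):

```python
import torch

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN update: h_t depends on both the current input and
    the previous hidden state, so the past influences every output."""
    return torch.tanh(x_t @ W_x + h_prev @ W_h + b)

# Unrolling over a sequence carries the state forward step by step.
W_x, W_h, b = torch.randn(64, 128), torch.randn(128, 128), torch.zeros(128)
h = torch.zeros(128)
for x_t in torch.randn(10, 64):          # 10 time steps of 64-dim inputs
    h = rnn_step(x_t, h, W_x, W_h, b)
```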
2D CNN + RNN
28
[Diagram: a 2D CNN encodes each frame; an RNN processes the sequence of CNN features]
Recurrent Neural Networks are well suited for processing sequences.
Problem: RNNs are sequential and cannot be parallelized.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Videolectures on RNNs:
DLSL 2017, “RNN (I)”
“RNN (II)”
DLAI 2018, “RNN”
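A hedged sketch of the 2D CNN + RNN pattern (an assumed ResNet-18 encoder plus an LSTM; the LRCN paper's exact configuration differs):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    """A CNN encodes each frame; an LSTM aggregates them in temporal order."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                     # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.lstm(f)                 # steps are processed sequentially
        return self.fc(out[:, -1])            # classify from the last state
```

Unlike pooling, the LSTM state depends on frame order; the price is the sequential (non-parallelizable) recurrence noted above.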
29
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
2D CNN + RNN
30
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to
Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
2D CNN + RNN: redundancy
[Diagram: a 2D CNN + RNN where the RNN skips state updates on redundant frames]
After processing a frame, let the RNN decide how many future frames can be skipped.
In skipped frames, simply copy the output and state from the previous time step.
There is no ground truth for which frames can be skipped: the RNN learns it by itself during training!
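A minimal sketch of the skip-update idea (simplified from the Skip RNN paper: the learned binarization of the update gate uses a straight-through estimator, omitted here, and in practice the candidate state is only computed when the gate fires):

```python
import torch
import torch.nn as nn

def skip_rnn_step(cell, x_t, h_prev, update_prob):
    """Update the state only when the gate fires; otherwise copy it.

    `update_prob` is the (learned) probability of updating at this step.
    Skipped steps reuse the previous state, so their frames need no RNN
    (and, upstream, no CNN) computation.
    """
    u = (update_prob > 0.5).float()          # binary update gate u_t
    candidate = cell(x_t, h_prev)            # regular RNN cell update
    return u * candidate + (1.0 - u) * h_prev

# Example with an RNNCell (input 8-dim, hidden 16-dim).
cell = nn.RNNCell(8, 16)
h = torch.zeros(1, 16)
h = skip_rnn_step(cell, torch.randn(1, 8), h, torch.tensor(0.9))
```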
31
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to
Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
[Figure: frames marked Used vs. Unused by the Skip RNN]
2D CNN + RNN: redundancy
32
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to
Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
2D CNN + RNN
Epochs for Skip LSTM (λ = 10⁻⁴):
~30% acc: 11 epochs | ~50% acc: 51 epochs | ~70% acc: 101 epochs | ~95% acc: 400 epochs
[Figure: frames marked Used vs. Unused by the Skip LSTM]
33
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Deep Video Architectures
3D CNN (C3D)
34
We can add an extra dimension to standard CNNs:
● An image is an HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks" ICCV 2015
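A minimal sketch of a 3D convolution sliding a K×M×N kernel over (T, H, W) (shapes are illustrative; 112×112 is the input size used by C3D):

```python
import torch
import torch.nn as nn

# A 3D convolution slides a K x M x N kernel over time and space.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)   # K = M = N = 3
clip = torch.randn(1, 3, 16, 112, 112)   # (B, C, T, H, W): one 16-frame clip
out = conv3d(clip)                        # (1, 64, 16, 112, 112)
```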
35
The video needs to be split into chunks (also known as clips) with a number of
frames that fits the receptive field of the C3D. Usually clips have 16 frames.
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks" ICCV 2015
[Diagram: 16-frame clips → C3D → fc features → average → L2 norm → 4096-dim video descriptor]
3D CNN (C3D)
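A hedged sketch of the descriptor pipeline described above, assuming per-clip 4096-dim fc features have already been extracted:

```python
import torch
import torch.nn.functional as F

def video_descriptor(clip_features):
    """Average per-clip C3D features over the video, then L2-normalize.

    clip_features: (num_clips, 4096) tensor of fc activations,
    one row per 16-frame clip, following the C3D recipe.
    """
    v = clip_features.mean(dim=0)          # average over clips
    return F.normalize(v, p=2, dim=0)      # 4096-dim, unit L2 norm
```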
36
3D convolutions + Residual connections
Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search
for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
3D CNN (Res3D)
37
Limitation:
● How can we use pre-trained networks?
3D CNN
38
Re-use pre-trained 2D CNNs for 3D CNNs
We can add an extra dimension to standard CNNs:
● An image is an HxWxD tensor: MxNxD’ conv filters
● A video is a TxHxWxD tensor: KxMxNxD’ conv filters
We can convert an MxNxD’ conv filter into a KxMxNxD’ filter by replicating it K times along the time axis and scaling its values by 1/K.
● This makes it possible to leverage networks pre-trained on ImageNet and alleviates the computational burden of training from scratch
Feichtenhofer et al., Spatiotemporal Residual Networks for Video Action Recognition, NIPS 2016
Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset, CVPR 2017
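A minimal sketch of this filter inflation (the I3D-style "bootstrapping"; the tensor layout follows PyTorch's (out, in, T, H, W) convention):

```python
import torch

def inflate_2d_filter(w2d, K):
    """Inflate a 2D conv filter into a 3D one by replication.

    w2d: (out_ch, in_ch, M, N) pre-trained 2D weights.
    Returns (out_ch, in_ch, K, M, N): the filter repeated K times along
    time and scaled by 1/K, so its response on a static (repeated-frame)
    video matches the original 2D filter's response on one frame.
    """
    return w2d.unsqueeze(2).repeat(1, 1, K, 1, 1) / K

w2d = torch.randn(64, 3, 7, 7)            # e.g. a first-layer filter bank
w3d = inflate_2d_filter(w2d, K=5)         # (64, 3, 5, 7, 7)
```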
39
Limitations:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
3D CNN
40
3D CNN + RNN
Montes, A., Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)
41
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. Two-stream CNN
Deep Video Architectures
Two-stream 2D CNNs
42
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in
videos." NIPS 2014.
Fusion
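A hedged sketch of the two-stream pattern (assumed ResNet-18 backbones and a simple averaging fusion of class scores; the original paper used different backbones and fusion variants):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStream(nn.Module):
    """RGB stream + optical-flow stream with late fusion of class scores.

    The flow stream takes a stack of L flow fields (2L channels: the x and
    y components), as in Simonyan & Zisserman's setup.
    """
    def __init__(self, num_classes, flow_len=10):
        super().__init__()
        self.spatial = models.resnet18(weights=None, num_classes=num_classes)
        self.temporal = models.resnet18(weights=None, num_classes=num_classes)
        # Widen the first conv to accept 2 * flow_len input channels.
        self.temporal.conv1 = nn.Conv2d(2 * flow_len, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):     # rgb: (B, 3, H, W), flow: (B, 2L, H, W)
        return (self.spatial(rgb) + self.temporal(flow)) / 2
```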
Two-stream 2D CNNs
43 Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Fusion
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Two-stream 2D CNNs
44
Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition."
NIPS 2016. [code]
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Two-stream 3D CNNs
45
Carreira, J., & Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
Two-stream 3D CNNs
46
Carreira, J., & Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
Two-stream 2D CNNs + RNN
47
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici.
"Beyond short snippets: Deep networks for video classification." CVPR 2015
Outline
48
1. Architectures
2. Tips and tricks
Large-scale datasets
49
● The reference dataset for image classification, ImageNet, has ~1.3M
images
○ Training a state-of-the-art CNN can take up to 2 weeks on a single GPU
● Now imagine that we have an ‘ImageNet’ of 1.3M videos
○ Assuming videos of 30s at 24fps, we have 936M frames
○ This is 720x ImageNet!
● Videos exhibit a large redundancy in time
○ We can reduce the frame rate without losing too much
information
Tips & Tricks by Víctor Campos (2017)
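The arithmetic behind the 720× figure, as a quick check:

```python
videos = 1_300_000            # an "ImageNet of videos"
frames = videos * 30 * 24     # 30 s per video at 24 fps -> 936,000,000 frames
ratio = frames / 1_300_000    # vs. ~1.3M ImageNet images
print(frames, ratio)          # 936000000 720.0
```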
Memory issues
50
● Current GPUs can fit batches of 32~64 images when training state-of-the-art CNNs
○ This means 32~64 video frames at once
● Memory footprint can be reduced in different ways if a pre-trained
CNN model is used
○ Freezing some of the lower layers, reducing the memory impact
of backprop
○ Extracting frame-level features and training a model on top of them (e.g. an RNN on top of CNN features). This is equivalent to freezing the whole architecture, but the CNN part needs to be computed only once.
Tips & Tricks by Víctor Campos (2017)
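A minimal sketch of the "freeze and precompute" trick (assumed ResNet-18; the key point is that no activations are stored for backprop through the CNN):

```python
import torch
import torchvision.models as models

cnn = models.resnet18(weights=None)       # in practice, load pre-trained weights
cnn.fc = torch.nn.Identity()
cnn.eval()
for p in cnn.parameters():                # freeze: no gradients, no optimizer state
    p.requires_grad = False

frames = torch.randn(64, 3, 224, 224)     # all frames of one video
with torch.no_grad():                      # also avoids storing activations
    feats = cnn(frames)                    # (64, 512): compute once, then reuse
                                           # for training the model on top (e.g. an RNN)
```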
I/O bottleneck
51
● In practice, applying deep learning to video analysis requires multi-GPU or distributed settings
● In such settings it is very important to avoid starving the GPUs or we
will not obtain any speedup
○ The next batch needs to be loaded and preprocessed to keep
the GPU as busy as possible
○ Using asynchronous data loading pipelines is a key factor
○ Loading individual files is slow due to the introduced overhead,
so using other formats such as TFRecord/HDF5/LMDB is highly
recommended
Tips & Tricks by Víctor Campos (2017)
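A minimal sketch of an asynchronous loading pipeline in PyTorch (a placeholder in-memory dataset stands in for a real packed-format reader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in practice, back it with a packed format
# (TFRecord/HDF5/LMDB) instead of one file per frame.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                        torch.randint(0, 10, (1000,)))

loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=8,        # asynchronous worker processes
                    pin_memory=True,      # faster host-to-GPU copies
                    prefetch_factor=2)    # batches each worker keeps ready

for images, labels in loader:             # next batches load while the GPU trains
    pass
```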
52
Learn more
Rohit Ghosh, “Deep Learning for Videos: A 2018 Guide for Action recognition” (2018)
53
Questions?
