SlideShare a Scribd company logo
1 of 67
Download to read offline
[http://pagines.uab.cat/mcv/]
Module 6 - Day 2 - Lecture 1
Neural Architectures for
Video Encoding
27th February 2020
Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Acknowledgements
2
Víctor Campos Alberto MontesAmaia Salvador Santiago Pascual
Video lectures
3
Víctor Campos
Deep Learning for Computer Vision 2018
UPC TelecomBCN
Xavier Giro-i-Nieto
Deep Learning for Computer Vision 2019
UPC TelecomBCN
4
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale
video classification with convolutional neural networks. CVPR 2014
Motivation
5
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification
with convolutional neural networks. CVPR 2014.
What is a video?
6
● Formally, a video is a 3D signal
○ Spatial coordinates: x, y
○ Temporal coordinate: t
● If we fix t, we obtain an image. We can understand
videos as sequences of images (a.k.a. frames)
Convolutional Neural Networks (CNN) provide state of the
art performance on still image analysis tasks
How do we work with images?
7
How can we extend CNNs to image sequences?
How do we work with videos ?
8
f ŷ
9
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
10
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
Single frame models
11
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a temporal pooling
operation (e.g. max, average).
Problem: pooling is not aware
of the temporal order!
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
ŷ
12
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale
video classification with convolutional neural networks. CVPR 2014
Single frame models
13
Single frame models
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale
video classification with convolutional neural networks. CVPR 2014
14
#DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale
video classification with convolutional neural networks. CVPR 2014
Single frame models
15
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow Two-stream CNN
Deep Video Architectures
3D CNN (C3D)
16
We can add an extra dimension to standard 2D CNN filters:
● A grayscale video clip may be represented by a tensor of size H x W x L.
● A 3D conv filter for the 1st conv layer may be of size k x k x d
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features
with 3D convolutional networks" ICCV 2015
time
17
The video needs to be split into chunks (also known as clips) with a number of
frames that fits the receptive field of the C3D (eg. 16 frames).
#C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features
with 3D convolutional networks" ICCV 2015
16-frame clip
16-frame clip
16-frame clip
...
Average
pooling
4096-dimvideodescriptor
4096-dimvideodescriptor
L2 norm
3D CNN (C3D) + temporal pooling
18
3D CNN (C3D)
#3D Resnet Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3d cnns retrace the history of 2d
cnns and imagenet?." CVPR 2018. [code]
Much larger datasets (eg. Kinetics) allow training much deeper C3D CNNs.
19
Limitation:
● How can we use pre-trained 2D networks to initialize C3D training ?
3D CNN
20
Inflated 3D weights (I3D)
We can adapt C2D filters to C3D by replicating them along the temporal axis
(and scaling its values by 1/d).
inflation
#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
k x k
C2D filter I3D filter
k x k x d
Filter inflation allows reusing C2D pre-trained filters (eg. from ImageNet) to
initialize C3D filters.
21
Residual Learning
Residual learning: reformulate the layers as learning residual functions with
respect to the identity F(x), instead of learning unreferenced functions H(x)
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR 2016. [slides]
Plain Net Residual Net
22
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture
search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
3D CNN + Residual (Res3D)
2D ResNet Res3D
23
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture
search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
3D CNN + Residual (Res3D)
24
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture
search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
3D CNN + Residual (Res3D)
Gain in performance comes with the cost of much more computation and larger
memory footprint.
25
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer
look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
Mixed Convolutions (MC & rMC)
Mixed Convolutions (MC)
● Early layers: 3D convolutions
● Top layers: 2D convolutions
Reversed Mixed Convolutions (rMC)
● Early layers: 2D convolutions
● Top layers: 3D convolutions
26
#MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer
look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
MC ResNets:
● 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity
● match the performance of 3D ResNets, which have 3 times as many parameters.
Mixed Convolutions (MC & rMC)
27
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A
closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
C(2+1)D
Splits the computation into a spatial 2D convolution followed by a temporal 1D
convolution:. In the case of an input of a single feature channel:
C3D C(2+1)D
Number of 2D filters
28
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A
closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
C(2+1D) + Residual
29
#R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A
closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code]
2+1D CNN + Residual
30
Separable 2D (group) convolutions
Depth-wise (in channels) separable convolutions.
#XCeption Chollet, François. "Xception: Deep learning with depthwise separable convolutions." CVPR 2017.
Figure: Sik-Ho Tsang
31
Res3D + Separable (group) convolutions
#CSN D. Tran, H. Wang, L. Torresani and M. Feiszli. Video Classification with Channel-Separated
Convolutional Networks. ICCV 2019. [Caffe2]
Channel interactions and spatiotemporal interactions are separated.
1 group 2 groups
# groups =
# input channels
32
Basic deep architectures for video:
1. Single frame models
2. Spatio-temporal convolutions
3. CNN + RNN
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
33
The hidden layers and the
output depend from previous
states of the hidden layers
Recurrent Neural Network (RNN)
2D CNN + RNN
34
CNN CNN CNN...
Recurrent Neural Networks
are well suited for processing
sequences.
Problem: RNNs are sequential
and cannot be parallelized.
RNN RNN RNN...
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Videolectures on RNNs:
DLSL 2017, “RNN (I)”
“RNN (II)”
DLAI 2018, “RNN”
35
#Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture
search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code]
2D CNN + RNN vs Spatio-temporal Convs
36
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
2D CNN + RNN
37
Limitations:
● How can we handle longer videos?
● How can we capture longer temporal dependencies?
3D CNN
38
3D CNN + RNN
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed
Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)
39
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip
RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
2D CNN + RNN: skip redundancy
CNN CNN CNN...
RNN RNN RNN... After processing a frame, let
the RNN decide how many
future frames can be skipped
In skipped frames, simply
copy the output and state
from the previous time step
There is no ground truth for
which frames can be skipped.
The RNN learns it by itself
during training!
40
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip
RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
2D CNN + RNN
Epochs for Skip LSTM (λ = 10-4
)
~30% acc ~50% acc ~70% acc ~95% acc
11 epochs 51 epochs 101 epochs 400 epochs
Used
Unused
41
Used
Unused
2D CNN + RNN: skip redundancy
#SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip
RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
42
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
43
Ilg, Eddy, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. "Flownet 2.0: Evolution of
optical flow estimation with deep networks." CVPR 2017. [code]
Optical flow
44
Popular implementations for optical flow:
#Coarse2Fine Brox, Thomas, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. "High accuracy optical flow estimation
based on a theory for warping." ECCV 2004.
● PyFlow
● FlowNet2
● Improved Dense Trajectories - IDT
● Lucas-Kanade (OpenCV)
● Farneback 2003 (OpenCV)
● Middlebury
● Horn and schunk
● Tikhonov regularized and vectorized
● DeepFlow (2013)
● (...)
Optical flow
45
Deep Neural Networks have actually also been trained to predict optical flow:
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., FlowNet:
Learning Optical Flow With Convolutional Networks. ICCV 2015
Two-streams 2D CNNs
46
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in
videos." NIPS 2014.
Fusion
Two-streams 2D CNNs
47Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action
recognition." CVPR 2016. [code]
Fusion
Problem: Single frame models do not take into account motion in videos.
Solution: extract optical flow for a stack of frames and use it as an input to a
CNN.
Two-streams 2D CNNs + Residual
48
Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition."
NIPS 2016. [code]
Problem: Single frame models do not take into account motion in videos.
Solution: Extract optical flow for a stack of frames and feed it features in the
appearance stream.
Two-streams + Inflated 3D
49#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
Two-streams 3D CNNs
50#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
Two-streams 2D CNNs + RNN
51
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici.
"Beyond short snippets: Deep networks for video classification." CVPR 2015
One stream RGB + Optical Flow
52
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
Leveraging the motion condition to modulate RGB features improves
detection accuracy.
One stream RGB + Optical Flow
53
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
One stream RGB + Optical Flow
54
#2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
2in1 has:
● half the computation and parameters of a two-stream equivalent
● better action detection accuracy (but worse for action classification).
Two-streams (feature visualization)
55
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "What have we learned from deep
representations for action recognition?." CVPR 2018.
Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks
for Video Recognition." IJCV 2019.
Replacing C3D for C2D
56#S3D Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K.. Rethinking spatiotemporal feature learning: Speed-accuracy
trade-offs in video classification. ECCV 2018
57
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
58
Dual slow and fast frame rates
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video
recognition." ICCV 2019. [blog] [code]
2 fps
16 fps
High capacity model
Low capacity model
59
Dual slow and fast frame rates
#SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video recognition."
ICCV 2019. [blog] [code]
Comparison with the state of the art on Kinetics-400
60
Dual slow and fast frame rates
#AVSlowFast Xiao, Fanyi, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. "Audiovisual
SlowFast Networks for Video Recognition." arXiv preprint arXiv:2001.08740 (2020).
61
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. RGB + Optical Flow Two-stream CNN
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
62
Capsule Networks
#Videocapsulenet Duarte, Kevin, Yogesh Rawat, and Mubarak Shah. "Videocapsulenet: A simplified network for action
detection." NeurIPS 2018. [code]
63
Basic deep architectures for video:
1. Single frame models
2. CNN + RNN
3. 3D convolutions
4. RGB + Optical Flow
5. Dual slow and fast frame rates
6. Capsule Networks
Deep Video Architectures
64
Implementations
PyTorch
● Facebook FAIR
○ SlowFast
○ SlowOnly
○ C2D
○ I3D
○ Non-local Network
● Torchvision
○ ResNet 3D 18
○ ResNet MC 18
○ ResNet(2+1)D
65
Learn more
Rohit Ghosh, “Deep Learning for Videos: A 2018 Guide for Action recognition” (2018)
66
Related lectures [index]
Object tracking
[slides] [video]
Video Object segmentation
[slides] [video]
Self-supervised Learning from
Video Sequences [slides] [video]
Self-supervised Audiovisual
Learning [slides] [video]
67
Questions ?

More Related Content

What's hot

Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Universitat Politècnica de Catalunya
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018Universitat Politècnica de Catalunya
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Universitat Politècnica de Catalunya
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Learning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networksLearning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networksSungminYou
 

What's hot (20)

Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
Video Saliency Prediction with Deep Neural Networks - Juan Jose Nieto - DCU 2019
 
Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)Deep Learning for Video: Language (UPC 2018)
Deep Learning for Video: Language (UPC 2018)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
 
One Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and VisionOne Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and Vision
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)Deep Learning for Video: Object Tracking (UPC 2018)
Deep Learning for Video: Object Tracking (UPC 2018)
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
 
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
Wav2Pix: Speech-conditioned face generation using Generative Adversarial Netw...
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Learning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networksLearning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networks
 

Similar to Neural Architectures for Video Encoding

Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)Universitat Politècnica de Catalunya
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Universitat Politècnica de Catalunya
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsPetteriTeikariPhD
 
IRJET- Identification of Missing Person in the Crowd using Pretrained Neu...
IRJET-  	  Identification of Missing Person in the Crowd using Pretrained Neu...IRJET-  	  Identification of Missing Person in the Crowd using Pretrained Neu...
IRJET- Identification of Missing Person in the Crowd using Pretrained Neu...IRJET Journal
 
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...Luba Elliott
 
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...thanhdowork
 
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Vignesh V Menon
 
U_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in DepthU_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in DepthManuel Nieves Sáez
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...CSCJournals
 
Hierarchical structure adaptive
Hierarchical structure adaptiveHierarchical structure adaptive
Hierarchical structure adaptiveNEERAJ BAGHEL
 
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTM
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTMVIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTM
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTMIRJET Journal
 
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWT
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWTSteganography using Coefficient Replacement and Adaptive Scaling based on DTCWT
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWTCSCJournals
 
Image Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrImage Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrMelanie Smith
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 

Similar to Neural Architectures for Video Encoding (20)

Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)
 
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
 
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
Deep Learning for Computer Vision (3/4): Video Analytics @ laSalle 2016
 
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
Video Analysis (D4L2 2017 UPC Deep Learning for Computer Vision)
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 
IRJET- Identification of Missing Person in the Crowd using Pretrained Neu...
IRJET-  	  Identification of Missing Person in the Crowd using Pretrained Neu...IRJET-  	  Identification of Missing Person in the Crowd using Pretrained Neu...
IRJET- Identification of Missing Person in the Crowd using Pretrained Neu...
 
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...
Emily Denton - Unsupervised Learning of Disentangled Representations from Vid...
 
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
 
1918 1923
1918 19231918 1923
1918 1923
 
1918 1923
1918 19231918 1923
1918 1923
 
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resoluti...
 
A0540106
A0540106A0540106
A0540106
 
U_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in DepthU_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in Depth
 
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...
 
Hierarchical structure adaptive
Hierarchical structure adaptiveHierarchical structure adaptive
Hierarchical structure adaptive
 
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTM
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTMVIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTM
VIDEO BASED SIGN LANGUAGE RECOGNITION USING CNN-LSTM
 
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWT
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWTSteganography using Coefficient Replacement and Adaptive Scaling based on DTCWT
Steganography using Coefficient Replacement and Adaptive Scaling based on DTCWT
 
Image Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd AswdrImage Compression Using Hybrid Svd Wdr And Svd Aswdr
Image Compression Using Hybrid Svd Wdr And Svd Aswdr
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationUniversitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
Recurrent Neural Networks RNN - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Backpropagation for Deep Learning
Backpropagation for Deep LearningBackpropagation for Deep Learning
Backpropagation for Deep Learning
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
 

Recently uploaded

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 

Neural Architectures for Video Encoding

  • 1. [http://pagines.uab.cat/mcv/] Module 6 - Day 2 - Lecture 1 Neural Architectures for Video Encoding 27th February 2020 Xavier Giro-i-Nieto @DocXavi xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Barcelona Supercomputing Center
  • 2. Acknowledgements 2 Víctor Campos Alberto MontesAmaia Salvador Santiago Pascual
  • 3. Video lectures 3 Víctor Campos Deep Learning for Computer Vision 2018 UPC TelecomBCN Xavier Giro-i-Nieto Deep Learning for Computer Vision 2019 UPC TelecomBCN
  • 4. 4 #DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  • 5. Motivation 5 #DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
  • 6. What is a video? 6 ● Formally, a video is a 3D signal ○ Spatial coordinates: x, y ○ Temporal coordinate: t ● If we fix t, we obtain an image. We can understand videos as sequences of images (a.k.a. frames)
  • 7. Convolutional Neural Networks (CNN) provide state of the art performance on still image analysis tasks How do we work with images? 7
  • 8. How can we extend CNNs to image sequences? How do we work with videos ? 8 f ŷ
  • 9. 9 Basic deep architectures for video: 1. Single frame models 2. Spatio-temporal convolutions 3. CNN + RNN 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 10. 10 Basic deep architectures for video: 1. Single frame models 2. Spatio-temporal convolutions 3. CNN + RNN 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 11. Single frame models 11 CNN CNN CNN... Combination method Combination is commonly implemented as a small NN on top of a temporal pooling operation (e.g. max, average). Problem: pooling is not aware of the temporal order! Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015 ŷ
  • 12. 12 #DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014 Single frame models
  • 13. 13 Single frame models #DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  • 14. 14 #DeepVideo Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014 Single frame models
  • 15. 15 Basic deep architectures for video: 1. Single frame models 2. Spatio-temporal convolutions 3. CNN + RNN 4. RGB + Optical Flow Two-stream CNN Deep Video Architectures
  • 16. 3D CNN (C3D) 16 We can add an extra dimension to standard 2D CNN filters: ● A grayscale video clip may be represented by a tensor of size H x W x L. ● A 3D conv filter for the 1st conv layer may be of size k x k x d #C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks" ICCV 2015 time
  • 17. 17 The video needs to be split into chunks (also known as clips) with a number of frames that fits the receptive field of the C3D (eg. 16 frames). #C3D Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks" ICCV 2015 16-frame clip 16-frame clip 16-frame clip ... Average pooling 4096-dimvideodescriptor 4096-dimvideodescriptor L2 norm 3D CNN (C3D) + temporal pooling
  • 18. 18 3D CNN (C3D) #3D Resnet Hara, Kensho, Hirokatsu Kataoka, and Yutaka Satoh. "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?." CVPR 2018. [code] Much larger datasets (eg. Kinetics) allow training much deeper C3D CNNs.
  • 19. 19 Limitation: ● How can we use pre-trained 2D networks to initialize C3D training ? 3D CNN
  • 20. 20 Inflated 3D weights (I3D) We can adapt C2D filters to C3D by replicating them along the temporal axis (and scaling its values by 1/d). inflation #I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code] k x k C2D filter I3D filter k x k x d Filter inflation allows reusing C2D pre-trained filters (eg. from ImageNet) to initialize C3D filters.
  • 21. 21 Residual Learning Residual learning: reformulate the layers as learning residual functions with respect to the identity F(x), instead of learning unreferenced functions H(x) He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." CVPR 2016. [slides] Plain Net Residual Net
  • 22. 22 #Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code] 3D CNN + Residual (Res3D) 2D ResNet Res3D
  • 23. 23 #Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code] 3D CNN + Residual (Res3D)
  • 24. 24 #Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code] 3D CNN + Residual (Res3D) Gain in performance comes with the cost of much more computation and larger memory footprint.
  • 25. 25 #MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code] Mixed Convolutions (MC & rMC) Mixed Convolutions (MC) ● Early layers: 3D convolutions ● Top layers: 2D convolutions Reversed Mixed Convolutions (rMC) ● Early layers: 2D convolutions ● Top layers: 3D convolutions
  • 26. 26 #MC Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code] MC ResNets: ● 3-4% gain in clip-level accuracy over 2D ResNets of comparable capacity ● match the performance of 3D ResNets, which have 3 times as many parameters. Mixed Convolutions (MC & rMC)
  • 27. 27 #R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code] C(2+1)D Splits the computation into a spatial 2D convolution followed by a temporal 1D convolution:. In the case of an input of a single feature channel: C3D C(2+1)D Number of 2D filters
  • 28. 28 #R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code] C(2+1D) + Residual
  • 29. 29 #R(2+1)D Tran, Du, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. "A closer look at spatiotemporal convolutions for action recognition." CVPR 2018. [code] 2+1D CNN + Residual
  • 30. 30 Separable 2D (group) convolutions Depth-wise (in channels) separable convolutions. #XCeption Chollet, François. "Xception: Deep learning with depthwise separable convolutions." CVPR 2017. Figure: Sik-Ho Tsang
  • 31. 31 Res3D + Separable (group) convolutions #CSN D. Tran, H. Wang, L. Torresani and M. Feiszli. Video Classification with Channel-Separated Convolutional Networks. ICCV 2019. [Caffe2] Channel interactions and spatiotemporal interactions are separated. 1 group 2 groups # groups = # input channels
  • 32. 32 Basic deep architectures for video: 1. Single frame models 2. Spatio-temporal convolutions 3. CNN + RNN 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 33. 33 The hidden layers and the output depend from previous states of the hidden layers Recurrent Neural Network (RNN)
  • 34. 2D CNN + RNN 34 CNN CNN CNN... Recurrent Neural Networks are well suited for processing sequences. Problem: RNNs are sequential and cannot be parallelized. RNN RNN RNN... Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Videolectures on RNNs: DLSL 2017, “RNN (I)” “RNN (II)” DLAI 2018, “RNN”
  • 35. 35 #Res3D Tran, Du, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. "Convnet architecture search for spatiotemporal feature learning." arXiv preprint arXiv:1708.05038 (2017). [code] 2D CNN + RNN vs Spatio-temporal Convs
  • 36. 36 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code] 2D CNN + RNN
  • 37. 37 Limitations: ● How can we handle longer videos? ● How can we capture longer temporal dependencies? 3D CNN
  • 38. 38 3D CNN + RNN A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)
  • 39. 39 #SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018. 2D CNN + RNN: skip redundancy CNN CNN CNN... RNN RNN RNN... After processing a frame, let the RNN decide how many future frames can be skipped In skipped frames, simply copy the output and state from the previous time step There is no ground truth for which frames can be skipped. The RNN learns it by itself during training!
  • 40. 40 #SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018. 2D CNN + RNN Epochs for Skip LSTM (λ = 10-4 ) ~30% acc ~50% acc ~70% acc ~95% acc 11 epochs 51 epochs 101 epochs 400 epochs Used Unused
  • 41. 41 Used Unused 2D CNN + RNN: skip redundancy #SkipRNN Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
  • 42. 42 Basic deep architectures for video: 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 43. 43 Ilg, Eddy, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. "Flownet 2.0: Evolution of optical flow estimation with deep networks." CVPR 2017. [code]
  • 44. Optical flow 44 Popular implementations for optical flow: #Coarse2Fine Brox, Thomas, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. "High accuracy optical flow estimation based on a theory for warping." ECCV 2004. ● PyFlow ● FlowNet2 ● Improved Dense Trajectories - IDT ● Lucas-Kanade (OpenCV) ● Farneback 2003 (OpenCV) ● Middlebury ● Horn and schunk ● Tikhonov regularized and vectorized ● DeepFlow (2013) ● (...)
  • 45. Optical flow 45 Deep Neural Networks have actually also been trained to predict optical flow: Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015
  • 46. Two-streams 2D CNNs 46 Problem: Single frame models do not take into account motion in videos. Solution: extract optical flow for a stack of frames and use it as an input to a CNN. Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014. Fusion
  • 47. Two-streams 2D CNNs 47Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Fusion Problem: Single frame models do not take into account motion in videos. Solution: extract optical flow for a stack of frames and use it as an input to a CNN.
  • 48. Two-streams 2D CNNs + Residual 48 Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition." NIPS 2016. [code] Problem: Single frame models do not take into account motion in videos. Solution: Extract optical flow for a stack of frames and feed it features in the appearance stream.
  • 49. Two-streams + Inflated 3D 49#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
  • 50. Two-streams 3D CNNs 50#I3D Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
  • 51. Two-streams 2D CNNs + RNN 51 Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
  • 52. One stream RGB + Optical Flow 52 #2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019. Leveraging the motion condition to modulate RGB features improves detection accuracy.
  • 53. One stream RGB + Optical Flow 53 #2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019.
  • 54. One stream RGB + Optical Flow 54 #2in1 Zhao, Jiaojiao, and Cees GM Snoek. "Dance with Flow: Two-in-One Stream Action Detection." CVPR 2019. 2in1 has: ● half the computation and parameters of a two-stream equivalent ● better action detection accuracy (but worse for action classification).
  • 55. Two-streams (feature visualization) 55 Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "What have we learned from deep representations for action recognition?." CVPR 2018. Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks for Video Recognition." IJCV 2019.
  • 56. Replacing C3D for C2D 56#S3D Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K.. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. ECCV 2018
  • 57. 57 Basic deep architectures for video: 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 58. 58 Dual slow and fast frame rates #SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video recognition." ICCV 2019. [blog] [code] 2 fps 16 fps High capacity model Low capacity model
  • 59. 59 Dual slow and fast frame rates #SlowFast Feichtenhofer, Christoph, Haoqi Fan, Jitendra Malik, and Kaiming He. "Slowfast networks for video recognition." ICCV 2019. [blog] [code] Comparison with the state of the art on Kinetics-400
  • 60. 60 Dual slow and fast frame rates #AVSlowFast Xiao, Fanyi, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. "Audiovisual SlowFast Networks for Video Recognition." arXiv preprint arXiv:2001.08740 (2020).
  • 61. 61 Basic deep architectures for video: 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. RGB + Optical Flow Two-stream CNN 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 62. 62 Capsule Networks #Videocapsulenet Duarte, Kevin, Yogesh Rawat, and Mubarak Shah. "Videocapsulenet: A simplified network for action detection." NeurIPS 2018. [code]
  • 63. 63 Basic deep architectures for video: 1. Single frame models 2. CNN + RNN 3. 3D convolutions 4. RGB + Optical Flow 5. Dual slow and fast frame rates 6. Capsule Networks Deep Video Architectures
  • 64. 64 Implementations PyTorch ● Facebook FAIR ○ SlowFast ○ SlowOnly ○ C2D ○ I3D ○ Non-local Network ● Torchvision ○ ResNet 3D 18 ○ ResNet MC 18 ○ ResNet(2+1)D
  • 65. 65 Learn more Rohit Ghosh, “Deep Learning for Videos: A 2018 Guide for Action recognition” (2018)
  • 66. 66 Related lectures [index] Object tracking [slides] [video] Video Object segmentation [slides] [video] Self-supervised Learning from Video Sequences [slides] [video] Self-supervised Audiovisual Learning [slides] [video]