Deep Convnets for Video Processing (Master in Computer Vision Barcelona, 2016)

@DocXavi
Module 3 - Lecture 10
Deep Convnets for
Video Processing
28 January 2016
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]

Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Learn more

Recognition
Demo: Clarifai
MIT Technology Review: “A start-up’s Neural Network Can Understand Video” (3/2/2015)

Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with
convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
Recognition

Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D
convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015

Recognition
Previous lectures
with Jose M. Álvarez

Recognition

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video
classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014
IEEE Conference on (pp. 1725-1732). IEEE.
Slides extracted from ReadCV seminar by Victor Campos
Recognition: DeepVideo

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
Recognition: DeepVideo: Demo

Recognition: DeepVideo: Architectures

Unsupervised learning [Le et al’11] Supervised learning [Karpathy et al’14]
Recognition: DeepVideo: Features

Recognition: DeepVideo: Multiscale

Recognition: DeepVideo: Results

Recognition

Recognition: C3D
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International
Conference on Computer Vision, pp. 4489-4497. 2015

Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks."
In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
Recognition: C3D: Demo

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition ICLR 2015.
Recognition: C3D: Spatial dimension
Spatial dimensions (XY) of the kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015).

Recognition: C3D: Temporal dimension
3D ConvNets are better suited to spatiotemporal feature learning than 2D ConvNets.
Temporal depth
2D ConvNets

A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best
performing architectures for 3D ConvNets
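
To make the 3 × 3 × 3 kernel concrete, here is a minimal sketch of a single-channel 3D convolution (implemented as cross-correlation, with no padding and stride 1). The function name `conv3d_valid` and the toy clip are illustrative assumptions, not the C3D code; plain Python lists are used only to keep the example dependency-free, and a real network would rely on a deep learning framework.

```python
def conv3d_valid(volume, kernel):
    """Valid 3D cross-correlation of a T x H x W volume with a kt x kh x kw kernel."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy clip (assumption): 4 frames of 5x5 pixels, all ones; averaging 3x3x3 kernel.
clip = [[[1.0] * 5 for _ in range(5)] for _ in range(4)]
kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]
features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 2 3 3
```

Note that the output still has a temporal axis (2 time steps here), so later layers can keep modelling motion.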

No gain when varying the temporal depth across layers.

Recognition: C3D: Architecture
Feature
vector

Recognition: C3D: Feature vector
Video sequence
16-frame clips
8-frame overlap
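
As a small sketch of the clip layout above: with 16-frame clips and an 8-frame overlap, consecutive clips start 8 frames apart. The helper name `split_into_clips` is hypothetical, and dropping trailing frames that do not fill a whole clip is an assumption of this example.

```python
def split_into_clips(num_frames, clip_len=16, overlap=8):
    """Return (start, end) frame indices of overlapping clips (end is exclusive)."""
    stride = clip_len - overlap
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]

clips = split_into_clips(40)
print(clips)  # [(0, 16), (8, 24), (16, 32), (24, 40)]
```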

Recognition: C3D: Feature vector
16-frame clips → Average → 4096-dim video descriptor → L2 norm → 4096-dim video descriptor
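
The averaging and L2 normalisation steps can be sketched as follows, assuming each clip has already been mapped to a feature vector. The function name `video_descriptor` is hypothetical, and 3-dimensional toy vectors stand in for the 4096-dimensional ones.

```python
import math

def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalise the result."""
    n = len(clip_features)
    dim = len(clip_features[0])
    mean = [sum(f[i] for f in clip_features) / n for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean

# Toy per-clip features (assumption): two clips, 3 dimensions each.
clip_features = [[3.0, 0.0, 4.0], [3.0, 0.0, 4.0]]
descriptor = video_descriptor(clip_features)
print(descriptor)  # [0.6, 0.0, 0.8]
```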

Recognition: C3D: Visualization
Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.

Recognition: C3D: Compactness

C3D features combined with a simple linear classifier outperform state-of-the-art methods on 4
benchmarks and are comparable with state-of-the-art methods on 2 other benchmarks.
Recognition: C3D: Performance

Recognition: C3D: Software
Implementation by Michael Gygli (GitHub)

Recognition: ImageNet Video
[ILSVRC 2015 Slides and videos]

Kai Kang et al, Object Detection in Videos with TubeLets and Multi-Context Cues (ILSVRC 2015) [video] [poster]

Optical Flow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In
Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE.

Optical Flow: Small vs Large

Optical Flow
Classic approach: rigid matching of HOG or SIFT descriptors.
Deep Matching: allow each subpatch to move:
● independently
● in a limited range depending on its size

Optical Flow: Deep Matching

Source: MATLAB R2015b documentation for normxcorr2 by MathWorks
Optical Flow: 2D correlation
Image
Sub-Image
Offset of the sub-image with respect to the image [0,0].

Instead of pre-trained filters, a convolution is defined between each patch of the reference image
and the target image. As a result, a correlation map is generated for each reference patch.
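
A minimal sketch of how one such correlation map could be computed, using zero-mean normalized cross-correlation in the spirit of normxcorr2. This is an illustration under stated assumptions, not the Deep Matching implementation: the function name `correlation_map` and the toy arrays are hypothetical, and only valid (fully overlapping) offsets are evaluated.

```python
import math

def correlation_map(image, patch):
    """Normalized cross-correlation of `patch` at every valid offset in `image`."""
    H, W = len(image), len(image[0])
    h, w = len(patch), len(patch[0])
    p = [v for row in patch for v in row]
    p_mean = sum(p) / len(p)
    p_zero = [v - p_mean for v in p]
    p_norm = math.sqrt(sum(v * v for v in p_zero))
    out = []
    for y in range(H - h + 1):
        row = []
        for x in range(W - w + 1):
            win = [image[y + dy][x + dx] for dy in range(h) for dx in range(w)]
            w_mean = sum(win) / len(win)
            w_zero = [v - w_mean for v in win]
            w_norm = math.sqrt(sum(v * v for v in w_zero))
            denom = p_norm * w_norm
            # Flat windows have zero variance; score them as 0 by convention.
            score = sum(a * b for a, b in zip(p_zero, w_zero)) / denom if denom > 0 else 0.0
            row.append(score)
        out.append(row)
    return out

# Toy target image (assumption) that contains the reference patch at offset (1, 1).
image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
patch = [[1, 2],
         [3, 4]]
cmap = correlation_map(image, patch)
best_score, best_offset = max((s, (y, x)) for y, row in enumerate(cmap) for x, s in enumerate(row))
print(best_offset)  # (1, 1)
```

The peak of the map gives the offset of the patch with respect to the image origin [0,0].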

The most discriminative response map
The least discriminative response map

Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Pyramid levels: 4x4, 8x8, 16x16 and 32x32 patches; bottom-up extraction (BU), then top-down matching (TD).

Pyramid levels: 4x4, 8x8, 16x16 and 32x32 patches (bottom-up extraction, BU)

Optical Flow: Deep Matching (BU)

Optical Flow: Deep Matching (TD)
Pyramid levels: 4x4, 8x8, 16x16 and 32x32 patches (top-down matching, TD)

Each local maximum in the top layer corresponds to a shift of one of the largest (32x32) patches.
Starting from a local maximum, we can retrieve the corresponding responses one scale below and focus on
the shifts of the sub-patches that generated it.

Figure panels: ground truth; dense HOG [Brox & Malik 2011]; Deep Matching.

Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning
Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).

Optical Flow: FlowNet

End-to-end supervised learning of optical flow.

Optical Flow: FlowNet (contracting)
Option A: Stack both input images together and feed them through a generic network.

Option B: Create two separate, yet identical processing streams for the two images and combine them at a
later stage.
Correlation layer:
Convolution between data patches of the two feature maps being combined.
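
A minimal sketch of the idea behind a correlation layer, under simplifying assumptions: patches are reduced to single positions (1x1), displacements are limited to `max_disp` pixels, and positions outside the map contribute zero. The function name `correlation_layer` and the toy feature maps are hypothetical; this is not the FlowNet implementation.

```python
def correlation_layer(feat_a, feat_b, max_disp=1):
    """Dot product between each feature vector of feat_a and the vectors of
    feat_b found at every displacement within [-max_disp, max_disp]^2."""
    H, W = len(feat_a), len(feat_a[0])
    disps = [(dy, dx)
             for dy in range(-max_disp, max_disp + 1)
             for dx in range(-max_disp, max_disp + 1)]
    out = [[[0.0] * len(disps) for _ in range(W)] for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for k, (dy, dx) in enumerate(disps):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    out[y][x][k] = sum(a * b for a, b in zip(feat_a[y][x], feat_b[yy][xx]))
    return out, disps

# Toy 3x3 feature maps with 2 channels (assumption): one distinctive vector
# that moves one pixel to the right between the two images.
feat_a = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
feat_b = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
feat_a[1][1] = [1.0, 2.0]
feat_b[1][2] = [1.0, 2.0]

corr, disps = correlation_layer(feat_a, feat_b)
best = max(range(len(disps)), key=lambda k: corr[1][1][k])
print(disps[best])  # (0, 1)
```

Because the data is compared with other data rather than with learned filters, this sketch has no trainable weights; the output has one channel per candidate displacement.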

Optical Flow: FlowNet (expanding)
Upconvolutional layers: unpooling of feature maps + convolution.
Upconvolved feature maps are concatenated with the corresponding feature map from the contracting part.

Since existing ground-truth datasets are not large enough to train a Convnet, a synthetic dataset (Flying Chairs)
is generated and augmented (translation, rotation and scaling transformations; additive Gaussian noise;
changes in brightness, contrast, gamma and color).
Convnets trained on this unrealistic data generalize well to existing datasets such as Sintel and KITTI.
Data
augmentation
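
A minimal sketch of the photometric part of such an augmentation (brightness, contrast, gamma and additive Gaussian noise) on a grayscale image with values in [0, 1]. The function name `augment` and all parameter ranges are illustrative assumptions, not the values used for FlowNet. Geometric transformations are left out because they would also require updating the ground-truth flow.

```python
import random

def augment(image, rng, noise_sigma=0.02):
    """Random brightness, contrast, gamma and additive Gaussian noise
    for a grayscale image given as rows of floats in [0, 1]."""
    brightness = rng.uniform(-0.1, 0.1)  # additive shift (assumed range)
    contrast = rng.uniform(0.8, 1.2)     # scaling around mid-gray (assumed range)
    gamma = rng.uniform(0.7, 1.5)        # gamma exponent (assumed range)
    out = []
    for row in image:
        new_row = []
        for v in row:
            v = (v - 0.5) * contrast + 0.5 + brightness
            v = min(max(v, 0.0), 1.0) ** gamma
            v += rng.gauss(0.0, noise_sigma)
            new_row.append(min(max(v, 0.0), 1.0))  # clip back to [0, 1]
        out.append(new_row)
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
image = [[0.2, 0.5], [0.7, 0.9]]
augmented = augment(image, rng)
```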

Object tracking: MDNet
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)

Object tracking: MDNet

Object tracking: MDNet: Architecture
Domain-specific layers are used during training for each sequence, but are replaced by a single one at test
time.

Object tracking: MDNet: Online update
MDNet is updated online at test time with hard negative mining, that is, selecting the negative
samples with the highest positive score.
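
The selection step can be sketched as follows, assuming the network has already produced a positive-class score for each negative sample. The helper name `hard_negatives` and the toy scores are hypothetical.

```python
def hard_negatives(negative_scores, k):
    """Return the indices of the k negative samples with the highest positive-class score."""
    order = sorted(range(len(negative_scores)),
                   key=lambda i: negative_scores[i],
                   reverse=True)
    return order[:k]

# Positive-class scores of five negative samples (toy values).
scores = [0.05, 0.80, 0.30, 0.65, 0.10]
print(hard_negatives(scores, 2))  # [1, 3]
```

These are the negatives the current model is most likely to confuse with the target, so they are the most informative ones for the online update.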

Object tracking: FCNT
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE
International Conference on Computer Vision, pp. 3119-3127. 2015 [code]

Object tracking: FCNT
Focus on conv4-3 and conv5-3 of VGG-16 network pre-trained for ImageNet image classification.
conv4-3 conv5-3

Object tracking: FCNT: Specialization
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking
sequence.

Object tracking: FCNT: Localization
Although trained for image classification, feature maps in conv5-3 enable object localization…
...but are not discriminative enough to distinguish different objects of the same category.

Object tracking: Localization
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.
[Slides from ReadCV]

Object tracking: FCNT: Localization
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3

Object tracking: FCNT: Architecture
SNet=Specific Network (online update)
GNet=General Network (fixed)

Object tracking: FCNT: Results

ConvNets: Software
Caffe http://caffe.berkeleyvision.org/
Torch (Overfeat) http://torch.ch/
Theano http://deeplearning.net/software/theano/
TensorFlow https://www.tensorflow.org/
MatConvNet (VLFeat) http://www.vlfeat.org/matconvnet/
CNTK (Microsoft) http://www.cntk.ai/

Seminar Series:
Compacting ConvNets
for End to End Learning
Tuesday February 2, 4pm
D5-010 Campus Nord
ConvNets: Learn more
Jose M Álvarez

Stanford course:
CS231n:
Convolutional Neural
Networks for Visual
Recognition

Online course:
Deep Learning
Taking machine
learning to the next
level

ReadCV seminar
Friendly reviews of SoA papers
Spring 2016:
Tuesdays at 11am

Barcelona
Convolucionada:
Deep Learning a l’abast
de tothom
Monday, February 1, 7pm @ FIB,
Campus Nord UPC
Grup d’estudi de machine learning
Barcelona

Summer course
Deep Learning for
Computer Vision
(2.5 ECTS for MSc & Phd)
July 4-8, 3-7pm

● Deep learning methods for vision (CVPR 2012)
● Tutorial on deep learning for vision (CVPR 2014)
● Kyunghyun Cho, “Deep Learning: Past, Present & Future”

“Machine learning” subreddit.

Check profile requirements for summer internships (disclaimer: offered to PhD students by default)
Company      Avg salary / hour    Avg salary / month
Yahoo        $43                  $6,880 ($43 x 160)
Apple        $37                  $5,920 ($37 x 160)
Google       $29.54-$31.32        $7,151
Facebook     $22.92               $6,150-$7,378
Microsoft    $22.63               $6,506-$7,171
Source: Glassdoor.com (internships in California. No stipends included)

Video: Cristian Canton’s talk “From Catalonia to America: notes on how to achieve a successful post-PhD
career” @ ACMCV 2015 & UPC

Li Fei-Fei, “How we’re teaching
computers to understand pictures”
TEDTalks 2014.

Jeremy Howard, “The wonderful
and terrifying implications of
computers that can learn”,
TEDTalks 2014.

● Neil Lawrence, OpenAI won’t benefit humanity without open data sharing
(The Guardian, 14/12/2015)

Is Computer Vision solved?
ConvNets: Discussion

ConvNets: Do you know them?
Antonio Torralba, MIT (former UPC)
Oriol Vinyals, Google (former UPC)
Jose M Álvarez, NICTA (former URL & UAB)
Joan Bruna, Berkeley (former UPC)
...and MANY MORE I am missing on this page (apologies).

ConvNets: Where you are studying
VisioCat dinner
@ CVPR 2015

Considering a PhD at GPI-UPC?
Currently, no direct funding available (check in the future).
We can support your application to scholarships:
External grant listings: UPC, UPF
Funding institution    Last deadline (as of 28/1/2016)
FI (Catalonia)         22/09/2015
FPU (Spain)            15/01/2016
Check our activity at https://imatge.upc.edu/web/

Image Classification
Our past research
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., “Cultural Event Recognition with Visual ConvNets and Temporal
Models”, in CVPR ChaLearn Looking at People Workshop 2015, 2015. [slides]
ChaLearn Workshop

Saliency Prediction
J. Pan and Giró-i-Nieto, X., “End-to-end Convolutional Network for Saliency Prediction”, in Large-scale Scene Understanding Challenge (LSUN) at CVPR
Workshops, Boston, MA (USA), 2015. [Slides]
Our current research
LSUN Challenge

Sentiment Analysis
[Slides]
CNN
V. Campos, Salvador, A., Jou, B., and Giró-i-Nieto, X., “Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction”, in 1st
International Workshop on Affect and Sentiment in Multimedia, Brisbane, Australia, 2015.

Instance Search in Video
V.-T. Nguyen, Dinh-Le, D., Salvador, A., Zhu, C., Nguyen, D.-L., Tran, M.-T., Duc, T. Ngo, Duong, D. Anh, Satoh, S., and Giró-i-Nieto, X., “NII-HITACHI-UIT
at TRECVID 2015 Instance Search”, in TRECVID 2015 Workshop, Gaithersburg, MD, USA, 2015.
K. McGuinness, Mohedano, E., Salvador, A., Zhang, Z. X., Marsden, M., Wang, P., Jargalsaikhan, I., Antony, J., Giró-i-Nieto, X., Satoh, S., O'Connor, N., and
Smeaton, A. F., “Insight DCU at TRECVID 2015”, in TRECVID 2015 Workshop, Gaithersburg, MD, USA, 2015.
...

Thank you!
Slides available online.
https://imatge.upc.edu/web/people/xavier-giro
http://bitsearch.blogspot.com
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu