This document summarizes an online course on deep learning from videos. It outlines the course content, which includes modules on unsupervised learning, predictive learning, self-supervised learning, and cross-modal learning. It provides examples of papers and techniques for each module, with a focus on using deep learning to learn visual and audio representations from unlabeled video data in an unsupervised or self-supervised manner.
9. 9
Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised
perceptual grouping." NIPS 2016 [video] [code]
10. Unsupervised Learning
10
Why Unsupervised Learning?
● It is the nature of how intelligent beings perceive
the world.
● It can save a huge amount of effort when building a
human-like intelligent agent, compared to a
fully supervised approach.
● Vast amounts of unlabelled data are available.
11. Max margin decision boundary
11
How data distribution P(x) influences decisions (1D)
Slide credit: Kevin McGuinness (DLCV UPC 2017)
13. 13
How data distribution P(x) influences decisions (2D)
Slide credit: Kevin McGuinness (DLCV UPC 2017)
14. 14
How data distribution P(x) influences decisions (2D)
Slide credit: Kevin McGuinness (DLCV UPC 2017)
Red / Blue classes are not
linearly separable in this
2D space :(
15
How clustering is valuable for linear classifiers
Slide credit: Kevin McGuinness (DLCV UPC 2017)
Each data point is assigned to one of Cluster 1, Cluster 2, Cluster 3 or Cluster 4,
giving a 4D BoW representation (“histogram”) per sample.
Red / Blue classes are now
separable with a linear
classifier in this 4D space :) 16
How clustering is valuable for linear classifiers
Slide credit: Kevin McGuinness (DLCV UPC 2017)
17. Example from Kevin McGuinness' Jupyter notebook.
17
How clustering is valuable for linear classifiers
22. 22
Autoencoder (AE)
“Deep Learning Tutorial”, Dept. Computer Science, Stanford
Autoencoders:
● Predict the same input data at the output.
● Do not need labels (see the sketch below).
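A minimal sketch of this idea in PyTorch (layer sizes, optimizer settings and the random batch are illustrative assumptions, not from the slides): the network is trained to reproduce its own input, so no labels are required.

```python
import torch
import torch.nn as nn

# Minimal autoencoder: the reconstruction target is the input itself,
# so no labels are needed (hypothetical sizes chosen for illustration).
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, hidden_dim))
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # latent code, usable later as a feature
        return self.decoder(z)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)              # a batch of unlabeled samples
loss = nn.functional.mse_loss(model(x), x)   # reconstruct the input
loss.backward(); opt.step()
```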
23. 23
Autoencoder (AE)
“Deep Learning Tutorial”, Dept. Computer Science, Stanford
Dimensionality reduction:
● Use hidden layer as
a feature extractor of
any desired size.
Application #1
25. 25
Autoencoder (AE) Application #2
Pretraining:
1. Initialize a NN by solving an autoencoding
problem.
2. Train for the final task with “few” labels (a minimal sketch follows the figure).
Figure: Kevin McGuinness (DLCV UPC 2017)
Diagram: Encoder (W1) → latent variables (representation/features) → Classifier (WC) → prediction y → Loss (cross entropy).
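A minimal sketch of this pretraining recipe (encoder shape, class count and the tiny labeled batch are illustrative assumptions): the encoder weights come from the unsupervised autoencoding step, and only a small labeled set is used to fit the classifier head.

```python
import torch
import torch.nn as nn

# Step 1 (done beforehand): train an encoder inside an autoencoder on unlabeled data.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
# encoder.load_state_dict(...)  # hypothetically restored from the unsupervised run

# Step 2: put a classifier head on top and fine-tune with "few" labeled examples.
classifier = nn.Sequential(encoder, nn.Linear(64, 10))   # 10 hypothetical classes
opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)

x_small = torch.rand(16, 784)                 # small labeled batch
y_small = torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(classifier(x_small), y_small)
loss.backward(); opt.step()
```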
29. Frame Reconstruction & Prediction
29
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Unsupervised feature learning (no labels) for...
30. Frame Reconstruction & Prediction
30
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Unsupervised feature learning (no labels) for...
...frame prediction.
31. Frame Reconstruction & Prediction
31
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Unsupervised feature learning (no labels) for...
...frame prediction.
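The model in this line of work is an LSTM encoder-decoder trained on unlabeled clips to reconstruct the observed frames and/or predict future ones. A minimal future-frame-prediction sketch in PyTorch (flattened frames, feature sizes and sequence lengths are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Minimal LSTM encoder-decoder for future-frame prediction.
# Frames are flattened to vectors here purely for illustration.
class FramePredictor(nn.Module):
    def __init__(self, frame_dim=1024, hidden=256, pred_steps=5):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, frame_dim)
        self.pred_steps = pred_steps

    def forward(self, past_frames):
        _, state = self.encoder(past_frames)      # summarize the observed clip
        inp = past_frames[:, -1:, :]              # start decoding from the last seen frame
        preds = []
        for _ in range(self.pred_steps):
            out, state = self.decoder(inp, state)
            nxt = self.readout(out)
            preds.append(nxt)
            inp = nxt                             # feed the prediction back in
        return torch.cat(preds, dim=1)

model = FramePredictor()
past = torch.rand(8, 10, 1024)                    # 8 clips, 10 observed frames each
future = torch.rand(8, 5, 1024)                   # ground-truth future frames
loss = nn.functional.mse_loss(model(past), future)
```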
32. 32
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Features learned without supervision (on lots of data) are
fine-tuned for activity recognition (with little data).
Frame Reconstruction & Prediction
33. Frame Prediction
33
Ranzato, MarcAurelio, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. "Video (language)
modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
34. 34
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video frame prediction with a ConvNet.
Frame Prediction
35. 35
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from MSE are improved with a multi-scale architecture,
adversarial learning and an image gradient difference loss function.
Frame Prediction
36. 36
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from MSE are improved with a multi-scale architecture,
adversarial training and an image gradient difference loss (GDL).
Frame Prediction
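A minimal sketch of an image gradient difference loss (GDL) term in PyTorch: it penalizes mismatch between the spatial gradients of the predicted and target frames, which sharpens edges compared to MSE alone. The exponent, weighting and random tensors are illustrative assumptions; the full method also adds a multi-scale generator and an adversarial term.

```python
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """Penalize mismatch between spatial gradients of prediction and target.

    pred, target: tensors of shape (batch, channels, height, width).
    """
    # Horizontal and vertical finite differences.
    pred_dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    pred_dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    tgt_dx = (target[..., :, 1:] - target[..., :, :-1]).abs()
    tgt_dy = (target[..., 1:, :] - target[..., :-1, :]).abs()
    return ((pred_dx - tgt_dx).abs() ** alpha).mean() + \
           ((pred_dy - tgt_dy).abs() ** alpha).mean()

pred = torch.rand(4, 3, 64, 64)
target = torch.rand(4, 3, 64, 64)
loss = torch.nn.functional.mse_loss(pred, target) + gradient_difference_loss(pred, target)
```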
37. 37
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Frame Prediction
38. 38
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
Frame Prediction
39. 39
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
40. 40
Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future frame
synthesis via cross convolutional networks." NIPS 2016 [video]
41. 41
Frame Prediction
Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future
frame synthesis via cross convolutional networks." NIPS 2016 [video]
Given an input image, probabilistic generation of future frames with a Variational
AutoEncoder (VAE).
42. 42
Frame Prediction
Xue, Tianfan, Jiajun Wu, Katherine Bouman, and Bill Freeman. "Visual dynamics: Probabilistic future
frame synthesis via cross convolutional networks." NIPS 2016 [video]
Encodes the image as feature maps and the motion as cross-convolutional kernels.
44. First steps in video feature learning
44
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant
spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
45. Temporal Weak Labels
45
Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning
of spatiotemporally coherent metrics." ICCV 2015.
Assumption: adjacent video frames contain semantically similar information.
Autoencoder trained with slowness and sparsity regularizations (a loss sketch follows this slide).
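A minimal sketch of what such a temporally regularized autoencoder objective could look like (the linear encoder/decoder, ℓ1 penalties and weights are illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoder/decoder on flattened frames (sizes are illustrative).
encoder = nn.Linear(1024, 64)
decoder = nn.Linear(64, 1024)

def temporal_ae_loss(frame_t, frame_t1, slowness_w=0.1, sparsity_w=0.01):
    """Reconstruction + slowness + sparsity on the codes of adjacent frames."""
    z_t, z_t1 = encoder(frame_t), encoder(frame_t1)
    recon = F.mse_loss(decoder(z_t), frame_t) + F.mse_loss(decoder(z_t1), frame_t1)
    slowness = (z_t - z_t1).abs().mean()      # adjacent frames should map to similar codes
    sparsity = z_t.abs().mean() + z_t1.abs().mean()
    return recon + slowness_w * slowness + sparsity_w * sparsity

loss = temporal_ae_loss(torch.rand(8, 1024), torch.rand(8, 1024))
```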
46. 46
Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal
coherence in video." CVPR 2016. [video]
Temporal Weak Labels
47. 47
(Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code]
Temporal order of frames is
exploited as the supervisory
signal for learning.
Temporal Weak Labels
48. 48
(Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code]
Take temporal order as the supervisory signal for learning:
shuffled frame sequences are fed to a binary classifier
that predicts whether they are in order or not in order
(a sketch follows this slide).
Temporal Weak Labels
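A minimal sketch of the order-verification idea (the tiny per-frame backbone, feature sizes and triple-frame input are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

# Order verification: embed each frame with a shared backbone, concatenate,
# and classify whether the frames are in temporal order or shuffled.
class OrderVerifier(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for a ConvNet applied per frame
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.head = nn.Linear(3 * feat_dim, 2)   # in order / not in order

    def forward(self, f1, f2, f3):
        feats = torch.cat([self.backbone(f) for f in (f1, f2, f3)], dim=1)
        return self.head(feats)

frames = [torch.rand(4, 3, 64, 64) for _ in range(3)]
labels = torch.randint(0, 2, (4,))               # 1 = correct order, 0 = shuffled
logits = OrderVerifier()(*frames)
loss = nn.functional.cross_entropy(logits, labels)
```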
49. 49
Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video representation learning with
odd-one-out networks." CVPR 2017
Temporal Weak Labels
Train a network to detect which of the video sequences contains frames in the wrong order.
50. 50
Spatio-Temporal Weak Labels
X. Lin, Campos, V., Giró-i-Nieto, X., Torres, J., and Canton-Ferrer, C., “Disentangling Motion, Foreground
and Background Features in Videos”, in CVPR 2017 Workshop Brave New Motion Representations
Architecture figure: a C3D encoder disentangles foreground, background and motion features. Fg and Bg decoders (kernels share weights) reconstruct the foreground in the first and last frames and the background in the first frame; uNLC masks supervise the foreground and gradients are blocked between branches.
51. 51
Spatio-Temporal Weak Labels
Pathak, Deepak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. "Learning features by watching
objects move." CVPR 2017
52. 52
Greff, Klaus, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. "Tagger: Deep unsupervised
perceptual grouping." NIPS 2016 [video] [code]
Spatio-Temporal Weak Labels
58. 58
Visual Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
59. 59
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Visual Feature Learning
60. 60
Task: Use the predicted audio stats to cluster images. Audio clusters are built with
k-means over the training set (see the sketch after this slide).
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Average stats. Cluster assignments at test time (one row = one cluster).
Visual Feature Learning
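A minimal sketch of this supervision pipeline (the audio-statistic dimensionality, number of clusters, tiny backbone and random data are illustrative assumptions): cluster the training set's audio statistics with k-means, then train an image CNN to predict each frame's audio cluster.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# 1) Cluster the audio statistics of the training videos (no labels involved).
audio_stats = np.random.rand(10000, 42).astype(np.float32)   # one stats vector per frame
kmeans = KMeans(n_clusters=30, n_init=10).fit(audio_stats)
targets = torch.from_numpy(kmeans.labels_).long()

# 2) Train an image CNN to predict the audio cluster of each frame.
cnn = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 30))
frames = torch.rand(16, 3, 64, 64)
loss = nn.functional.cross_entropy(cnn(frames), targets[:16])
```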
61. 61
Although the CNN was not trained with class labels, local units with semantic meaning
emerge.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Visual Feature Learning
63. 63
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016.
64. 64
Audio Feature Learning: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Pretrained visual ConvNets supervise the training of a model for sound representation
65. 65
Videos used for training are unlabeled; the method relies on ConvNets trained on labeled images.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
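SoundNet's supervision is a teacher-student setup: a pretrained visual ConvNet (teacher) produces a class distribution for each video frame, and the raw-waveform network (student) is trained to match it with a KL-divergence loss. A minimal sketch (the student architecture, sample rate and the stand-in teacher output are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Student: 1D ConvNet over the raw waveform predicting the teacher's class distribution.
student = nn.Sequential(nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                        nn.Linear(16, 1000))          # e.g. 1000 ImageNet-style classes

waveform = torch.rand(8, 1, 22050)                    # 8 one-second clips (hypothetical rate)
with torch.no_grad():
    teacher_probs = torch.softmax(torch.rand(8, 1000), dim=1)  # stand-in for a visual CNN's output

log_probs = F.log_softmax(student(waveform), dim=1)
loss = F.kl_div(log_probs, teacher_probs, reduction='batchmean')
```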
66. 66
Hidden layers of SoundNet are used to train a standard SVM
classifier that outperforms the state of the art (see the sketch after this slide).
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
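A minimal sketch of that evaluation protocol (the feature dimension, class count and random data are placeholders): extract activations from an intermediate SoundNet layer and fit a linear SVM on them.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Pretend these are pooled activations from a hidden SoundNet layer (placeholder data).
features = np.random.rand(2000, 1024)
labels = np.random.randint(0, 50, size=2000)     # e.g. 50 acoustic-scene classes

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2)
svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
print("accuracy:", svm.score(X_te, y_te))
```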
67. 67
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
68. 68
Visualization of the 1D filters over raw audio in conv1.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
69. 69
Visualize the samples that most activate a neuron in a late layer (conv7).
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
70. 70
Visualization of the video frames associated to the sounds that activate some of the
last hidden units (conv7):
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
71. 71
Hearing sounds that most activate a neuron in the sound network (conv7)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
72. 72
Hearing sounds that most activate a neuron in the sound network (conv5)
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016
Audio Feature Learning: SoundNet
74. 74
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio and visual features learned by assessing correspondence.
Audio & Visual Feature Learning
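The supervision here is a binary audio-visual correspondence task: do this frame and this audio clip come from the same video? A minimal two-stream sketch (the tiny backbones, fusion head and random data are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.audio = nn.Sequential(nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
                                   nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, dim))
        self.head = nn.Linear(2 * dim, 2)         # correspond / do not correspond

    def forward(self, frame, wave):
        return self.head(torch.cat([self.vision(frame), self.audio(wave)], dim=1))

frame = torch.rand(8, 3, 64, 64)
wave = torch.rand(8, 1, 22050)
labels = torch.randint(0, 2, (8,))                # 1 = same video, 0 = mismatched pair
loss = nn.functional.cross_entropy(AVCorrespondence()(frame, wave), labels)
```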
75. 75
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
Audio & Visual Feature Learning
76. 76
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
Audio & Visual Feature Learning
77. 77
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
79. 79
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
80. 80
Cross-modal Retrieval
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
81. 81
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Not end-to-end
Cross-modal Retrieval
82. 82
The Greatest Hits Dataset
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
Cross-modal Retrieval
83. 83
[Paper draft]
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
84. 84
Video sonorization: the visual feature of the query video is compared against audio features to retrieve the best match.
Cross-modal Retrieval
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
85. 85
Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal
Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
Audio coloring: the audio feature of the query is compared against visual features to retrieve the best match.
Cross-modal Retrieval
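A minimal sketch of retrieval in a joint embedding (the embedding size, gallery and cosine similarity are illustrative assumptions): both modalities are projected into a shared space and the best match is the nearest neighbour.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings already projected into a shared 256-D space.
audio_gallery = F.normalize(torch.rand(1000, 256), dim=1)   # candidate audio tracks
video_query = F.normalize(torch.rand(1, 256), dim=1)        # query video

# Video sonorization: pick the audio whose embedding is closest by cosine similarity.
scores = video_query @ audio_gallery.t()                     # (1, 1000) cosine similarities
best_audio = scores.argmax(dim=1)
print("best matching audio index:", best_audio.item())

# Audio coloring is symmetric: swap the roles of the query and the gallery.
```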
86. 86
Hong, Sungeun, Woobin Im, and Hyun S. Yang. "Deep Learning for Content-Based, Cross-Modal
Retrieval of Videos and Music." (submitted)
87. 87
Cross-modal Retrieval
Hong, Sungeun, Woobin Im, and Hyun S. Yang. "Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and
Music." 2017 (submitted)
91. 91
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
92. 92
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
Speech Generation from Video
Pipeline: a frame from a silent video is encoded by a CNN (VGG), which predicts an audio feature; the waveform is then produced by post-hoc synthesis.
93. 93
Speech Generation from Video
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
95. 95
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
96. 96
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
Speech to Video Synthesis (mouth)
97. 97
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
98. 98
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
Speech to Video Synthesis (mouth)
99. 99
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
100. 100
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
Speech to Video Synthesis (pose & emotion)
101. 101
L. Chen, S. Srivastava, Z. Duan and C. Xu. Deep Cross-Modal Audio-Visual Generation. ACM
Multimedia Thematic Workshops 2017.
Audio & Visual Generation