Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Learning without
Annotations
Day 4 Lecture 2
#DLUPC
http://bit.ly/dlai2018
2
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
3
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
4
Acknowledgements
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
5
Video lectures
Kevin McGuinness, DLCV 2016 Xavier Giró, DLAI 2017
Types of machine learning
Yann Lecun’s Black Forest cake
6
7
Types of machine learning
We can categorize three types of learning procedures:
1. Supervised Learning: 𝐲 = ƒ(𝐱). Predict the label y corresponding to observation x.
2. Unsupervised Learning: ƒ(𝐱). Estimate the distribution of observation x.
3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱). Predict the action y based on observation x, to maximize a future reward z.
8
Types of machine learning
We can categorize three types of learning procedures:
1. Supervised Learning:
𝐲 = ƒ(𝐱)
2. Unsupervised Learning:
ƒ(𝐱)
3. Reinforcement Learning (RL):
𝐲 = ƒ(𝐱)
𝐳
9
Why Unsupervised Learning ?
● It is the way intelligent beings perceive the world.
● It can save enormous effort in building a human-like intelligent agent, compared to a fully supervised approach.
● Vast amounts of unlabelled data are available.
10
eg. Bayesian Classifier
● P(Y|X) depends on P(X|Y) and P(X)
● Knowledge of distribution P(X) can help to predict P(Y|X)
● Good model of P(X) must have Y as an implicit latent variable
Bayes rule
X: Data
Y: Labels
Slide: Kevin McGuinness (DLCV UPC 2017)
Why Unsupervised Learning ?
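For reference, the Bayes rule referred to on this slide, written out (X: data, Y: labels):

    P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}

The posterior P(Y|X) is determined by the likelihood P(X|Y), the prior P(Y) and the evidence P(X), which is why a good model of P(X) learned from unlabelled data can help predict P(Y|X).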
11
eg. Clustering before linear classification
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: points in the (x1, x2) plane grouped into Clusters 1-4; each point is re-encoded as a 4D Bag of Words histogram over the clusters, making the classes separable with a linear classifier in the 4D space!]
Why Unsupervised Learning ?
12
Example from Kevin McGuinness' Jupyter notebook.
Why Unsupervised Learning ?
eg. Clustering before linear classification
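Below, a minimal scikit-learn sketch of this idea; the toy data, the number of clusters and the tiny labelled subset are illustrative choices, not taken from the original notebook.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# Toy 2D data with 4 natural groups; pretend only a few points are labelled.
X, y = make_blobs(n_samples=400, centers=4, random_state=0)

# Unsupervised step: learn 4 cluster centers from all points (no labels needed).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Represent each point by its hard cluster assignment: a 4D bag-of-words vector.
bow = np.eye(4)[kmeans.predict(X)]

# Supervised step: a linear classifier on the 4D representation, trained with few labels.
labelled = np.arange(0, 400, 20)          # pretend only these 20 points have labels
clf = LogisticRegression().fit(bow[labelled], y[labelled])
print("accuracy on all points:", clf.score(bow, y))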
13
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
To model P(X) given data, it is necessary to make some assumptions
“You can’t do inference without making assumptions”
-- David MacKay, Information Theory, Inference, and Learning Algorithms
14
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
To model P(X) given data, it is necessary to make some assumptions
“You can’t do inference without making assumptions”
-- David MacKay, Information Theory, Inference, and Learning Algorithms
Typical assumptions:
● Smoothness assumption
○ Points which are close to each other are more likely to share a label.
● Cluster assumption
○ The data form discrete clusters; points in the same cluster are likely to share a label
● Manifold assumption
○ The data lie approximately on a manifold of much lower dimension than the input space.
15
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
Smoothness assumption
● Label propagation
○ Recursively propagate labels to nearby
points
○ Problem: in high-D, your nearest neighbour
may be very far away!
Cluster assumption
● Bag of words models
○ K-means, etc.
○ Represent points by cluster centers
○ Soft assignment
○ VLAD
● Gaussian mixture models
○ Fisher vectors
Manifold assumption
● Linear manifolds
○ PCA
○ Linear autoencoders
○ Random projections
○ ICA
● Non-linear manifolds:
○ Non-linear autoencoders
○ Deep autoencoders
○ Restricted Boltzmann machines
○ Deep belief nets
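Below, a minimal sketch of the smoothness assumption in action: labels spread to nearby points through a k-nearest-neighbour graph with scikit-learn's LabelPropagation. The two-moons data and the parameters are illustrative.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Two interleaving half-moons; only 3 points per class keep their label.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
y_partial = np.full_like(y, -1)                     # -1 marks "unlabelled"
y_partial[np.where(y == 0)[0][:3]] = 0
y_partial[np.where(y == 1)[0][:3]] = 1

# Recursively propagate labels to nearby points over a k-NN graph.
model = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("accuracy after propagation:", (model.transduction_ == y).mean())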
16
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
Smoothness assumption
● Label propagation
○ Recursively propagate labels to nearby
points
○ Problem: in high-D, your nearest neighbour
may be very far away!
Cluster assumption
● Bag of words models
○ K-means, etc.
○ Represent points by cluster centers
○ Soft assignment
○ VLAD
● Gaussian mixture models
○ Fisher vectors
Manifold assumption
● Linear manifolds
○ PCA
○ Linear autoencoders
○ Random projections
○ ICA
● Non-linear manifolds:
○ Non-linear autoencoders
○ Deep autoencoders
○ Restricted Boltzmann machines
○ Deep belief nets
17
The manifold hypothesis
Slide: Kevin McGuinness (DLCV UPC 2017)
The data distribution lies close to a low-dimensional manifold.
Example: consider image data
● Very high dimensional (1,000,000D)
● A randomly generated image will almost certainly not
look like any real world scene
○ The space of images that occur in nature is
almost completely empty
● Hypothesis: real world images lie on a smooth,
low-dimensional manifold
○ Manifold distance is a good measure of
similarity
Similar for audio and text
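Below, a minimal sketch of the linear-manifold case: synthetic high-dimensional data that actually lies near a 3D subspace, recovered by PCA. Dimensions and noise level are illustrative.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))                    # true low-dimensional coordinates
mixing = rng.normal(size=(3, 100))                     # linear embedding into 100 dimensions
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 100))   # plus a little noise

pca = PCA(n_components=10).fit(X)
# Nearly all variance is captured by the first 3 components.
print(pca.explained_variance_ratio_.round(3))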
18
The manifold hypothesis
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: a linear manifold wᵀx + b versus a non-linear manifold in the (x1, x2) plane.]
19
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
Smoothness assumption
● Label propagation
○ Recursively propagate labels to nearby
points
○ Problem: in high-D, your nearest neighbour
may be very far away!
Cluster assumption
● Bag of words models
○ K-means, etc.
○ Represent points by cluster centers
○ Soft assignment
○ VLAD
● Gaussian mixture models
○ Fisher vectors
Manifold assumption
● Linear manifolds
○ PCA
○ Linear autoencoders
○ Random projections
○ ICA
● Non-linear manifolds:
○ Non-linear autoencoders
○ Deep autoencoders
○ Restricted Boltzmann machines
○ Deep belief nets
20
F. Van Veen, “The Neural Network Zoo” (2016)
Autoencoder (AE)
21
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Autoencoders:
● Predict at the output the
same input data.
● Do not need labels.
22
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Dimensionality reduction:
Use the hidden layer as a
feature extractor of any
desired size.
23
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: Encoder (W1) maps the data to latent variables h (representation/features); Decoder (W2) maps h back to a reconstruction; the loss is the reconstruction error.]
Pretraining:
1. Initialize a NN by solving an autoencoding
problem.
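A minimal PyTorch sketch of this pretraining step; layer sizes, the MSE reconstruction loss and the optimizer are illustrative choices, not the exact setup of the slide.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))   # W1
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))   # W2
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, 784)                   # stand-in batch (e.g. flattened 28x28 images)
h = encoder(x)                            # latent variables (representation / features)
x_hat = decoder(h)                        # reconstruction
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error: no labels needed
loss.backward()
optimizer.step()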
24
Autoencoder (AE)
Slide: Kevin McGuinness (DLCV UPC 2017)
[Figure: Encoder (W1) maps the data to latent variables h (representation/features); a Classifier (WC) maps h to a prediction y; the loss is the cross entropy.]
Pretraining:
1. Initialize a NN by solving an autoencoding problem.
2. Train for final task with “few” labels.
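Continuing the hypothetical sketch above (it reuses torch, nn and the pretrained encoder): the decoder is discarded and a classifier head is trained on the latent features with the few available labels.

classifier = nn.Linear(32, 10)                       # WC: from latent h to 10 classes
params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)        # fine-tune encoder + head jointly

x_labelled = torch.rand(16, 784)                     # the "few" labelled examples
y_labelled = torch.randint(0, 10, (16,))
logits = classifier(encoder(x_labelled))
loss = nn.functional.cross_entropy(logits, y_labelled)
loss.backward()
optimizer.step()

If labels are extremely scarce, the encoder can instead be frozen and only the head trained.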
25
Autoencoder (AE)
Deep Learning TV, “Autoencoders - Ep. 10”
26
F. Van Veen, “The Neural Network Zoo” (2016)
Restricted Boltzmann Machine (RBM)
RBMs are a specific type of autoencoder.
27
Restricted Boltzmann Machine (RBM)
Figure: Geoffrey Hinton (2013)
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for
collaborative filtering." ICML 2007.
28
Restricted Boltzmann Machine (RBM)
DeepLearning4j, “A Beginner’s Tutorial for Restricted Boltzmann Machines”.
Forward
pass
29
Restricted Boltzmann Machine (RBM)
DeepLearning4j, “A Beginner’s Tutorial for Restricted Boltzmann Machines”.
Backward
pass
30
Restricted Boltzmann Machine (RBM)
DeepLearning4j, “A Beginner’s Tutorial for Restricted Boltzmann Machines”.
Backward
pass
31
Restricted Boltzmann Machine (RBM)
DeepLearning4j, “A Beginner’s Tutorial for Restricted Boltzmann Machines”.
Backward
pass
The reconstructed values at the visible layer are compared with the actual inputs using the KL divergence.
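Below, a minimal NumPy sketch of the forward and backward passes described on these slides, using one step of contrastive divergence (CD-1), the usual training rule for RBMs; the tutorial's KL-divergence comparison is replaced here by the standard CD update, and all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 128, 0.01
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)                         # visible biases
b_h = np.zeros(n_hidden)                          # hidden biases
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

v0 = (rng.random((64, n_visible)) < 0.5).astype(float)   # stand-in binary batch

# Forward pass: visible units -> hidden unit probabilities, then sample hidden states.
p_h0 = sigmoid(v0 @ W + b_h)
h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

# Backward pass: hidden units -> reconstructed visible units.
v1 = sigmoid(h0 @ W.T + b_v)

# Forward pass again, on the reconstruction.
p_h1 = sigmoid(v1 @ W + b_h)

# CD-1 update: move reconstruction statistics towards data statistics.
W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)
b_v += lr * (v0 - v1).mean(axis=0)
b_h += lr * (p_h0 - p_h1).mean(axis=0)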
32
F. Van Veen, “The Neural Network Zoo” (2016)
Deep Belief Networks (DBN)
33
Deep Belief Networks (DBN)
Architecture like an MLP; trained as a stack of RBMs.
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
34
Deep Belief Networks (DBN)
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
35
Deep Belief Networks (DBN)
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
36
Deep Belief Networks (DBN)
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
37
Deep Belief Networks (DBN)
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
38
Deep Belief Networks (DBN) + Softmax
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief
nets." Neural computation 18, no. 7 (2006): 1527-1554.
[Figure: a Softmax layer is added on top of the DBN for the supervised learning stage.]
After the DBN is trained, it can be fine-tuned with a small number of labels to solve a supervised task with superior performance.
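Below, a minimal sketch of the "stack of RBMs + softmax" recipe using scikit-learn's BernoulliRBM: greedy layer-wise pretraining, then a softmax (logistic regression) head trained with few labels. Note this only trains the head on top of frozen RBM features; the full DBN recipe would also fine-tune the stack with backpropagation. Data and sizes are stand-ins.

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = (rng.random((1000, 256)) < 0.3).astype(float)      # stand-in binary data
y = rng.integers(0, 10, size=1000)                     # labels (only a few are used)

# Greedy layer-wise pretraining: each RBM is trained on the previous layer's features.
rbm1 = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=5, random_state=0)
h1 = rbm1.fit_transform(X)
rbm2 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=5, random_state=0)
h2 = rbm2.fit_transform(h1)

# Supervised stage: softmax classifier on top, trained with "few" labels.
few = np.arange(100)
clf = LogisticRegression(max_iter=1000).fit(h2[few], y[few])
print("accuracy:", clf.score(h2, y))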
39
Deep Belief Networks (DBN)
Deep Learning TV, “Deep Belief Nets”
40
Deep Belief Networks (DBN) + Softmax
Geoffrey Hinton, "Introduction to Deep Learning & Deep Belief Nets” (2012)
Geoffrey Hinton, “Tutorial on Deep Belief Networks”. NIPS 2007.
41
Geoffrey Hinton
42
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
43
Slide: Kevin McGuinness (DLCV UPC 2017)
Why Semi-supervised Learning ?
Max margin decision boundary
44
Slide: Kevin McGuinness (DLCV UPC 2017)
Why Semi-supervised Learning ?
Semi-supervised decision boundary
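Below, a minimal scikit-learn sketch of the contrast on these two slides: a classifier fit on the labelled points only, versus one that also exploits the unlabelled points so the boundary settles in a low-density region. Toy data and parameters are illustrative.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
labelled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])

# Supervised baseline: max-margin boundary from the 10 labelled points only.
svm = SVC(kernel="rbf").fit(X[labelled], y[labelled])

# Semi-supervised: unlabelled points (marked -1) also shape the boundary.
y_partial = np.full_like(y, -1)
y_partial[labelled] = y[labelled]
semi = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)

print("supervised accuracy:     ", svm.score(X, y))
print("semi-supervised accuracy:", (semi.transduction_ == y).mean())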
45
Slide: Kevin McGuinness (DLCV UPC 2017)
Why Semi-supervised Learning ?
46
Slide: Kevin McGuinness (DLCV UPC 2017)
Why Semi-supervised Learning ?
47
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
a. Predictive Learning
b. Learning by verification
c. Cross-modal Learning
48
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
a. Predictive Learning
b. Learning by verification
c. Cross-modal Learning
49
Acknowledgements
Víctor Campos Junting Pan Xunyu Lin
50
Predictive Learning
Slide credit:
Yann LeCun
51
Predictive Learning
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Learning video representations (features) by...
52
Predictive Learning
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
(1) frame reconstruction (AE):
Learning video representations (features) by...
53
Predictive Learning
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Learning video representations (features) by...
(2) frame prediction
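Below, a minimal PyTorch sketch of the frame-prediction idea: an LSTM reads a few frames and is trained to predict the next one, so the video itself provides the supervision. This is a simplified stand-in, not the encoder-decoder LSTM of the paper.

import torch
import torch.nn as nn

frame_dim, hidden = 32 * 32, 256
lstm = nn.LSTM(input_size=frame_dim, hidden_size=hidden, batch_first=True)
to_frame = nn.Linear(hidden, frame_dim)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(to_frame.parameters()), lr=1e-3)

video = torch.rand(8, 11, frame_dim)            # 8 clips of 11 flattened frames each
inputs, target = video[:, :-1], video[:, -1]    # read 10 frames, predict the 11th

out, _ = lstm(inputs)                            # hidden states summarize the clip so far
pred = to_frame(out[:, -1])                      # predicted next frame
loss = nn.functional.mse_loss(pred, target)      # self-supervised: the video is its own label
loss.backward()
optimizer.step()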
54
Predictive Learning
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Features learned without supervision (on lots of data) are fine-tuned for activity recognition (with little labelled data).
55
Predictive Learning
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
Video frame prediction with a ConvNet.
56
Predictive Learning
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from MSE are improved with a multi-scale architecture, adversarial learning and an image gradient difference loss function.
57
Predictive Learning
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
The blurry predictions from MSE (l1) are improved with a multi-scale architecture, adversarial training and an image gradient difference loss (GDL) function.
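Below, a minimal PyTorch sketch of the image gradient difference loss (GDL) term: it penalizes mismatches between the spatial gradients of the predicted and true frames, which counteracts the blurring produced by a pure MSE loss. Shapes and the exponent (alpha = 1) are illustrative.

import torch

def gradient_difference_loss(pred, target):
    # pred, target: (batch, channels, height, width)
    dy_pred = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    dy_true = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    dx_pred = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    dx_true = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    # Penalize differences between gradient magnitudes (alpha = 1).
    return (dy_pred - dy_true).abs().mean() + (dx_pred - dx_true).abs().mean()

pred = torch.rand(4, 3, 64, 64, requires_grad=True)
target = torch.rand(4, 3, 64, 64)
loss = torch.nn.functional.mse_loss(pred, target) + gradient_difference_loss(pred, target)
loss.backward()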
58
Predictive Learning
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error."
ICLR 2016 [project] [code]
59
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
a. Predictive Learning
b. Learning by verification
c. Cross-modal Learning
60
Spatiotemporal coherence
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant
spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
61
Assumption: adjacent video frames contain semantically similar information.
An autoencoder is trained with slowness and sparsity regularization.
Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of
spatiotemporally coherent metrics." ICCV 2015.
Temporal coherence
62
Temporal coherence
Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal
coherence in video." CVPR 2016. [video]
Slow feature analysis
● Temporal coherence assumption: features
should change slowly over time in video
Steady feature analysis
● Second order changes also small: changes in
the past should resemble changes in the future
Train on triplets of frames from video. The loss encourages nearby frames to have slow and steady features, and far-apart frames to have different features.
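Below, a minimal PyTorch sketch of the slow and steady losses on a triplet of frames (t1 < t2 < t3): nearby frames should have similar features (slow), and the change from t1 to t2 should resemble the change from t2 to t3 (steady). The tiny encoder is an illustrative stand-in, and the paper's contrastive term for far-apart frames is only hinted at in a comment.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128), nn.ReLU(), nn.Linear(128, 64))

f1, f2, f3 = (torch.rand(16, 1, 32, 32) for _ in range(3))   # a triplet of nearby frames
z1, z2, z3 = encoder(f1), encoder(f2), encoder(f3)

# Slowness: adjacent frames should have similar features.
slow_loss = (z2 - z1).pow(2).sum(dim=1).mean() + (z3 - z2).pow(2).sum(dim=1).mean()
# Steadiness: the feature-space "velocity" should itself change slowly.
steady_loss = ((z3 - z2) - (z2 - z1)).pow(2).sum(dim=1).mean()
loss = slow_loss + steady_loss   # the full method also pushes far-apart frames to differ
loss.backward()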
63
Temporal coherence
(Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code]
Temporal order of frames is
exploited as the supervisory
signal for learning.
64
Temporal coherence
(Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code]
Temporal order is taken as the supervisory signal for learning: shuffled and in-order frame sequences are fed to a binary classifier that predicts whether each sequence is in order or not.
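Below, a minimal PyTorch sketch of this verification setup: sample frame tuples from a video, keep some in temporal order and shuffle the others, and train a binary classifier to tell them apart. The tiny network, frame spacing and data are illustrative stand-ins.

import torch
import torch.nn as nn

frame_dim = 32 * 32
net = nn.Sequential(nn.Linear(3 * frame_dim, 256), nn.ReLU(), nn.Linear(256, 1))

video = torch.rand(100, frame_dim)                  # 100 flattened frames of one video

def sample_tuple(in_order: bool):
    t = torch.randint(0, 100 - 10, (1,)).item()
    idx = [t, t + 5, t + 10] if in_order else [t + 5, t, t + 10]   # shuffled = wrong order
    return torch.cat([video[i] for i in idx])

x = torch.stack([sample_tuple(True) for _ in range(8)] +
                [sample_tuple(False) for _ in range(8)])
y = torch.cat([torch.ones(8), torch.zeros(8)])      # 1 = in order, 0 = not in order

loss = nn.functional.binary_cross_entropy_with_logits(net(x).squeeze(1), y)
loss.backward()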
65
Temporal coherence
Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video representation learning with
odd-one-out networks." CVPR 2017
Train a network to detect which of the video sequences contains frames in the wrong order.
66
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
a. Predictive Learning
b. Learning by verification
c. Cross-modal Learning
67
Cross-modal Learning
[Figure: Encoder → representation or embedding → Decoder]
68
Cross-modal Learning
Ephrat et al. Vid2speech: Speech Reconstruction from Silent Video. ICASSP 2017
[Figure: a frame from a silent video is passed through a CNN (VGG) to predict an audio feature, followed by post-hoc synthesis.]
69
Ephrat, Ariel, Tavi Halperin, and Shmuel Peleg. "Improved speech reconstruction from silent video." In
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. 2017.
70
Cross-modal Learning
[Figure: Encoder → representation or embedding → Decoder]
71
Cross-modal Learning
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
72
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
73
Cross-modal Learning
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
74
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
75
Cross-modal Learning
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation
by joint end-to-end learning of pose and emotion." SIGGRAPH 2017
76
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
77
Cross-modal Learning
[Figure: an Encoder trained with labels produced by an Annotator.]
78
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled
video." NIPS 2016.
79
Cross-modal Learning
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from
unlabeled video." NIPS 2016.
Object and scene recognition in videos by analysing only the audio track.
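Below, a minimal PyTorch sketch of the SoundNet-style transfer: a pretrained vision network acts as an annotator, producing class probabilities for the video frames, and the audio network is trained to match those distributions with a KL-divergence loss, so no human labels are needed. Both networks here are tiny illustrative stand-ins for the real teacher (an ImageNet/Places CNN) and the 1D audio ConvNet.

import torch
import torch.nn as nn
import torch.nn.functional as F

vision_teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))   # stand-in, pretrained & frozen
audio_student = nn.Sequential(nn.Flatten(), nn.Linear(22050, 1000))          # stand-in audio encoder
optimizer = torch.optim.Adam(audio_student.parameters(), lr=1e-4)

frames = torch.rand(8, 3, 64, 64)          # video frames ...
audio = torch.rand(8, 1, 22050)            # ... and the audio that accompanies them

with torch.no_grad():                       # the teacher only provides soft targets
    target_probs = F.softmax(vision_teacher(frames), dim=1)

student_log_probs = F.log_softmax(audio_student(audio), dim=1)
loss = F.kl_div(student_log_probs, target_probs, reduction="batchmean")
loss.backward()
optimizer.step()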
80
Cross-modal Learning
[Figure: two Encoders, one per modality, mapping into a shared representation or embedding.]
81
Cross-modal Learning
Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Audio and visual features learned by assessing alignment.
82
Cross-modal Learning
Audio and visual features learned by assessing alignment.
Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." ECCV 2018.
Audio-visual Object localization Network
83
Cross-modal Learning
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual
context." NIPS 2016. [talk]
Train visual and speech networks with pairs of (non-)corresponding images and speech.
84
Cross-modal Learning
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual
context." NIPS 2016. [talk]
The similarity curve shows which regions of the spectrogram are relevant for the image.
Important: no text transcriptions are used during training!
85
Cross-modal Learning
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly
Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
Regions matching the spoken word “WOMAN”:
86
Outline
1. Unsupervised Learning
2. Semi-supervised Learning
3. Self-supervised Learning
a. Predictive Learning
b. Learning by verification
c. Cross-modal Learning
87
Final Questions
