Unsupervised Learning of Object Landmarks
through Conditional Image Generation
Tomas Jakab1∗ Ankush Gupta1∗ Hakan Bilen2 Andrea Vedaldi1
1 Visual Geometry Group, University of Oxford
2 School of Informatics, University of Edinburgh
Advances in Neural Information Processing Systems (NeurIPS) 2018
Bingwen hu
2019-01-20
Goal
Learn semantically meaningful landmarks without any manual annotations.
The method learns automatically from images or videos and works across datasets of faces, human bodies,
and 3D objects.
Why learn landmarks?
Low dimensional object representation
Interpretable
Why unsupervised?
Reduce dependency on expensive manual annotations
Leverage the vast amount of video available online
Architecture
[Figure: architecture diagram. Labels: source image, target image, appearance encoding, unsupervised keypoint extraction, heatmap for each keypoint, image reconstruction.]
Method
(1) Heatmaps bottleneck
Then, each heatmap is replaced with a Gaussian-like function centred at u*_k with
a small fixed standard deviation σ.
• It provides a differentiable and distributed representation of the landmark
locations.
• It restricts the information taken from the target image to spatial locations only.
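The bottleneck above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the step preceding "Then" (not shown on the slide) is a per-channel spatial softmax whose expected coordinate gives u*_k, that pixel coordinates are normalised to [-1, 1], and that σ = 0.1 as on the model details slide. The function name `heatmap_bottleneck` is hypothetical.

```python
import numpy as np


def heatmap_bottleneck(scores, sigma=0.1):
    """Turn raw score maps (H, W, K) into K landmark locations and Gaussian maps.

    Returns (landmarks, gaussians) with shapes (K, 2) and (H, W, K).
    """
    H, W, K = scores.shape

    # Assumed coordinate convention: a regular grid over [-1, 1] x [-1, 1].
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")  # each (H, W)

    # Spatial softmax per channel (assumed normalisation step).
    flat = scores.reshape(H * W, K)
    flat = flat - flat.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=0, keepdims=True)
    probs = probs.reshape(H, W, K)

    # Landmark location u*_k as the expected coordinate under the softmax.
    mu_y = (probs * grid_y[..., None]).sum(axis=(0, 1))  # (K,)
    mu_x = (probs * grid_x[..., None]).sum(axis=(0, 1))  # (K,)
    landmarks = np.stack([mu_y, mu_x], axis=1)  # (K, 2)

    # Replace each heatmap with a Gaussian-like bump centred at u*_k.
    d2 = (grid_y[..., None] - mu_y) ** 2 + (grid_x[..., None] - mu_x) ** 2
    gaussians = np.exp(-d2 / (2.0 * sigma ** 2))  # (H, W, K)
    return landmarks, gaussians
```

Because only the K coordinates survive this step, the rendered maps carry location but no appearance, which is the restriction the second bullet describes.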
(2) Generator network using a perceptual loss
Here Γ(x) is an off-the-shelf pre-trained neural network, for
example VGG-19, and Γ_l denotes the output of its l-th sub-network.
• The perceptual loss compares activations extracted from multiple
layers of a deep network for the reference and the generated images,
instead of only the raw pixel values.
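A perceptual loss of this kind can be sketched as follows. The loss equation itself is not reproduced on the slide, so the form used here is an assumption: a weighted sum over layers of squared differences between Γ_l of the reference image and Γ_l of the generated image. The per-layer weights and the two toy feature functions (`pixels`, `pooled`) are hypothetical stand-ins for the sub-networks of a pre-trained model such as VGG-19.

```python
import numpy as np


def perceptual_loss(feature_fns, reference, generated, weights=None):
    """Sum over layers l of weight_l * ||Gamma_l(reference) - Gamma_l(generated)||^2."""
    if weights is None:
        weights = [1.0] * len(feature_fns)  # assumed: equal weight per layer
    loss = 0.0
    for alpha, gamma_l in zip(weights, feature_fns):
        diff = gamma_l(reference) - gamma_l(generated)
        loss += alpha * float(np.sum(diff ** 2))
    return loss


# Hypothetical stand-ins for pre-trained sub-networks.
def pixels(img):
    """Identity features: reduces the loss to a plain pixel-wise comparison."""
    return img


def pooled(img):
    """2x2 average pooling, mimicking a coarser feature map."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))
```

In practice the entries of `feature_fns` would be truncated copies of the pre-trained network, each ending at a different layer.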
Model details
• Landmark detection network: ingests the image x' and produces K
landmark heatmaps y'.
It is composed of sequential blocks, each consisting of two convolutional layers.
The final output, which holds the heatmaps, has spatial size 16×16.
Its K feature channels are then used to render 16×16×K 2D Gaussian
maps y' (with σ = 0.1).
• Image generation network: takes the image x and the landmarks
y' = Φ(x') as input and reconstructs x'.
First, the image x is encoded as a feature tensor z.
Next, the features z and the landmarks y' are stacked together and fed to a
regressor that reconstructs the target frame x'.
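The stacking step can be illustrated with array shapes alone. This sketch assumes that z shares the 16×16 spatial size of the landmark maps and that "stacked" means concatenation along the channel axis. The channel count of z and the helper name `stack_generator_input` are hypothetical, and the regressor itself is not modelled.

```python
import numpy as np


def stack_generator_input(z, y_prime):
    """Concatenate appearance features z (H, W, C) with landmark maps y' (H, W, K)."""
    if z.shape[:2] != y_prime.shape[:2]:
        raise ValueError("z and y' must share the same spatial size")
    return np.concatenate([z, y_prime], axis=-1)  # (H, W, C + K)


# Example shapes: C = 32 is an arbitrary choice, K = 10 landmarks.
z = np.zeros((16, 16, 32))       # appearance features from the source image x
y_prime = np.ones((16, 16, 10))  # Gaussian landmark maps from the target image x'
stacked = stack_generator_input(z, y_prime)
```

The regressor then sees appearance only through z and geometry only through y', which is what lets the two be recombined across images.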
Experiments
Experiments: Learning facial landmarks
Experiments: Learning human body landmarks
Experiments: Learning 3D object landmarks
Experiments: Disentangling appearance and geometry