17th November, 2019
PR12 Paper Review
Ho Seong Lee
Cognex + SUALAB
Unsupervised Visual Representation
Learning Overview
“Toward Self-Supervision”
Contents
• What is “Self-Supervision”?
• Self-Supervised Visual Representation Learning
• Exemplar
• Relative Patch Location
• Jigsaw Puzzles
• Count
• Multi-task
• Rotation
• Autoencoder-Based
What is “Self-Supervision”?
• Supervised learning is powerful, but needs a large amount of labeled data
• Much research tackling this problem is in progress:
• Transfer learning, Domain adaptation, Semi-supervised, Weakly-supervised and Unsupervised Learning
• Self-Supervised Visual Representation Learning
• Sub-class of Unsupervised learning where the data provides the “self-supervision”
• Define pretext tasks which can be formulated using only unlabeled data, but do require higher-level
semantic understanding in order to be solved
• The features obtained with pretext tasks can be successfully transferred to classification, detection task
What is “Self-Supervision”?
• Pretext task in Self-Supervised Visual Representation Learning
• Exemplar, 2014 NIPS
• Relative Patch Location, 2015 ICCV
• Jigsaw Puzzles, 2016 ECCV
• Autoencoder-Based Approaches - Denoising Autoencoder (2008), Context Autoencoder (2016),
Colorization (2016), Split-brain Autoencoders (2017)
• Count, 2017 ICCV
• Multi-task, 2017 ICCV
• Rotation, 2018 ICLR
Self-Supervised Visual Representation Learning – Exemplar
• ”Discriminative unsupervised feature learning with exemplar convolutional neural
networks”, 2014 NIPS
• Randomly sample 𝑁 ∈ [50, 32000] patches of size 32x32 from different images
• Apply various transformations to a randomly sampled “seed” image patch
• Train to classify all exemplars of a seed as the same class → does not scale to large datasets!
(Figure: a seed patch and its transformations; seed patches are sampled from regions containing considerable gradients; trained with the STL-10 dataset, 96x96)
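The exemplar setup can be sketched as surrogate-class construction — a minimal, hypothetical version in which flips and 90° rotations stand in for the paper's richer transformation set, and the gradient-based region sampling is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_seed_patches(images, n_seeds=8, patch=32):
    # Sample one 32x32 "seed" patch per surrogate class. The paper biases
    # sampling toward high-gradient regions; this sketch samples uniformly.
    seeds = []
    for _ in range(n_seeds):
        img = images[rng.integers(len(images))]
        y = int(rng.integers(0, img.shape[0] - patch + 1))
        x = int(rng.integers(0, img.shape[1] - patch + 1))
        seeds.append(img[y:y + patch, x:x + patch])
    return seeds

def augment(patch_img):
    # Random horizontal flip plus a random 90-degree rotation, standing in
    # for the paper's full set (crops, color, contrast, ...).
    out = patch_img[:, ::-1] if rng.random() < 0.5 else patch_img
    return np.rot90(out, k=int(rng.integers(4)))

# Each seed patch defines its own surrogate class.
images = [rng.random((96, 96)) for _ in range(10)]   # STL-10-sized dummies
seeds = sample_seed_patches(images)
dataset = [(augment(s), label)                       # (exemplar, class id)
           for label, s in enumerate(seeds) for _ in range(4)]
```

Because every seed becomes its own class, the classifier head grows with N — the scalability problem noted above.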
Self-Supervised Visual Representation Learning – Relative Patch Location
• “Unsupervised Visual Representation Learning by Context Prediction”, 2015 ICCV
• Aims at self-supervised learning for image data using context prediction
• The algorithm must guess the position of one patch relative to the other
Self-Supervised Visual Representation Learning – Relative Patch Location
• “Unsupervised Visual Representation Learning by Context Prediction”, 2015 ICCV
• AlexNet-based architecture for pair classification
• Avoid “trivial” solutions using two precautions: Include a gap, Randomly jitter
(Figure: shared-weight network pair; a gap is included between patches and each patch location is randomly jittered)
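A sketch of the patch-pair sampling with both precautions applied; the patch size, gap, and jitter values here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 positions of the neighbor patch relative to the center patch,
# laid out on a 3x3 grid with the center excluded.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_pair(img, patch=32, gap=8, jitter=4):
    # The gap and the random jitter are the two precautions against
    # trivial cues such as edge continuity. Assumes a square image.
    stride = patch + gap
    lo = stride + jitter
    hi = img.shape[0] - stride - patch - jitter
    cy = int(rng.integers(lo, hi + 1))
    cx = int(rng.integers(lo, hi + 1))
    label = int(rng.integers(8))                 # 8-way classification target
    dy, dx = OFFSETS[label]
    ny = cy + dy * stride + int(rng.integers(-jitter, jitter + 1))
    nx = cx + dx * stride + int(rng.integers(-jitter, jitter + 1))
    center = img[cy:cy + patch, cx:cx + patch]
    neighbor = img[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label

img = rng.random((256, 256))
center, neighbor, label = sample_pair(img)
```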
Self-Supervised Visual Representation Learning – Jigsaw Puzzles
• “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, 2016 ECCV
• Recover relative spatial position of 9 randomly sampled image patches after random permutation
• 9! = 362,880 permutations, so remove similar permutations → use predefined permutation set (100)
• Network output is 100-d vector that predicts a permutation index
(Figure: sample image → extract 9 patches → permute them, e.g. with permutation 9, 5, 8, 3, 2, 4, 7, 1, 6; the network predicts the permutation index, here 61, out of 0~99)
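The predefined permutation set can be built by greedily picking permutations that are pairwise far apart in Hamming distance. This sketch samples random candidates at each step instead of enumerating all 9! permutations as the paper does:

```python
import numpy as np

def build_permutation_set(n_perms=100, n_tiles=9, n_candidates=100, seed=0):
    # Greedy maximal-Hamming-distance selection (approximate sketch).
    rng = np.random.default_rng(seed)
    chosen = [rng.permutation(n_tiles)]
    for _ in range(n_perms - 1):
        # n_candidates random permutations, one per row
        candidates = rng.permuted(
            np.tile(np.arange(n_tiles), (n_candidates, 1)), axis=1)
        # distance of each candidate to its nearest already-chosen permutation
        dists = [min(int((c != p).sum()) for p in chosen)
                 for c in candidates]
        chosen.append(candidates[int(np.argmax(dists))])
    return np.stack(chosen)

perms = build_permutation_set()   # (100, 9): row index = classification label
```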
Self-Supervised Visual Representation Learning – Jigsaw Puzzles
• “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, 2016 ECCV
• Propose the context-free network (CFN), a Siamese-ennead CNN
• Fewer parameters than AlexNet while preserving the same semantic learning capabilities
Self-Supervised Visual Representation Learning – Autoencoder-Based Approaches
• Autoencoder-Based Approaches
• Denoising Autoencoder, Context Autoencoder, Colorization, Split-brain Autoencoders
• Learn image features by reconstructing images without any annotation
(Figures: Denoising Autoencoder with random-noise corruption, Context Autoencoder, Image Colorization, Split-Brain Autoencoder)
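The "free" supervision in these approaches amounts to constructing (corrupted input, clean target) pairs; a minimal sketch for two of the variants (noise level and hole size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_pairs(images, noise_std=0.2):
    # Denoising-autoencoder supervision: the corrupted image is the input,
    # the clean image is the reconstruction target — no labels needed.
    noisy = [np.clip(x + rng.normal(0, noise_std, x.shape), 0, 1)
             for x in images]
    return list(zip(noisy, images))

def make_inpainting_pairs(images, hole=8):
    # Context-autoencoder supervision: mask out a central region; the
    # network must reconstruct the missing content from its surroundings.
    pairs = []
    for x in images:
        corrupted = x.copy()
        y0 = (x.shape[0] - hole) // 2
        corrupted[y0:y0 + hole, y0:y0 + hole] = 0.0
        pairs.append((corrupted, x))
    return pairs

imgs = [rng.random((32, 32)) for _ in range(4)]
den = make_denoising_pairs(imgs)
inp = make_inpainting_pairs(imgs)
```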
Self-Supervised Visual Representation Learning – Count
• “Representation Learning by Learning to Count”, 2017 ICCV
• The number of visual primitives in the whole image should match the sum of those in each tile
• Also, a feature that counts visual primitives should not be affected by scale, translation and rotation
• In this work, downsampling (D) and tiling (𝑇𝑗, j = 1, 2, 3, 4) are used
* These values are not labels, just an example for explanation!
Self-Supervised Visual Representation Learning – Count
• “Representation Learning by Learning to Count”, 2017 ICCV
• The feature vector (counting vector) is used for calculating the loss
• An ℓ2 loss alone admits a trivial solution (all feature vectors = 0), so a contrastive loss is added
• Enforce that the counting features differ between two randomly chosen different images
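A sketch of the counting objective (assumed form of the contrastive term; the margin value is arbitrary): the summed tile features should match the downsampled-image feature, while a different image's feature must stay at least the margin away:

```python
import numpy as np

def counting_loss(phi_tiles, phi_down, phi_other, margin=10.0):
    # phi_tiles: (4, d) features of the four tiles T_j
    # phi_down:  (d,) feature of the downsampled image D
    # phi_other: (d,) feature of a different, randomly chosen image
    diff = phi_tiles.sum(axis=0) - phi_down
    l2 = float((diff ** 2).sum())                      # counting constraint
    dist_other = float(((phi_down - phi_other) ** 2).sum())
    contrastive = max(0.0, margin - dist_other)        # pushes features apart
    return l2 + contrastive
```

Note that the all-zeros "trivial solution" satisfies the ℓ2 term perfectly but pays the full margin penalty in the contrastive term.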
Self-Supervised Visual Representation Learning – Multi-task
• “Multi-task Self-Supervised Visual Learning”, 2017 ICCV
• Implement four different self-supervision methods and combine them in a single neural network
• Relative Patch Location + Colorization + Exemplar + Motion Segmentation
• Evaluate for ImageNet (Classification), PASCAL VOC 2007 (Detection), NYU V2 (Depth Prediction)
Self-Supervised Visual Representation Learning – Rotations
• “Unsupervised representation learning by predicting image rotations”, 2018 ICLR
• Rotate a single image and classify the rotation which was applied – {0°, 90°, 180°, 270°}
• Intuitively, a good model should learn to recognize canonical orientations of objects in natural images
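The rotation pretext task reduces to generating 4-way classification labels; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_batch(images):
    # Each image gets a random k in {0, 1, 2, 3}; the pretext label is k,
    # i.e. a rotation of k * 90 degrees.
    xs, ys = [], []
    for img in images:
        k = int(rng.integers(4))
        xs.append(np.rot90(img, k))
        ys.append(k)
    return xs, ys

imgs = [rng.random((32, 32)) for _ in range(8)]
xs, ys = rotation_batch(imgs)
```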
Self-Supervised Visual Representation Learning
• Task Generalization of Self-Supervised Learning: ImageNet classification
• All unsupervised methods are pre-trained on ImageNet without labels (unsupervised way)
• All weights are frozen and feature maps are spatially resized so as to have around 9000 elements
• Train linear classifiers on top of the feature maps of each layer by logistic regression
• All approaches use AlexNet variants
(Figure: pre-train with self-supervision, then train linear classifiers on frozen features)
Self-Supervised Visual Representation Learning
• Task Generalization of Self-Supervised Learning: ImageNet classification
• SGD with batch size 192, momentum 0.9, weight decay 5e-4, learning rate 0.01
• Learning rate decays by a factor of 10 after epochs 10 and 20; trained for 30 epochs in total
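The step schedule above, as a small sketch:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(10, 20), gamma=0.1):
    # Step schedule: lr is multiplied by gamma after each milestone epoch.
    factor = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** factor

schedule = [learning_rate(e) for e in range(30)]
```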
(Table: ImageNet top-1 classification accuracy of the self-supervised methods, per layer)
Self-Supervised Visual Representation Learning
• Task & Dataset Generalization of Self-Supervised Learning: PASCAL VOC
• PASCAL VOC 2007 classification, detection and PASCAL VOC 2012 segmentation
• All unsupervised methods are pre-trained on ImageNet without labels (unsupervised way)
(Figure: self-supervised pre-training followed by fine-tuning on the downstream task)
Self-Supervised Visual Representation Learning
• Recent papers not covered today…
• Deep Cluster (2018, ECCV)
• Revisiting Self-Supervised Visual Representation Learning (2019, CVPR)
• Selfie (2019, arXiv)
• Deeper Cluster (2019, ICCV)
• S4L (2019, ICCV)
(Figures: Deep Cluster, Deeper Cluster)
Self-Supervised Visual Representation Learning
• Summary
• Define pretext tasks which can be formulated using only unlabeled data, but do require higher-level
semantic understanding in order to be solved
• Pre-train feature extractor and transfer to downstream task (classification, detection, etc.)
Thanks!
