17th November, 2019
PR12 Paper Review
Ho Seong Lee
Cognex + SUALAB
Unsupervised Visual Representation
Learning Overview
“Toward Self-Supervision”
Contents
• What is “Self-Supervision”?
• Self-Supervised Visual Representation Learning
• Exemplar
• Relative Patch Location
• Jigsaw Puzzles
• Count
• Multi-task
• Rotation
• Autoencoder-Based
What is “Self-Supervision”?
• Supervised learning is powerful, but needs a large amount of labeled data
• Much research tackling this problem is in progress:
• Transfer learning, Domain adaptation, Semi-supervised, Weakly-supervised and Unsupervised Learning
• Self-Supervised Visual Representation Learning
• Sub-class of Unsupervised learning where the data provides the “self-supervision”
• Define pretext tasks which can be formulated using only unlabeled data, but do require higher-level
semantic understanding in order to be solved
• The features obtained with pretext tasks can be successfully transferred to classification, detection task
What is “Self-Supervision”?
• Pretext task in Self-Supervised Visual Representation Learning
• Exemplar, 2014 NIPS
• Relative Patch Location, 2015 ICCV
• Jigsaw Puzzles, 2016 ECCV
• Autoencoder-Based Approaches - Denoising Autoencoder (2008), Context Autoencoder (2016),
Colorization (2016), Split-brain Autoencoders (2017)
• Count, 2017 ICCV
• Multi-task, 2017 ICCV
• Rotation, 2018 ICLR
Self-Supervised Visual Representation Learning – Exemplar
• ”Discriminative unsupervised feature learning with exemplar convolutional neural
networks”, 2014 NIPS
• Randomly sample 𝑁 ∈ [50, 32000] patches of size 32x32 from different images
• Apply various transformations to a randomly sampled “seed” image patch
• Train to classify all exemplars of a seed as the same class → does not scale to large datasets!
(Figure: a seed patch and its transformations; seed patches are sampled from regions containing considerable gradients; trained with the STL-10 dataset, 96x96)
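The exemplar setup can be sketched as surrogate-class construction — a minimal, hypothetical version in which flips and 90° rotations stand in for the paper's richer transformation set, and the gradient-based region sampling is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_seed_patches(images, n_seeds=8, patch=32):
    # Sample one 32x32 "seed" patch per surrogate class. The paper biases
    # sampling toward high-gradient regions; this sketch samples uniformly.
    seeds = []
    for _ in range(n_seeds):
        img = images[rng.integers(len(images))]
        y = int(rng.integers(0, img.shape[0] - patch + 1))
        x = int(rng.integers(0, img.shape[1] - patch + 1))
        seeds.append(img[y:y + patch, x:x + patch])
    return seeds

def augment(patch_img):
    # Random horizontal flip plus a random 90-degree rotation, standing in
    # for the paper's full set (crops, color, contrast, ...).
    out = patch_img[:, ::-1] if rng.random() < 0.5 else patch_img
    return np.rot90(out, k=int(rng.integers(4)))

# Each seed patch defines its own surrogate class.
images = [rng.random((96, 96)) for _ in range(10)]   # STL-10-sized dummies
seeds = sample_seed_patches(images)
dataset = [(augment(s), label)                       # (exemplar, class id)
           for label, s in enumerate(seeds) for _ in range(4)]
```

Because every seed becomes its own class, the classifier head grows with N — the scalability problem noted above.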
Self-Supervised Visual Representation Learning – Relative Patch Location
• “Unsupervised Visual Representation Learning by Context Prediction”, 2015 ICCV
• Aims at self-supervised learning for image data using context prediction
• The algorithm must guess the position of one patch relative to the other
Self-Supervised Visual Representation Learning – Relative Patch Location
• “Unsupervised Visual Representation Learning by Context Prediction”, 2015 ICCV
• AlexNet-based architecture for pair classification
• Avoid “trivial” solutions using two precautions: Include a gap, Randomly jitter
(Figure: shared-weight network pair; a gap is included between patches and each patch location is randomly jittered)
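A sketch of the patch-pair sampling with both precautions applied; the patch size, gap, and jitter values here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 positions of the neighbor patch relative to the center patch,
# laid out on a 3x3 grid with the center excluded.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_pair(img, patch=32, gap=8, jitter=4):
    # The gap and the random jitter are the two precautions against
    # trivial cues such as edge continuity. Assumes a square image.
    stride = patch + gap
    lo = stride + jitter
    hi = img.shape[0] - stride - patch - jitter
    cy = int(rng.integers(lo, hi + 1))
    cx = int(rng.integers(lo, hi + 1))
    label = int(rng.integers(8))                 # 8-way classification target
    dy, dx = OFFSETS[label]
    ny = cy + dy * stride + int(rng.integers(-jitter, jitter + 1))
    nx = cx + dx * stride + int(rng.integers(-jitter, jitter + 1))
    center = img[cy:cy + patch, cx:cx + patch]
    neighbor = img[ny:ny + patch, nx:nx + patch]
    return center, neighbor, label

img = rng.random((256, 256))
center, neighbor, label = sample_pair(img)
```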
Self-Supervised Visual Representation Learning – Jigsaw Puzzles
• “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, 2016 ECCV
• Recover relative spatial position of 9 randomly sampled image patches after random permutation
• 9! = 362,880 permutations, so remove similar permutations → use predefined permutation set (100)
• Network output is 100-d vector that predicts a permutation index
(Figure: sample image → extract 9 patches → permute them, e.g. with permutation 9, 5, 8, 3, 2, 4, 7, 1, 6; the network predicts the permutation index, here 61, out of 0~99)
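The predefined permutation set can be built by greedily picking permutations that are pairwise far apart in Hamming distance. This sketch samples random candidates at each step instead of enumerating all 9! permutations as the paper does:

```python
import numpy as np

def build_permutation_set(n_perms=100, n_tiles=9, n_candidates=100, seed=0):
    # Greedy maximal-Hamming-distance selection (approximate sketch).
    rng = np.random.default_rng(seed)
    chosen = [rng.permutation(n_tiles)]
    for _ in range(n_perms - 1):
        # n_candidates random permutations, one per row
        candidates = rng.permuted(
            np.tile(np.arange(n_tiles), (n_candidates, 1)), axis=1)
        # distance of each candidate to its nearest already-chosen permutation
        dists = [min(int((c != p).sum()) for p in chosen)
                 for c in candidates]
        chosen.append(candidates[int(np.argmax(dists))])
    return np.stack(chosen)

perms = build_permutation_set()   # (100, 9): row index = classification label
```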
Self-Supervised Visual Representation Learning – Jigsaw Puzzles
• “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, 2016 ECCV
• Propose the context-free network (CFN), a Siamese-ennead CNN
• Fewer parameters than AlexNet while preserving the same semantic learning capabilities
Self-Supervised Visual Representation Learning – Autoencoder-Based Approaches
• Autoencoder-Based Approaches
• Denoising Autoencoder, Context Autoencoder, Colorization, Split-brain Autoencoders
• Learn image features by reconstructing images without any annotation
(Figures: Denoising Autoencoder with random-noise corruption, Context Autoencoder, Image Colorization, Split-Brain Autoencoder)
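The "free" supervision in these approaches amounts to constructing (corrupted input, clean target) pairs; a minimal sketch for two of the variants (noise level and hole size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_pairs(images, noise_std=0.2):
    # Denoising-autoencoder supervision: the corrupted image is the input,
    # the clean image is the reconstruction target — no labels needed.
    noisy = [np.clip(x + rng.normal(0, noise_std, x.shape), 0, 1)
             for x in images]
    return list(zip(noisy, images))

def make_inpainting_pairs(images, hole=8):
    # Context-autoencoder supervision: mask out a central region; the
    # network must reconstruct the missing content from its surroundings.
    pairs = []
    for x in images:
        corrupted = x.copy()
        y0 = (x.shape[0] - hole) // 2
        corrupted[y0:y0 + hole, y0:y0 + hole] = 0.0
        pairs.append((corrupted, x))
    return pairs

imgs = [rng.random((32, 32)) for _ in range(4)]
den = make_denoising_pairs(imgs)
inp = make_inpainting_pairs(imgs)
```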
Self-Supervised Visual Representation Learning – Count
• “Representation Learning by Learning to Count”, 2017 ICCV
• The number of visual primitives in the whole image should match the sum of those in each tile
• Also, a feature that counts visual primitives should not be affected by scale, translation and rotation
• In this work, downsampling (D) and tiling (𝑇𝑗, j = 1, 2, 3, 4) are used
* These values are not labels, just an example for explanation!
Self-Supervised Visual Representation Learning – Count
• “Representation Learning by Learning to Count”, 2017 ICCV
• The feature vector (counting vector) is used for calculating the loss
• An ℓ2 loss alone admits a trivial solution (all feature vectors = 0), so a contrastive loss is added
• Enforce that the counting features differ between two randomly chosen different images
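A sketch of the counting objective (assumed form of the contrastive term; the margin value is arbitrary): the summed tile features should match the downsampled-image feature, while a different image's feature must stay at least the margin away:

```python
import numpy as np

def counting_loss(phi_tiles, phi_down, phi_other, margin=10.0):
    # phi_tiles: (4, d) features of the four tiles T_j
    # phi_down:  (d,) feature of the downsampled image D
    # phi_other: (d,) feature of a different, randomly chosen image
    diff = phi_tiles.sum(axis=0) - phi_down
    l2 = float((diff ** 2).sum())                      # counting constraint
    dist_other = float(((phi_down - phi_other) ** 2).sum())
    contrastive = max(0.0, margin - dist_other)        # pushes features apart
    return l2 + contrastive
```

Note that the all-zeros "trivial solution" satisfies the ℓ2 term perfectly but pays the full margin penalty in the contrastive term.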
Self-Supervised Visual Representation Learning – Multi-task
• “Multi-task Self-Supervised Visual Learning”, 2017 ICCV
• Implement four different self-supervision methods and combine them in a single neural network
• Relative Patch Location + Colorization + Exemplar + Motion Segmentation
• Evaluate for ImageNet (Classification), PASCAL VOC 2007 (Detection), NYU V2 (Depth Prediction)
Self-Supervised Visual Representation Learning – Rotations
• “Unsupervised representation learning by predicting image rotations”, 2018 ICLR
• Rotate a single image and classify the rotation which was applied – {0°, 90°, 180°, 270°}
• Intuitively, a good model should learn to recognize canonical orientations of objects in natural images
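The rotation pretext task reduces to generating 4-way classification labels; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_batch(images):
    # Each image gets a random k in {0, 1, 2, 3}; the pretext label is k,
    # i.e. a rotation of k * 90 degrees.
    xs, ys = [], []
    for img in images:
        k = int(rng.integers(4))
        xs.append(np.rot90(img, k))
        ys.append(k)
    return xs, ys

imgs = [rng.random((32, 32)) for _ in range(8)]
xs, ys = rotation_batch(imgs)
```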
Self-Supervised Visual Representation Learning
• Task Generalization of Self-Supervised Learning: ImageNet classification
• All unsupervised methods are pre-trained on ImageNet without labels (unsupervised way)
• All weights are frozen and feature maps are spatially resized so as to have around 9000 elements
• Train linear classifiers on top of the feature maps of each layer by logistic regression
• All approaches use AlexNet variants
(Figure: pre-train with self-supervision, then train linear classifiers on frozen features)
Self-Supervised Visual Representation Learning
• Task Generalization of Self-Supervised Learning: ImageNet classification
• SGD with batch size 192, momentum 0.9, weight decay 5e-4, learning rate 0.01
• Learning rate decays by a factor of 10 after epochs 10 and 20; trained for 30 epochs in total
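The step schedule above, as a small sketch:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(10, 20), gamma=0.1):
    # Step schedule: lr is multiplied by gamma after each milestone epoch.
    factor = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** factor

schedule = [learning_rate(e) for e in range(30)]
```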
(Table: ImageNet top-1 classification accuracy of the self-supervised methods, per layer)
Self-Supervised Visual Representation Learning
• Task & Dataset Generalization of Self-Supervised Learning: PASCAL VOC
• PASCAL VOC 2007 classification, detection and PASCAL VOC 2012 segmentation
• All unsupervised methods are pre-trained on ImageNet without labels (unsupervised way)
(Figure: self-supervised pre-training followed by fine-tuning on the downstream task)
Self-Supervised Visual Representation Learning
• Recent papers not covered today…
• Deep Cluster (2018, ECCV)
• Revisiting Self-Supervised Visual Representation Learning (2019, CVPR)
• Selfie (2019, arXiv)
• Deeper Cluster (2019, ICCV)
• S4L (2019, ICCV)
(Figures: Deep Cluster, Deeper Cluster)
Self-Supervised Visual Representation Learning
• Summary
• Define pretext tasks which can be formulated using only unlabeled data, but do require higher-level
semantic understanding in order to be solved
• Pre-train feature extractor and transfer to downstream task (classification, detection, etc.)
Thanks!
