Learning visual representation without human label

Learning Visual Representation
without Manual-Label
kv
kelispinor@gmail.com

Types of Learning
        | With Target                             | Without Target
Active  | Reinforcement Learning, Active Learning | Motivation, Exploration ?
Passive | Supervised Learning                     | Self-Supervised Learning
Active: Non-Stationary Dataset
Passive: Stationary Dataset

Today’s Topic
Visual Representation
● Global: style, semantics
● Local: attribute
● Metric: embedding

Today’s Topic
General visual features

Today’s Topic
Label
● Much more expensive than general data (cannot scale)
● Usually annotated for the specific task
● Contains far less information than data itself
General visual features

Motivation
Goal of self-supervised learning
● Explore the structure of data distribution
● Task-driven representations are limited by targets (requirements of the task)
● Rapid generalization to new tasks and applications

Practical Motivation
● Hot research topics
● Performance approaches the supervised setting
● Relates to deep metric learning
SSLFTW → Self-Supervised Learning F**k The World !!!
https://twitter.com/ylecun/status/1228763787244843013

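Evolution of ResNet-50
R50: # of params = 24M
Date    | Training Scenario | Method          | Backbone            | Label fraction | Top-1 Accuracy | Top-5 Accuracy
2019/11 | Semi Sup.         | Noisy Student   | EfficientNet (480M) | 100 + extra    | 88.4           | 98.7
2019/05 | Semi Sup.         | Teacher-Student | R50 (24M)           | 100 + extra    | 81.2           | -
-       | Sup.              | -               | R50 (24M)           | 100            | 76             | -
2020/02 | Self Sup.         | SimCLR          | R50 (24M)           | 0              | 69.3           | 89
2019/06 | Self Sup.         | CMC             | R50 (24M)           | 0              | 64.1           | 85.1
2019/12 | Self Sup.         | CPC v2          | R50 (24M)           | 0              | 63.8           | 85.3
2019/12 | Self Sup.         | PIRL            | R50 (24M)           | 0              | 63.6           | -
2019/11 | Self Sup.         | MoCo            | R50 (24M)           | 0              | 60.6           | -
2019/07 | Self Sup.         | BigBiGAN        | R50 (24M)           | 0              | 56.6           | -
2018/05 | Self Sup.         | InstDisc        | R50 (24M)           | 0              | 54             | -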

Outline: Questions to be Discussed
● Are manual labels necessary for learning useful concepts? Does the data itself contain rich information?
● Can we treat each image as a single class? Or even a single pixel as a class?
● Can we implicitly increase the batch size? How do we keep the embedding space stable?
● Does the final layer contain a rich representation?
● Have we reached the complexity upper bound of ResNet-50? If not, what is an efficient training procedure?
● Is data augmentation a trick or an important DL feature?

Part I: Self-Supervised Learning

Previous Works
Generate `labels` by rule
● Rotation
● Jigsaw Puzzle
● Colorization

Predicting Image Rotations
Unsupervised Representation Learning by Predicting Image Rotations
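
A minimal sketch of how rotation `labels` can be generated by rule, assuming square images stored as NumPy arrays. The helper name `make_rotation_batch` is hypothetical and not taken from the cited paper; the network that classifies the rotation is left out.

```python
import numpy as np

def make_rotation_batch(images):
    """Return every image rotated by 0/90/180/270 degrees, plus the rotation index as label.

    images: array of shape (n, h, w, c) with h == w, so rotations keep the shape.
    """
    rotated, labels = [], []
    for img in images:
        for k in range(4):  # k quarter-turns
            rotated.append(np.rot90(img, k=k, axes=(0, 1)))
            labels.append(k)  # the label comes from the rule, not from a human
    return np.stack(rotated), np.array(labels)

# Toy usage with random "images" (assumption: 8x8 RGB)
images = np.random.default_rng(0).random((2, 8, 8, 3))
x, y = make_rotation_batch(images)
```

Each input image yields four training examples, and the pretext task is a 4-way classification of `y` from `x`.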

Solving Jigsaw Puzzles
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Colorization
Tracking Emerges by Colorizing Videos

Two Major Ideas
Pretext Task (Surrogate Loss)
● Rotation
● Jigsaw Puzzle
● Colorization
`Data-Centric` Loss Function
● Mutual Information
● Energy-based model
Aside:
Disentanglement in β-TCVAE

Autoregressive Generative Model & PixelCNN
Product chain rule of probability
Conditional Image Generation with PixelCNN Decoders
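
The formula on this slide did not survive extraction. For reference, the product (chain) rule that autoregressive models such as PixelCNN rely on can be written as below, assuming an image x with n × n pixels taken in raster order:

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Each pixel is predicted from all pixels that come before it, so the joint likelihood factorizes into a product of conditionals.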

Self-Supervised without Reconstruction
● PixelCNN is almost the best likelihood model
● But log-likelihood models are flawed at encoding high-level information
● Deep network learns hierarchical internal representation of the data
● Learn the dataset, not the data points
● Use high-level information to organize low-level data rather than annotate it

CPC: Contrastive Predictive Coding
Rather than modeling the distribution directly, extract the information shared between x & c, which serves the purpose better
Summarize pixels into context (or history into current state)
MI: generalized correlation function

Representation Learning with Contrastive Predictive Coding

To maximize mutual information, we model the ratio of probability densities.
Simply use a log-bilinear model as f, with a linear map to predict the future.
Note: bilinear map f(u, v) = dot(u, v)
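
The formulas on this slide did not survive extraction. As a reference point, the corresponding expressions in the cited CPC paper can be written as below. The notation is taken from that paper rather than from the slide: z are encoded samples, c_t is the context, W_k is the linear prediction map for step k, and X is a set of N samples containing one positive.

```latex
f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})},
\qquad
f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^{\top} W_k c_t\right)

\mathcal{L}_N = -\,\mathbb{E}_{X}\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right],
\qquad
I(x_{t+k}; c_t) \ge \log N - \mathcal{L}_N
```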

● It’s called InfoNCE (also called categorical cross-entropy or softmax loss)
● 1 Positive sample; N-1 Negative samples
● N is crucial to the performance
● Loose theoretical lower bound estimation
(Figure labels: all predictions, real future)
Learning Deep Representations of Fine-grained Visual Descriptions
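
A minimal NumPy sketch of InfoNCE for a single prediction, assuming dot-product scores as in the note on the bilinear map. The function name `info_nce` and its arguments are illustrative, not an API from any of the cited papers.

```python
import numpy as np

def info_nce(pred, candidates, positive_index=0):
    """Softmax cross-entropy over N candidates: 1 positive, N-1 negatives.

    pred:       (d,) predicted future embedding (e.g. a linear map of the context)
    candidates: (N, d) encoded samples
    """
    logits = candidates @ pred              # log f(u, v) = dot(u, v)
    logits = logits - logits.max()          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]

# With uninformative scores the loss equals log N, so the bound log N - loss is 0.
uninformative = info_nce(np.zeros(4), np.ones((8, 4)))
# A positive that matches the prediction far better than the negatives drives the loss to 0.
confident = info_nce(np.array([10.0, 0.0]),
                     np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]))
```

Since the bound on mutual information is log N minus the loss, it can never exceed log N, which is one way to see why N matters.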

(Figure: feature extractor (patched), context predictor (PixelCNN head), predicted future, positive samples, negative samples, downstream task)
Data-Efficient Image Recognition with Contrastive Predictive Coding

Evaluation Protocol
● Linear Evaluation
● Efficient Classification
● Transfer Learning
Labelled images in ImageNet
(14 million images)
● 1% : 12.8 per class
● 10%: 128 per class
Balanced distribution over classes.
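
A quick check of the per-class counts. The figures of 12.8 and 128 per class match the ILSVRC-2012 training split (about 1.28M images, 1000 classes) rather than the full 14M-image ImageNet; using that split here is an assumption, not something the slide states.

```python
TRAIN_IMAGES = 1_281_167  # assumption: ILSVRC-2012 training split size
NUM_CLASSES = 1000

def labels_per_class(fraction):
    """Average number of labelled images per class when keeping `fraction` of the labels."""
    return TRAIN_IMAGES * fraction / NUM_CLASSES

one_percent = labels_per_class(0.01)
ten_percent = labels_per_class(0.10)
```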

● First paper to show significant improvement on a real dataset
● Top 1 = 71.5% on ImageNet
● Label efficiency becomes an advantage

From v1 to v2
MC: Model Capacity
BU: Bottom-up Spatial Prediction
LN: Layer Normalization
RC: Random Crop
HP: Horizontal Spatial Prediction
PA: Patch Augmentation
MC & LN
● R101 → R161 & increase feat. dim
● BN statistics may cheat

Questions
Is Patch Necessary?
→ Contrastive Consistency is Important
What hyperparameters are sensitive?
→ ResNet-like & Feature Dimension

CMC: Contrastive Multiview Coding
Contrastive Multiview Coding

Qualitative Study
Remarks
● Pretext task does not always translate well
● Skip connections prevent degradation of representation quality
● Model capacity (depth & rep. size) strongly influences quality
Revisiting Self-Supervised Visual Representation Learning

MoCo: Momentum Contrast
Contrastive Learning as Dictionary Look-Up
Dictionary should be large & consistent
● Context: Right Key
Momentum Contrast for Unsupervised Visual Representation Learning

Memory Bank (as Replay Buffer in RL)
● Batch size is limited by hardware
● Maintain all keys in memory O(N)
MoCo
● Dynamic queue rather than memory bank O(N) → O(K)
● Momentum-update the encoder rather than the keys
Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
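
A minimal sketch of the two MoCo ingredients listed above, assuming parameters are plain NumPy arrays. The names `momentum_update` and `KeyQueue` are hypothetical, and the default m = 0.999 is an assumed value, not one given on the slide.

```python
import numpy as np
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied to each parameter array."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """FIFO dictionary holding at most K keys, so memory is O(K) instead of O(N)."""

    def __init__(self, K):
        self.keys = deque(maxlen=K)  # the oldest keys fall out automatically

    def enqueue(self, batch_keys):
        for key in batch_keys:
            self.keys.append(key)

    def as_array(self):
        return np.stack(list(self.keys))

# Toy usage: two mini-batches of 3 keys each, queue size K = 4
queue = KeyQueue(K=4)
queue.enqueue(np.zeros((3, 2)))
queue.enqueue(np.ones((3, 2)))
slow = momentum_update([np.ones(2)], [np.zeros(2)], m=0.9)
```

Because the key encoder changes slowly, keys produced several mini-batches apart stay comparable, which is what keeps the dictionary consistent.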

(Figure labels: Queue; aug 1, aug 2; Key space; Fast, Slow)

● End-to-End: K as batch size

SimCLR: Simple Framework for Contrastive Learning
So far in 2020, Hinton has published
- Stacked Capsule AE
- SimCLR
- Subclass Distillation
A Simple Framework for Contrastive Learning of Visual Representations

● Data augmentation is crucial for contrastive learning
● Non-linear mapping preserves the information
● Larger batch size
● Normalized embedding
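
A minimal NumPy sketch of a contrastive loss over normalized embeddings with in-batch negatives, in the spirit of SimCLR's NT-Xent loss. The function name `nt_xent` and the temperature of 0.5 are assumptions for illustration; the encoder, the projection head and the augmentations are left out.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss for N positive pairs (z1[i], z2[i]); all other views act as negatives."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # normalized embedding
    sim = z @ z.T / temperature                        # cosine similarity / temperature
    np.fill_diagonal(sim, -np.inf)                     # a view is never compared with itself
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), positives].mean()

# Toy usage: 4 instances whose two views agree exactly vs. views that are mismatched
views = np.eye(4)
aligned = nt_xent(views, views)
mismatched = nt_xent(views, np.roll(views, 1, axis=0))
```

Every additional example in the batch adds two more negatives per view, which is one reason a larger batch size helps.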

(Figure labels: aug 1, aug 2; Representation; Compute Loss)

Random Cropping + Color Distortion
Histogram of different crops in two images

Recall the ReID Strong Baseline
● h: triplet embedding
● z: inference embedding & ID loss
● g: batch norm
The contrastive loss induces a loss of information for the downstream task.
(Figure labels: Projection, Representation)
Bag of Tricks and A Strong Baseline for Deep Person Re-identification

Money Talks
● Batch Size
● Training Epoch
● Simple

Transfer to smaller datasets

Training Scheme
Method | Affiliation  | Backbone   | Batch Size | Solver (lr)         | Epoch | Machine
CPC v2 | DeepMind     | ResNet 161 | 512        | Adam + const lr     | 200   | 16 GPUs
CMC    | MIT          | ResNet 50  | 156 ~ 240  | SGD + Stepwise      | 240   | *4 GPUs
MoCo   | FAIR         | ResNet 50  | 256        | SGD + Stepwise      | 200   | 64 GPUs
SimCLR | Google Brain | ResNet 50  | 8,192      | LARS + Cos schedule | 1000  | TPU 128 cores
Large Batch Training of Convolutional Networks
“We find in-batch negative sampling suffices.”

Part II: Semi-Supervised Learning

Teacher-Student
Teacher-Student is a kind of self-training framework
● Train teacher with labelled data D
● Run trained teacher on unlabelled examples D’
● Train a new student on D’
● Finetune student on D
Billion-scale semi-supervised learning for image classification
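
A toy sketch of the four steps above, assuming a nearest-centroid classifier stands in for the deep network. All function names are hypothetical, and the "finetune" step is approximated by averaging with the centroids fitted on D; this is a stand-in, not the procedure of the cited paper.

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    """Toy 'training': one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, x):
    """Assign each sample to the nearest class centroid."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def teacher_student(x_lab, y_lab, x_unlab, num_classes):
    teacher = fit_centroids(x_lab, y_lab, num_classes)      # 1. train teacher on D
    pseudo = predict(teacher, x_unlab)                      # 2. run teacher on D'
    student = fit_centroids(x_unlab, pseudo, num_classes)   # 3. train student on D'
    return 0.5 * student + 0.5 * teacher                    # 4. stand-in for finetuning on D

# Toy data: two well-separated clusters, 4 labelled and 200 unlabelled points
rng = np.random.default_rng(0)
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])
y_true = rng.integers(0, 2, size=200)
x_unlab = centers[y_true] + 0.5 * rng.normal(size=(200, 2))
y_lab = np.array([0, 0, 1, 1])
x_lab = centers[y_lab] + 0.5 * rng.normal(size=(4, 2))
student = teacher_student(x_lab, y_lab, x_unlab, num_classes=2)
```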

Noisy Student
● Train teacher with labelled data D
● Run trained teacher on unlabelled examples D’
● Train an equal-size or larger student on D & D’ with noise added to the student
→ Knowledge Expansion
Self-training with Noisy Student improves ImageNet classification

Noisy Student
Data Noise
● RandAug
→ Local Smoothness
Model Noise
● Dropout
● Stochastic Depth
→ Ensemble teacher
Others
● Data Balancing
Deep Networks with Stochastic Depth,
RandAugment: Practical automated data augmentation with a reduced search space
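
A minimal sketch of stochastic depth as model noise, assuming a residual block of the form x + f(x). The function name and arguments are illustrative. During training the residual branch is dropped at random; at inference it is always kept and scaled by its survival probability.

```python
import numpy as np

def stochastic_depth_block(x, residual_fn, survival_prob, training, rng):
    """Residual block whose residual branch is randomly skipped during training."""
    if training:
        keep = rng.random() < survival_prob
        return x + residual_fn(x) if keep else x
    # Inference: use the expected contribution of the branch
    return x + survival_prob * residual_fn(x)

# Toy usage with a hypothetical residual branch f(x) = 2x
rng = np.random.default_rng(0)
x = np.ones(3)
double = lambda v: 2.0 * v
eval_out = stochastic_depth_block(x, double, survival_prob=0.5, training=False, rng=rng)
always_kept = stochastic_depth_block(x, double, survival_prob=1.0, training=True, rng=rng)
always_dropped = stochastic_depth_block(x, double, survival_prob=0.0, training=True, rng=rng)
```

Each training step therefore samples a shallower sub-network, which is why the slide groups it with Dropout as a source of ensemble-like behavior.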

Noisy Student
Sup.
Semi-Sup.
Performance Margin
● Arch. complexity ~ 10%
● Extra Data ~ 3 to 5%

Semi-Supervised Training Scheme
Method          | Batch Size     | Solver | Epoch     | Unlabelled                  | Devices
Teacher-Student | 1536 (24 x 64) | SGD    | 1x ?      | IG-3B (1500 tags)           | 64 GPUs
Noisy Student   | 512 ~ 2048     | SGD    | 300 - 700 | JFT-300M (18291 categories) | Cloud TPU v3 2048 cores
ImageNet: 14M - 1000 classes

Affinity and Diversity
Affinity and Diversity: Quantifying Mechanisms of Data Augmentation
● Affinity: Distribution shift caused by augmentation
● Diversity: Complexity of augmentation applied
(Both are model-dependent measures)
Increase the effective number of unique training examples

Affinity and Diversity
Performance Boost Tricks
● Decaying learning rate on an appropriate schedule
● Turning off l2 regularization at the right time in training does not hurt performance
● Relaxing architectural constraints mid-training can boost final performance
● Turning augmentations off and fine-tuning on clean data can improve final test accuracy

Conclusion
Insights & Techniques
● Usage of Unlabelled or Pseudo Labelled Data
● Contrastive Loss Extracts Representative Features
● Distillation Squeezes Large-Scale Dataset
● Data Balance
● Dynamic Queue for Negative Samples
● Momentum Update for Encoder
● Non-linear Head for Representation Preservation
● Augmentation Composition
● Constraint Relaxation during Training

References
1. Unsupervised Representation Learning by Predicting Image Rotations
2. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
3. Tracking Emerges by Colorizing Videos
4. Conditional Image Generation with PixelCNN Decoders
5. Representation Learning with Contrastive Predictive Coding
6. Learning Deep Representations of Fine-grained Visual Descriptions
7. Data-Efficient Image Recognition with Contrastive Predictive Coding
8. Contrastive Multiview Coding
9. Revisiting Self-Supervised Visual Representation Learning
10. Momentum Contrast for Unsupervised Visual Representation Learning
11. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
12. A Simple Framework for Contrastive Learning of Visual Representations
13. Billion-scale semi-supervised learning for image classification
14. Self-training with Noisy Student improves ImageNet classification
15. Deep Networks with Stochastic Depth
16. RandAugment: Practical automated data augmentation with a reduced search space
17. Affinity and Diversity: Quantifying Mechanisms of Data Augmentation
