2. There has been a divergence between how we do
pre-training in Vision vs NLP
NLP models are usually pre-trained using masked or autoregressive methods:
(Figures: a masked language model vs. an autoregressive language model; images from Jay Alammar's blog.)
3. Instead, the most successful pre-training in Vision is done using
contrastive methods
5. How can we make Vision pre-training more
similar to NLP pre-training?
6. Masked and autoregressive methods in NLP are at heart
Denoising autoencoders
● They are a class of autoencoder that corrupts the input and asks the model to
predict the uncorrupted version
● For images this would mean applying geometric transformations, color
transformations, masking pixels, shuffling pixels, etc. (a minimal masking sketch follows this list)
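As an illustration (not from the slides), the denoising objective for images can be as simple as zeroing out random pixels and regressing the original image; this PyTorch sketch assumes a generic model and a mean-squared-error loss.

```python
import torch

def mask_pixels(images, mask_ratio=0.5):
    """Corrupt images by zeroing out a random fraction of pixels.

    images: (B, C, H, W) tensor. Returns (corrupted, keep), where a denoising
    autoencoder is trained to predict `images` from `corrupted`.
    """
    b, c, h, w = images.shape
    # One mask value per spatial location, shared across channels (an assumption).
    keep = (torch.rand(b, 1, h, w) > mask_ratio).float()
    corrupted = images * keep
    return corrupted, keep

# Usage: the model sees `corrupted`; the loss compares its output to `images`.
images = torch.randn(4, 3, 32, 32)
corrupted, keep = mask_pixels(images, mask_ratio=0.75)
# reconstruction = model(corrupted)
# loss = ((reconstruction - images) ** 2).mean()
```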
7. Masked image modelling (MIM) has been done using
convolutions
The paper Context Encoders: Feature Learning by Inpainting (2016) pioneered
masked image modelling, using convolutional neural networks to fill in the
masked part of an image.
(Figure: a CNN encoder followed by a CNN decoder.)
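For illustration only, a minimal convolutional encoder-decoder for inpainting could look like the sketch below; the layer sizes are placeholders, not the paper's exact architecture (which also adds an adversarial loss, as noted on the next slide).

```python
import torch
import torch.nn as nn

# A tiny convolutional encoder-decoder for inpainting, in the spirit of
# Context Encoders; the widths and depths here are illustrative.
class TinyContextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, masked_image):
        return self.decoder(self.encoder(masked_image))

# Training pairs a masked image with a reconstruction (and, in the paper,
# adversarial) loss on the missing region.
x = torch.randn(2, 3, 64, 64)
out = TinyContextEncoder()(x)   # same spatial size as the input
```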
8. But the results are very poor…
So the authors had to add an adversarial loss (GAN) to get better visual results,
but even then the fine-tuning accuracies were low by today's standards
10. How to tokenize images the same way as text?
The paper AN IMAGE IS WORTH 16X16 WORDS introduces the standard way to
tokenize images for transformers: just split them into patches of 16 by 16 pixels
and pass them through a linear layer (sketched below).
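A minimal PyTorch sketch of this patch tokenization (the 768-dimensional embedding width is ViT-Base's and is only illustrative):

```python
import torch
import torch.nn as nn

def patchify(images, patch=16):
    """Split (B, C, H, W) images into (B, N, patch*patch*C) flattened patches."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # B, C, H/p, W/p, p, p
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

# Linear projection of flattened patches into token embeddings, as in ViT.
embed = nn.Linear(16 * 16 * 3, 768)
tokens = embed(patchify(torch.randn(1, 3, 224, 224)))  # shape (1, 196, 768)
```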
11. (MAE) Masked Autoencoders Are Scalable Vision
Learners
● With the introduction of ViT, we can do masked image modelling the same
way we do masked language modelling in BERT.
● Unlike BERT, MAE uses an asymmetric design. The encoder only operates
on the visible (unmasked) patches, with no [MASK] tokens, and a lightweight
decoder reconstructs the full signal from the latent representation and [MASK]
tokens (see the masking sketch after this list).
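A simplified sketch of the random-masking step; the shuffle/unshuffle bookkeeping mirrors the idea used by MAE, but the function and variable names here are mine.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the MAE encoder sees only these.

    tokens: (B, N, D). Returns (visible, ids_restore) so the decoder can later
    re-insert learned [MASK] tokens at the dropped positions.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                  # random score per patch
    ids_shuffle = noise.argsort(dim=1)        # random permutation of patch indices
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation for the decoder
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_restore

# With 196 patch tokens and a 75% mask ratio, the encoder processes only 49 tokens.
visible, ids_restore = random_masking(torch.randn(2, 196, 768))
```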
21. Results
The authors do self-supervised pre-training on the ImageNet-1K (IN1K) training
set. Then they do supervised training to evaluate the representations with (i)
end-to-end fine-tuning or (ii) linear probing.
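As a rough sketch of the two protocols (the encoder object, feature dimension, and class count below are placeholders): linear probing freezes the pre-trained encoder and trains only a linear classifier on top, while end-to-end fine-tuning updates all parameters.

```python
import torch.nn as nn

def linear_probe_head(encoder, feature_dim=1024, num_classes=1000):
    """Linear probing: freeze the pre-trained encoder, train only a linear head."""
    for p in encoder.parameters():
        p.requires_grad = False      # encoder features stay fixed
    return nn.Linear(feature_dim, num_classes)

# End-to-end fine-tuning instead leaves requires_grad=True everywhere and
# updates encoder + head together.
```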
Baseline model: ViT-Large
● ViT-Large (ViT-L/16) is the backbone in their ablation study.
● ViT-L is very big and tends to overfit.
● It is very hard to train a supervised ViT-L from scratch, and a good recipe with
strong regularization is needed.
22. We need high masking ratios
● The optimal ratios are surprisingly
high. The ratio of 75% is good for both
linear probing and fine-tuning.
● This is in contrast with BERT (15%)
and similar works in CV (20%–50%)
● For linear probing, the accuracy
increases steadily with the masking
ratio until 75% masking: the accuracy
gap is up to ∼20% (54.6% vs. 73.5%).
For fine-tuning, the results are less
sensitive to the ratios, and a wide
range of masking ratios (40–80%)
works well.
23. Mask Token
● If the encoder uses mask tokens, it
performs worse: its accuracy drops
by 14% in linear probing.
● By removing the mask token from
the encoder, they constrain the
encoder to always see real patches
and thus improve accuracy.
24. Reconstruction target
● Using pixels with per-patch normalization as the target improves accuracy (a normalization sketch follows this list).
● In another variant, the authors perform PCA in the patch space and use the
largest PCA coefficients (96 here) as the target. Doing so degrades accuracy.
● The authors also compare an MAE variant that predicts tokens, the target
used in BEiT. Specifically for this variant, they use the DALLE pre-trained
dVAE as the tokenizer, following BEiT.
● The dVAE tokenizer requires one more pre-training stage, which may depend
on extra data (250M images). The dVAE encoder is a large convolutional
network (40% FLOPs of ViT-L) and adds nontrivial overhead.
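A sketch of the per-patch normalized pixel target; eps is an assumption, and MAE computes the reconstruction loss on masked patches only.

```python
import torch

def normalized_pixel_target(patches, eps=1e-6):
    """Per-patch normalized pixel target: each flattened patch is standardized
    by its own mean and variance before being used as the regression target."""
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()

# target = normalized_pixel_target(patchify(images))  # patchify as sketched earlier
# loss = ((prediction - target) ** 2).mean()          # on masked patches only in MAE
```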
26. Transfer learning experiments
● Object detection and instance segmentation
○ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted to work with FPN.
● Semantic segmentation:
○ Experiments on ADE20K use UperNet and ViT as backbone.
28. Masked Autoencoders As Spatiotemporal Learners
● Basic idea: extend MAE to spatiotemporal learning
29. How to mask spatiotemporal data?
(a): Random sampling that is spacetime-agnostic. (b): Space-only random
sampling, broadcasted to all time steps (“tube” masking). (c): Time-only random
sampling, broadcasted to all spatial locations (“frame” masking). (d): Block-wise
sampling in spacetime, removing large regions (“cube” masking).
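To make the variants concrete, here is an illustrative sketch of how (a) spacetime-agnostic, (b) tube, and (c) frame masking differ when tokens form a (time × space) grid; block-wise "cube" masking is omitted for brevity, and the names below are descriptive only.

```python
import torch

def spacetime_masks(t, n_space, mask_ratio=0.9):
    """Illustrative boolean masks (True = masked) over a (t, n_space) token grid."""
    # (a) spacetime-agnostic: mask random tokens anywhere in the t * n_space grid
    agnostic = torch.rand(t, n_space) < mask_ratio
    # (b) "tube" masking: pick spatial locations once, broadcast across all frames
    tube = (torch.rand(1, n_space) < mask_ratio).expand(t, n_space)
    # (c) "frame" masking: pick whole frames, broadcast across all spatial locations
    frame = (torch.rand(t, 1) < mask_ratio).expand(t, n_space)
    return agnostic, tube, frame

agnostic, tube, frame = spacetime_masks(t=8, n_space=196)
```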
30. What is the optimal masking ratio for spatiotemporal data?
The optimal ratio is ~90%, much higher than for images.