This document summarizes a research paper that proposes TTT-MAE (Test-Time Training with Masked Autoencoders), a method for addressing distribution shift in visual recognition tasks. TTT-MAE uses masked autoencoders as the self-supervised pretext task in test-time training, instead of the rotation prediction used in previous work. Experimental results on datasets such as ImageNet-C and ImageNet-R show that TTT-MAE achieves larger performance gains than prior methods under many types of distribution shift. However, TTT-MAE is slower at test time than directly applying a fixed model. Future work could focus on improving efficiency and generalizing the approach to other tasks.
Test-time training with masked autoencoders improves generalization under distribution shifts
PR-433
Gandelsman, Yossi, et al. "Test-time training with masked autoencoders." Advances in Neural Information Processing Systems 35 (2022): 29374-29385.
주성훈, VUNO Inc.
2023. 4. 16.
1. Research Background
Reference
Sun, Yu, et al. "Test-time training with self-supervision for generalization under distribution shifts." International Conference on Machine Learning. PMLR, 2020.
•https://yueatsprograms.github.io/ttt/home.html
Problem Settings
Generalization under distribution shifts
•Generalization is intrinsically hard without access to training data from the test distribution
• The common practice is to avoid distribution shifts altogether by using a wider training distribution that hopefully contains the test distribution – with more training data or data augmentation.
Geirhos, Robert, et al. "Generalisation in humans and deep neural networks." Advances in neural information processing systems 31 (2018).
[Figure: models trained on one noise type (e.g. salt-and-pepper noise) fail to generalize to another (uniform noise). Hard to know the test distribution!]
Test-time training (Sun et al., ICML 2020)
• The self-supervised pretext task employed by TTT is rotation prediction. This task is limited in generality because it can often be too easy or too hard.
https://yueatsprograms.github.io/ttt/home.html
Autoencoders for representation learning
The most successful work is masked autoencoders (MAE)
•He, Kaiming, et al. "Masked autoencoders are scalable vision learners." CVPR. 2022.
•PR-355
The proposed method simply substitutes MAE for the self-supervised part of TTT.
2. Methods
Design choices - Architecture
• f: MAE encoder (ViT)
• g: MAE decoder (ViT)
• h: main task head (e.g. object recognition); a ViT-Base for ViT probing
• Y-shaped architecture (TTT-MAE): h ∘ f for the main task, g ∘ f for reconstruction (see the sketch below)
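As a concrete illustration, here is a minimal PyTorch sketch of the Y-shaped design. The `TTTMAE` wrapper and its attribute names are hypothetical, not the authors' code; any ViT encoder, decoder, and head implementations can be plugged in.

```python
import torch
import torch.nn as nn

class TTTMAE(nn.Module):
    """Y-shaped model: a shared encoder f with two branches, g and h."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, head: nn.Module):
        super().__init__()
        self.f = encoder  # MAE encoder (ViT), shared by both branches
        self.g = decoder  # MAE decoder (ViT), reconstruction branch
        self.h = head     # main task head (ViT-Base for ViT probing)

    def main_task(self, x: torch.Tensor) -> torch.Tensor:
        return self.h(self.f(x))         # h ∘ f: object recognition

    def reconstruct(self, x_masked: torch.Tensor) -> torch.Tensor:
        return self.g(self.f(x_masked))  # g ∘ f: masked reconstruction
```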
Training-time training: 1. training encoder and decoder
• MAE encoder f and decoder g: ViT-Large, pre-trained for 800 epochs on ImageNet-1k
• ViT probing: train only h, with f frozen. Here, h is a ViT-Base (see the sketch below).
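A minimal sketch of the ViT probing setup, reusing the hypothetical `TTTMAE` wrapper above; the module instances and the optimizer setting here are assumptions, not the paper's values.

```python
# Modules assumed to exist: a pre-trained ViT-Large MAE encoder/decoder
# and a ViT-Base classification head.
model = TTTMAE(vit_large_encoder, vit_decoder, vit_base_head)

# ViT probing: freeze the pre-trained encoder f; only the head h is trained.
for p in model.f.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.h.parameters(), lr=1e-3)  # placeholder lr
```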
Training-time training: 2. training main task head
• MAE encoder f: ViT-Large
• pre-trained for ImageNet-1k reconstruction
• cross entropy loss for classification
• f_0: encoder produced by MAE pre-training
• Augmentation: image cropping and horizontal flips
• No other augmentations (random changes in brightness, contrast, color and sharpness)
• 800 epochs

Training objective, for a training set {(x_i, y_i)} with n samples, main task loss l_m, and main task head h (sketched below):

min_h (1/n) Σ_{i=1..n} l_m(h ∘ f_0(x_i), y_i)
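A sketch of one epoch of this objective with the frozen encoder, reusing the hypothetical `model` and `optimizer` from above; `train_loader` is an assumed standard ImageNet data loader.

```python
criterion = nn.CrossEntropyLoss()  # l_m: main task loss

model.f.eval()   # f_0 stays frozen (ViT probing)
model.h.train()
for x, y in train_loader:          # (x_i, y_i), i = 1..n
    logits = model.h(model.f(x))   # h ∘ f_0(x_i)
    loss = criterion(logits, y)    # l_m(h ∘ f_0(x_i), y_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```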
Test-time training
When a test input x arrives, train f and g on x with the self-supervised reconstruction loss l_s, starting from f_0 and g_0:
• self-supervised reconstruction loss l_s: pixel-wise mean squared error
• random masking (75%)
• SGD, for 20 steps, using a momentum of 0.9, weight decay of 0.2, batch size of 128, and a fixed learning rate of 5e-3
Make a prediction on x as h ∘ f_x(x), where f_x is the encoder after test-time training on x.
Reset the weights to f_0 and g_0 for the next test input (see the loop sketch below).
• By test-time training on the test inputs independently, we do not assume that they come from the same distribution.
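A sketch of the per-input test-time training loop under the hyperparameters above. `mask_75` (random 75% patch masking) and `masked_mse` (pixel-wise MSE on masked patches) are hypothetical helpers, and the batch of 128 is formed by repeating x with independent random masks.

```python
import copy

f0 = copy.deepcopy(model.f.state_dict())  # weights from training-time training
g0 = copy.deepcopy(model.g.state_dict())

def ttt_predict(x):  # x: a single test input of shape (1, C, H, W)
    params = list(model.f.parameters()) + list(model.g.parameters())
    opt = torch.optim.SGD(params, lr=5e-3, momentum=0.9, weight_decay=0.2)
    for _ in range(20):                        # 20 SGD steps per test input
        batch = x.repeat(128, 1, 1, 1)         # batch size 128
        x_masked, mask = mask_75(batch)        # random 75% masking (hypothetical)
        recon = model.g(model.f(x_masked))     # g ∘ f reconstruction
        loss = masked_mse(recon, batch, mask)  # l_s: pixel-wise MSE (hypothetical)
        opt.zero_grad()
        loss.backward()
        opt.step()
    y_hat = model.h(model.f(x)).argmax(dim=-1)  # predict as h ∘ f_x(x)
    model.f.load_state_dict(f0)                 # reset to f_0, g_0
    model.g.load_state_dict(g0)                 # for the next test input
    return y_hat
```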
Optimizer for TTT
Figure 2: We experiment with two optimizers for TTT. MAE (He et al., 2022) uses AdamW for pre-training. But our results (left) show that AdamW for TTT requires early stopping, which is unrealistic for generalization to unknown distributions without a validation set. We instead use SGD, which keeps improving performance even after 20 steps (right).
• Original TTT simply takes the same optimizer setting as during the last epoch of training-time training on the self-supervised task.
• But the learning rate schedule of MAE reaches zero by the end of pre-training, so this setting cannot be reused directly.
• With AdamW, excessive test-time iterations can hurt performance.
• More iterations with SGD consistently improve performance on all distribution shifts (see the configuration sketch below).
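As a configuration sketch (same hypothetical `params` as in the loop above); the AdamW learning rate here is a placeholder, not a value from the paper.

```python
params = list(model.f.parameters()) + list(model.g.parameters())

# AdamW (MAE's pre-training optimizer): improves at first but degrades with
# more TTT steps, so it would need early stopping without a validation set.
# opt = torch.optim.AdamW(params, lr=5e-3)  # placeholder lr

# SGD: keeps improving even after 20 steps, so the paper uses it for TTT.
opt = torch.optim.SGD(params, lr=5e-3, momentum=0.9, weight_decay=0.2)
```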
3. Experimental Results
Calibration on out-of-distribution data
• ImageNet-C applies 15 types of corruption to ImageNet images, each at 5 levels of severity (see the evaluation sketch below)
• D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019.
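A sketch of the resulting evaluation grid; the `evaluate` helper and directory layout are assumptions, while the 15 corruption names are the standard ImageNet-C set.

```python
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate",
    "jpeg_compression",
]  # the 15 corruption types

for corruption in CORRUPTIONS:
    for severity in range(1, 6):  # severity levels 1..5
        acc = evaluate(ttt_predict, f"imagenet-c/{corruption}/{severity}")
        print(corruption, severity, acc)
```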
Main results on ImageNet-C
TTT-MAE achieves higher performance gains than TTT-Rot on all corruptions, on top of their respective baselines.
• Joint Train: ResNet-18, after joint training for rotation prediction and object recognition (baseline for TTT-Rot)
• TTT-Rot: original method (rotation task, ResNet-18)
• Baseline: pre-trained MAE encoder with ViT probing (no TTT)
• TTT-MAE (red) on top of our baseline significantly improves performance.
TTT-MAE on rotation-invariant classes
• Rotation-invariant classes: images are usually taken from top-down views
• TTT-MAE is agnostic to rotation invariance and still helps on these classes.
Design choices - Training setup
1. Fine-tuning: train h ∘ f end-to-end for object classification. This works poorly with TTT.
2. ViT probing: train only h, with f frozen. Here, h is a ViT-Base.
3. Joint training: train both h ∘ f and g ∘ f, by summing their losses together (see the sketch below). This is used by TTT with rotation prediction, but with MAE it performs worse on the ImageNet validation set.
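For design choice 3, a one-step sketch of the summed loss, reusing the hypothetical `criterion`, `mask_75`, and `masked_mse` helpers from above:

```python
# Joint training: optimize l_m + l_s together through the shared encoder f.
x_masked, mask = mask_75(x)
loss_main = criterion(model.h(model.f(x)), y)                # l_m: cross entropy
loss_self = masked_mse(model.g(model.f(x_masked)), x, mask)  # l_s: masked MSE
loss = loss_main + loss_self                                 # summed losses
loss.backward()
```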
Accuracy comparison of three designs (ViT probing, fine-tuning, joint training)
• The first three rows are training-time training only; a fixed model is then applied during testing.
• Joint training does not achieve satisfactory performance on most corruptions.
• Fine-tuning initially performs better than ViT probing, but it is not amenable to TTT.
• TTT-MAE after ViT probing performs the best across all corruption types.
Performance on other ImageNet variants
ImageNet-R
• ImageNet-R is a benchmark dataset for evaluating the robustness of image classification models.
• The dataset contains renditions of ImageNet classes, such as paintings, cartoons, sketches, sculptures, and other artistic styles, rather than natural photographs.
ImageNet-A
• ImageNet-A is a dataset designed to test the robustness of computer vision models against real-world, unmodified images.
• The dataset includes images visually similar to those in ImageNet but with natural challenges such as occlusion, low resolution, and unusual viewpoints.
• Baseline: pre-trained MAE encoder with ViT probing (no TTT)
4. Conclusions
• Main contribution
• The proposal of a new method, TTT-MAE, for addressing the problem of domain shift in visual recognition tasks.
• TTT can alternatively be viewed as one-sample unsupervised domain adaptation (UDA).
• Limitations & future work
• TTT-MAE is slower at test time than the baseline of applying a fixed model (inference speed has not been the focus of this paper); it might be improved through better hyper-parameters, optimizers, training techniques, and architectural designs.
• Studying the generalization of spatial autoencoding to other main tasks and test distributions beyond object recognition and the benchmarks used in this study.
• Exploring test-time training on video streams in human-like environments, where self-supervised learning can take advantage of past frames.
Thank you.