Hello, this is the Deep Learning Paper Reading Group (딥러닝논문읽기모임)!
The paper we introduce today is titled MLP-Mixer.
It was published by the Google Brain team and, at the time of this review, is available only on arXiv.
CNNs are the layers most widely used in computer vision, but recently networks such as the Transformer have begun to enter the vision domain, achieving SOTA in several areas. This paper succeeds in reaching results competitive with the latest models using only multi-layer perceptrons.
허다운 of the image processing team kindly prepared the detailed review of the paper. As always, thank you in advance for your interest!
MLP-Mixer: image processing team deep learning paper review (2021-06-13)
1. MLP-Mixer: An all-MLP Architecture for Vision
Tolstikhin et al., arXiv, 2021
Google Research, Brain Team
Team members: 김상현, 신정호, 고형권
Presenter: 허다운
June 13, 2021
[Image Processing Team]
3. 1. Introduction: Overview
- As the history of computer vision demonstrates, the availability of larger datasets coupled with
increased computational capacity often leads to a paradigm shift.
- While Convolutional Neural Networks (CNNs) have been the de-facto standard for computer
vision, recently Vision Transformers [2] (ViT), an alternative based on self-attention layers,
attained state-of-the-art performance.
- ViT continues the long-lasting trend of removing hand-crafted visual features and inductive
biases from models and relies further on learning from raw data.
- The authors propose the MLP-Mixer architecture, a competitive but conceptually and technically simple
alternative that uses neither convolutions nor self-attention.
4. 2. Mixer Architecture
Figure 1: MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers
contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a
GELU nonlinearity. Other components include skip-connections, dropout, and layer norm on the channels.
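As a concrete reference for Figure 1, here is a minimal sketch of one Mixer layer in PyTorch. This is a hypothetical re-implementation (the official code is in JAX/Flax), and all class and parameter names below are our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two fully-connected layers with a GELU nonlinearity in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerLayer(nn.Module):
    """One Mixer layer: a token-mixing MLP, then a channel-mixing MLP,
    each with a skip-connection and layer norm on the channels."""
    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_patches, tokens_hidden)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channels_hidden)

    def forward(self, x):  # x: (batch, num_patches, channels)
        # Token mixing: transpose so the MLP acts across patches.
        y = self.norm1(x).transpose(1, 2)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel mixing: the MLP acts across the channels of each patch.
        return x + self.channel_mlp(self.norm2(x))
```

For scale, the paper's Mixer-B/16 configuration would correspond roughly to MixerLayer(num_patches=196, channels=768, tokens_hidden=384, channels_hidden=3072).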
6. 3. Experiments
The experiments evaluate three primary quantities:
1) Accuracy on the downstream task.
2) Total computational cost of pre-training.
3) Throughput at inference time.
- Pre-training data
○ ILSVRC2012 ImageNet, and ImageNet-21k: a superset of
ILSVRC2012 that contains 21k classes and 14M images.
○ JFT-300M: 300M examples and 18k classes, used to assess performance at even larger scale
- Downstream tasks
○ ILSVRC2012 “ImageNet”: 1.3M training examples, 1k
classes with the original validation labels and cleaned-up
ReaL labels
○ CIFAR-10/100: 50k examples, 10/100 classes
○ Oxford-IIIT Pets: 3.7k examples, 37 classes
○ Oxford Flowers-102: 2k examples, 102 classes
○ (Evaluation) Visual Task Adaptation Benchmark
(VTAB-1k): 19 diverse datasets, each with 1k training
examples
7. 3. Experiments
- Fine-tuning details
○ SGD with momentum, batch size 512, gradient clipping at global norm 1, a cosine learning rate schedule with a linear warmup,
and do not use weight decay when fine-tuning
○ Fine-tune at higher resolutions than those used during pre-training
- Since the patch resolution is fixed, the number of input patches increases (say from S to S'), which requires modifying the
shape of Mixer's token-mixing MLP blocks; see the sketch after this list.
- Mixer models are fine-tuned at resolution 224 and 448 on the datasets with small and large input images respectively.
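A small sketch of the arithmetic behind this resizing (variable names are our own): doubling the resolution from 224 to 448 with the patch size fixed at 16 quadruples the number of patches, and for the case S' = K²·S the paper describes growing the token-mixing hidden width by K² and initializing the new weights as a block-diagonal replication of the pre-trained ones.

```python
import torch

patch = 16
S     = (224 // patch) ** 2   # 196 patches at pre-training resolution
S_new = (448 // patch) ** 2   # 784 patches at fine-tuning resolution
K2    = S_new // S            # 4, i.e. S' = K^2 * S with K = 2

# The first dense layer of the token-mixing MLP maps S -> hidden; after the
# resolution change it must map K^2*S -> K^2*hidden. A block-diagonal
# replication of the pre-trained matrix gives the new weights.
hidden = 384                            # D_S for Mixer-B/16
W = torch.randn(hidden, S)              # stand-in for pre-trained weights
W_new = torch.block_diag(*([W] * K2))   # shape: (K2*hidden, K2*S)
assert W_new.shape == (K2 * hidden, K2 * S)
```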
- Metrics
○ Computational cost
- Total pre-training time on TPU-v3 accelerators
- Combines three relevant factors: the theoretical FLOPs for each training setup, the computational efficiency on the relevant
training hardware, and the data efficiency
- Throughput in images/sec/core on TPU-v3
○ Model quality
- Top-1 downstream accuracy after fine-tuning
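As a hedged illustration of the throughput metric, one might time inference roughly as follows. This is our own helper, measured on a GPU rather than the TPU-v3 cores the paper reports on, so the absolute numbers would not be comparable:

```python
import time
import torch

@torch.no_grad()
def throughput(model, batch=128, size=224, iters=50, device="cuda"):
    """Return approximate inference throughput in images/sec."""
    model = model.eval().to(device)
    x = torch.randn(batch, 3, size, size, device=device)
    for _ in range(10):                  # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # wait for queued kernels
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)
```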
9. 3. Experiments
- Models
- Attention-based models
- Vision Transformer (ViT) [2]
- HaloNet [3]: a ResNet-like structure with local self-attention layers instead of 3x3 convolutions
(HaloNet-H4 (base 128, Conv-12))
- Convolution-based models
- Big Transfer (BiT) [4]: ResNets optimized for transfer learning
- NFNets [5]: normalizer-free ResNets with several optimizations for ImageNet classification (NFNet-F4+)
- MPL [6] (EfficientNet architectures): pre-trained at very large scale on JFT-300M images, using
meta-pseudo labelling from ImageNet instead of the original labels
- ALIGN [7] (EfficientNet architectures): pre-trains an image encoder and a language encoder on noisy web
image-text pairs in a contrastive way
12. 3. Experiments: Main Results
Figure 2: Left: ImageNet accuracy/training cost Pareto frontier (dashed line) for the SOTA models in Table 2. Models are pre-trained on
ImageNet-21k, or JFT (labelled, or pseudo-labelled for MPL), or web image-text pairs. Mixer is as good as these extremely performant ResNets, ViTs,
and hybrid models, and sits on the frontier with HaloNet, ViT, NFNet, and MPL. Right: Mixer (solid) catches or exceeds BiT (dotted) and ViT (dashed)
as the data size grows. Every point on a curve uses the same pre-training compute; they correspond to pre-training on 3%, 10%, 30%, and 100% of
JFT-300M for 233, 70, 23, and 7 epochs, respectively. Additional points at ∼3B correspond to pre-training on an even larger JFT-3B dataset for the
same number of total steps. Mixer improves more rapidly with data than ResNets, or even ViT. The gap between large Mixer and ViT models shrinks.
13. 3. Experiments: The role of the pre-training dataset size
Figure 2 (right): Mixer (solid) catches or exceeds BiT (dotted) and ViT
(dashed) as the data size grows. Every point on a curve uses the
same pre-training compute; they correspond to pre-training on 3%,
10%, 30%, and 100% of JFT-300M for 233, 70, 23, and 7 epochs,
respectively. Additional points at ∼3B correspond to pre-training on
an even larger JFT-3B dataset for the same number of total steps.
Mixer improves more rapidly with data than ResNets, or even ViT.
The gap between large Mixer and ViT models shrinks.
14. 3. Experiments: The role of the model scale
Table 3: Performance of Mixer and other models from the literature across various model and pre-training dataset scales. “Avg. 5” denotes the average
performance across five downstream tasks. Mixer and ViT models are averaged over three fine-tuning runs; standard deviations are smaller than 0.15.
(‡) Extrapolated from the numbers reported for the same models pre-trained on JFT-300M without extra regularization. ( ) Numbers provided by the
authors of Dosovitskiy et al. [14] through personal communication. Rows are sorted by throughput.
19. 4. Conclusion
- Describe a very simple architecture for vision
- Demonstrate that it is as good as existing state-of-the-art methods in terms of the trade-off
between accuracy and computational resources required for training and inference
- On the practical side, it may be useful to study the features learned by the model and identify the
main differences (if any) from those learned by CNNs and Transformers.
- On the theoretical side, the authors would like to understand the inductive biases hidden in these
various features and eventually their role in generalization.