SlideShare a Scribd company logo
MLP-Mixer: An all-MLP Architecture for Vision
Tolstikhin et al., arXiv, 2021
Google Research, Brain Team
Team members: 김상현, 신정호, 고형권
Presenter: 허다운
June, 13. 2021
[Image Processing Team]
Contents
1. Introduction: Overview
2. Model Architecture
3. Experiments
4. Conclusion
2
1. Introduction: Overview
- As the history of computer vision demonstrates, the availability of larger datasets coupled with
increased computational capacity often leads to a paradigm shift.
3
- While Convolutional Neural Networks (CNNs) have been the de-facto standard for computer
vision, recently Vision Transformers [2] (ViT), an alternative based on self-attention layers,
attained state-of-the-art performance.
- ViT continues the long-lasting trend of removing hand-crafted visual features and inductive
biases from models and relies further on learning from raw data.
- Author proposed MLP-Mixer architecture, a competitive but conceptually and technically simple
alternative, that does not use convolutions or self-attention.
2. Mixer Architecture
4
Figure 1: MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers
contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a
GELU nonlinearity. Other components include: skip-connections,dropout, and layer norm on the channels
Questions?
3. Experiments
1) Accuracy on the downstream task.
2) Total computational cost of pre-training.
3) Throughput at inference time.
6
- Pre-training data
○ ILSVRC2021 ImageNet, and ImageNet-21k: a superset of
ILSVRC2012 that contains 21k classes and 14M images.
○ JFT-300M: 300M examples and 18k classes
- To assess performance at even larger scale
- Downstream tasks
○ ILSVRC2012 “ImageNet”: 1.3M training examples, 1k
classes with the original validation labels and cleaned-up
ReaL labels
○ CIFAR-10/100: 50k examples, 10.100 classes
○ Oxford-IIIT Pets: 3.7k examples, 36 classes
○ Oxford Flowers-102: 2k examples, 102 classes
○ (Evaluation) Visual Task Adaptation Benchmark
(VTAB-1k): 19 diverse datasets, each with 1k training
examples
3. Experiments
7
- Fine-tuning details
○ SGD with momentum, batch size 512, gradient clipping at global norm 1, a cosine learning rate schedule with a linear warmup,
and do not use weight decay when fine-tuning
○ Apply fine-tune at higher resolutions with respect to those used during pre-training
- Since the patch resolution is fixed, the number of input patches (say from S to S’) increased and thus requires modifying the
shape of Mixer’s token-mixing MLP blocks.
- Mixer models are fine-tuned at resolution 224 and 448 on the datasets with small and large input images respectively.
- Metrics
○ Computational cost
- Total pre-training time on TPU-v3 accelerators
- Combines three relevant factors: the theoretical FLOPs for each training setup, the computational efficiency on the relevant
training hardware, and the data efficiency
- Throughput in images/sec/core on TPU-v3
○ Model quality
- Top-1 downstream accuracy after fine-tuning
3. Experiments
8
- Models
3. Experiments
9
- Models
- Attention-based model
- Vision Transformer (ViT) [2]
- HaloNet [3]: ResNet-like structure with local self-attention layers instead of 3x3 convolutions
(HaloNet-H4(base 128, Conv-12))
- Convolution-based model
- Big Transfer (BiT) [4]: ResNets optimized for transfer learning
- NFNets [5]: normalizer-free ResNets with several optimizations for ImageNet classification (NFNet-F4+)
- MPL [6] (EfficientNet architectures): pre-trained at very large-scale on JFT-300M images, using
meta-pseudo labelling from ImageNet instead of the original labels
- ALIGN [7] (EfficientNet architectures): pre-train image encoder and language encoder on noisy web
image text pairs in a contrastive way
Questions?
3. Experiments: Main Results
11
3. Experiments: Main Results
12
Figure 2: Left: ImageNet accuracy/training cost Pareto frontier (dashed line) for the SOTA models in Table 2. Models are pre-trained on
ImageNet-21k, or JFT (labelled, or pseudo-labelled for MPL), orweb image text pairs. Mixer is as good as these extremely performant ResNets, ViTs,
and hybrid models, and sits on frontier with HaloNet, ViT, NFNet, and MPL. Right: Mixer (solid) catches or exceeds BiT (dotted) and ViT (dashed)
as the data size grows. Every point on a curve uses the same pre-training compute; they correspond to pre-training on 3%, 10%, 30%, and 100% of
JFT-300M for 233, 70, 23, and 7 epochs, respectively. Additional points at∼3B correspond to pre-training on an even larger JFT-3B dataset for the
same number of total steps. Mixer improves more rapidly withdata than ResNets, or even ViT. The gap between large Mixer and ViT models shrinks.
3. Experiments: The role of the pre-training dataset size
13
Figure 2: Mixer (solid) catches or exceeds BiT (dotted) and ViT
(dashed) as the data size grows. Every point on a curve uses the
same pre-training compute; they correspond to pre-training on 3%,
10%, 30%, and 100% of JFT-300M for 233, 70, 23, and 7 epochs,
respectively. Additional points at∼3B correspond to pre-training on
an even larger JFT-3B dataset for the same number of total steps.
Mixer improves more rapidly withdata than ResNets, or even ViT.
The gap between large Mixer and ViT models shrinks.
3. Experiments: The role of the model scale
14
Table 3: Performance of Mixer and other models from the literature across various model and pre-training dataset scales. “Avg. 5” denotes the average
performance across five downstream tasks.Mixer and ViT models are averaged over three fine-tuning runs, standard deviations are smaller than 0.15.
(‡) Extrapolated from the numbers reported for the same models pre-trained on JFT-300M without extra regularization. ( ) Numbers provided by
authors of Dosovitskiy et al. [14] through personal communication. Rows are sorted by throughput.
3. Experiments: The role of the model scale
15
Table 3: Performance of Mixer and other models from the literature across various model and pre-training dataset scales. “Avg. 5” denotes the average
performance across five downstream tasks.Mixer and ViT models are averaged over three fine-tuning runs, standard deviations are smaller than 0.15.
(‡) Extrapolated from the numbers reported for the same models pre-trained on JFT-300M without extra regularization. ( ) Numbers provided by
authors of Dosovitskiy et al. [14] through personal communication. Rows are sorted by throughput.
3. Experiments: The role of the model scale
16
3. Experiments: Invariance to input permutations
17
3. Experiments: Visualization
18
4. Conclusion
19
- On the practical side, it may be useful to study the features learned by the model and identify the
main differences (if any) from those learned by CNNs and Transformers.
- Describe a very simple architecture for visio
- Demonstrate that it is as good as existing state-of-the-art methods in terms of the trade-off
between accuracy and computational resources required for training and inference
- On the theoretical side, authors would like to understand the inductive biases hidden in these
various features and eventually their role in generalization.
Thank You
21
References
- [1] Tolstikhin, Ilya, et al. "Mlp-mixer: An all-mlp architecture for vision." arXiv preprint arXiv:2105.01601 (2021).
- [2] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint
arXiv:2010.11929 (2020).
- [3] Vaswani, Ashish, et al. "Scaling local self-attention for parameter efficient visual backbones." arXiv preprint arXiv:2103.12731
(2021).
- [4] Kolesnikov, Alexander, et al. "Big transfer (bit): General visual representation learning." arXiv preprint arXiv:1912.11370 6.2
(2019): 8.
- [5] Brock, Andrew, et al. "High-performance large-scale image recognition without normalization." arXiv preprint arXiv:2102.06171
(2021).
- [6] Pham, Hieu, et al. "Meta pseudo labels." arXiv preprint arXiv:2003.10580 (2020).
- [7] Jia, Chao, et al. "Scaling up visual and vision-language representation learning with noisy text supervision." arXiv preprint
arXiv:2102.05918 (2021).

More Related Content

What's hot

PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
Sungchul Kim
 
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
Sunghoon Joo
 
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
Sunghoon Joo
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
Jinwon Lee
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter Sharing
Jinwon Lee
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
Jinwon Lee
 
Transformer based approaches for visual representation learning
Transformer based approaches for visual representation learningTransformer based approaches for visual representation learning
Transformer based approaches for visual representation learning
Ryohei Suzuki
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
changedaeoh
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Sungchul Kim
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
Sangmin Woo
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
Jinwon Lee
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
Jinwon Lee
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
Jinwon Lee
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
Jinwon Lee
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
Jinwon Lee
 
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
Tahmid Abtahi
 
201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search
DaeJin Kim
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
Sunghoon Joo
 
Road Network Extraction using Satellite Imagery.
Road Network Extraction using Satellite Imagery.Road Network Extraction using Satellite Imagery.
Road Network Extraction using Satellite Imagery.
SUMITRAJ312049
 

What's hot (20)

PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
PR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation LearningPR-305: Exploring Simple Siamese Representation Learning
PR-305: Exploring Simple Siamese Representation Learning
 
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
PR-373: Revisiting ResNets: Improved Training and Scaling Strategies.
 
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
[PR-325] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Tran...
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter Sharing
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
 
Transformer based approaches for visual representation learning
Transformer based approaches for visual representation learningTransformer based approaches for visual representation learning
Transformer based approaches for visual representation learning
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
 
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
A Framework for Scene Recognition Using Convolutional Neural Network as Featu...
 
201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search201907 AutoML and Neural Architecture Search
201907 AutoML and Neural Architecture Search
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
 
Road Network Extraction using Satellite Imagery.
Road Network Extraction using Satellite Imagery.Road Network Extraction using Satellite Imagery.
Road Network Extraction using Satellite Imagery.
 

Similar to Mlp mixer image_process_210613 deeplearning paper review!

自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用
Ryo Iwaki
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
IPT.pdf
IPT.pdfIPT.pdf
IPT.pdf
Manas Das
 
Y34147151
Y34147151Y34147151
Y34147151
IJERA Editor
 
1801.06434
1801.064341801.06434
1801.06434
emil_laurence
 
ImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural NetworksImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks
Willy Marroquin (WillyDevNET)
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classification
ijtsrd
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
Subhajit Sahu
 
Garbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning TechniquesGarbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning Techniques
IRJET Journal
 
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
ssuser4b1f48
 
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Lviv Data Science Summer School
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
IJECEIAES
 
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
Sunghoon Joo
 
IRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature DescriptorIRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET Journal
 
Conv xg
Conv xgConv xg
Conv xg
Nueng Math
 
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
IRJET Journal
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
eSAT Publishing House
 
Practical tips for handling noisy data and annotaiton
Practical tips for handling noisy data and annotaitonPractical tips for handling noisy data and annotaiton
Practical tips for handling noisy data and annotaiton
RyuichiKanoh
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
Devansh16
 

Similar to Mlp mixer image_process_210613 deeplearning paper review! (20)

自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用自然方策勾配法の基礎と応用
自然方策勾配法の基礎と応用
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
IPT.pdf
IPT.pdfIPT.pdf
IPT.pdf
 
Y34147151
Y34147151Y34147151
Y34147151
 
1801.06434
1801.064341801.06434
1801.06434
 
ImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural NetworksImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classification
 
A Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning TrainingA Study of BFLOAT16 for Deep Learning Training
A Study of BFLOAT16 for Deep Learning Training
 
algorithms
algorithmsalgorithms
algorithms
 
Garbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning TechniquesGarbage Classification Using Deep Learning Techniques
Garbage Classification Using Deep Learning Techniques
 
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
NS-CUK Seminar: S.T.Nguyen, Review on "Do We Really Need Complicated Model Ar...
 
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
 
Comparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognitionComparison of convolutional neural network models for user’s facial recognition
Comparison of convolutional neural network models for user’s facial recognition
 
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
PR-187 : MorphNet: Fast & Simple Resource-Constrained Structure Learning of D...
 
IRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature DescriptorIRJET- Crowd Density Estimation using Novel Feature Descriptor
IRJET- Crowd Density Estimation using Novel Feature Descriptor
 
Conv xg
Conv xgConv xg
Conv xg
 
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
IRJET- A Review on Data Dependent Label Distribution Learning for Age Estimat...
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
Practical tips for handling noisy data and annotaiton
Practical tips for handling noisy data and annotaitonPractical tips for handling noisy data and annotaiton
Practical tips for handling noisy data and annotaiton
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 

More from taeseon ryu

VoxelNet
VoxelNetVoxelNet
VoxelNet
taeseon ryu
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
taeseon ryu
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
taeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
taeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
taeseon ryu
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
taeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
taeseon ryu
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
taeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
taeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
taeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
taeseon ryu
 
mPLUG
mPLUGmPLUG
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
taeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
taeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
taeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
taeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
taeseon ryu
 

More from taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 

Recently uploaded

一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 

Recently uploaded (20)

一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 

Mlp mixer image_process_210613 deeplearning paper review!

  • 1. MLP-Mixer: An all-MLP Architecture for Vision Tolstikhin et al., arXiv, 2021 Google Research, Brain Team Team members: 김상현, 신정호, 고형권 Presenter: 허다운 June, 13. 2021 [Image Processing Team]
  • 2. Contents 1. Introduction: Overview 2. Model Architecture 3. Experiments 4. Conclusion 2
  • 3. 1. Introduction: Overview - As the history of computer vision demonstrates, the availability of larger datasets coupled with increased computational capacity often leads to a paradigm shift. 3 - While Convolutional Neural Networks (CNNs) have been the de-facto standard for computer vision, recently Vision Transformers [2] (ViT), an alternative based on self-attention layers, attained state-of-the-art performance. - ViT continues the long-lasting trend of removing hand-crafted visual features and inductive biases from models and relies further on learning from raw data. - Author proposed MLP-Mixer architecture, a competitive but conceptually and technically simple alternative, that does not use convolutions or self-attention.
  • 4. 2. Mixer Architecture 4 Figure 1: MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections,dropout, and layer norm on the channels
  • 6. 3. Experiments 1) Accuracy on the downstream task. 2) Total computational cost of pre-training. 3) Throughput at inference time. 6 - Pre-training data ○ ILSVRC2021 ImageNet, and ImageNet-21k: a superset of ILSVRC2012 that contains 21k classes and 14M images. ○ JFT-300M: 300M examples and 18k classes - To assess performance at even larger scale - Downstream tasks ○ ILSVRC2012 “ImageNet”: 1.3M training examples, 1k classes with the original validation labels and cleaned-up ReaL labels ○ CIFAR-10/100: 50k examples, 10.100 classes ○ Oxford-IIIT Pets: 3.7k examples, 36 classes ○ Oxford Flowers-102: 2k examples, 102 classes ○ (Evaluation) Visual Task Adaptation Benchmark (VTAB-1k): 19 diverse datasets, each with 1k training examples
  • 7. 3. Experiments 7 - Fine-tuning details ○ SGD with momentum, batch size 512, gradient clipping at global norm 1, a cosine learning rate schedule with a linear warmup, and do not use weight decay when fine-tuning ○ Apply fine-tune at higher resolutions with respect to those used during pre-training - Since the patch resolution is fixed, the number of input patches (say from S to S’) increased and thus requires modifying the shape of Mixer’s token-mixing MLP blocks. - Mixer models are fine-tuned at resolution 224 and 448 on the datasets with small and large input images respectively. - Metrics ○ Computational cost - Total pre-training time on TPU-v3 accelerators - Combines three relevant factors: the theoretical FLOPs for each training setup, the computational efficiency on the relevant training hardware, and the data efficiency - Throughput in images/sec/core on TPU-v3 ○ Model quality - Top-1 downstream accuracy after fine-tuning
  • 9. 3. Experiments 9 - Models - Attention-based model - Vision Transformer (ViT) [2] - HaloNet [3]: ResNet-like structure with local self-attention layers instead of 3x3 convolutions (HaloNet-H4(base 128, Conv-12)) - Convolution-based model - Big Transfer (BiT) [4]: ResNets optimized for transfer learning - NFNets [5]: normalizer-free ResNets with several optimizations for ImageNet classification (NFNet-F4+) - MPL [6] (EfficientNet architectures): pre-trained at very large-scale on JFT-300M images, using meta-pseudo labelling from ImageNet instead of the original labels - ALIGN [7] (EfficientNet architectures): pre-train image encoder and language encoder on noisy web image text pairs in a contrastive way
  • 11. 3. Experiments: Main Results 11
  • 12. 3. Experiments: Main Results 12 Figure 2: Left: ImageNet accuracy/training cost Pareto frontier (dashed line) for the SOTA models in Table 2. Models are pre-trained on ImageNet-21k, or JFT (labelled, or pseudo-labelled for MPL), orweb image text pairs. Mixer is as good as these extremely performant ResNets, ViTs, and hybrid models, and sits on frontier with HaloNet, ViT, NFNet, and MPL. Right: Mixer (solid) catches or exceeds BiT (dotted) and ViT (dashed) as the data size grows. Every point on a curve uses the same pre-training compute; they correspond to pre-training on 3%, 10%, 30%, and 100% of JFT-300M for 233, 70, 23, and 7 epochs, respectively. Additional points at∼3B correspond to pre-training on an even larger JFT-3B dataset for the same number of total steps. Mixer improves more rapidly withdata than ResNets, or even ViT. The gap between large Mixer and ViT models shrinks.
  • 13. 3. Experiments: The role of the pre-training dataset size 13 Figure 2: Mixer (solid) catches or exceeds BiT (dotted) and ViT (dashed) as the data size grows. Every point on a curve uses the same pre-training compute; they correspond to pre-training on 3%, 10%, 30%, and 100% of JFT-300M for 233, 70, 23, and 7 epochs, respectively. Additional points at∼3B correspond to pre-training on an even larger JFT-3B dataset for the same number of total steps. Mixer improves more rapidly withdata than ResNets, or even ViT. The gap between large Mixer and ViT models shrinks.
  • 14. 3. Experiments: The role of the model scale 14 Table 3: Performance of Mixer and other models from the literature across various model and pre-training dataset scales. “Avg. 5” denotes the average performance across five downstream tasks.Mixer and ViT models are averaged over three fine-tuning runs, standard deviations are smaller than 0.15. (‡) Extrapolated from the numbers reported for the same models pre-trained on JFT-300M without extra regularization. ( ) Numbers provided by authors of Dosovitskiy et al. [14] through personal communication. Rows are sorted by throughput.
  • 15. 3. Experiments: The role of the model scale 15 Table 3: Performance of Mixer and other models from the literature across various model and pre-training dataset scales. “Avg. 5” denotes the average performance across five downstream tasks.Mixer and ViT models are averaged over three fine-tuning runs, standard deviations are smaller than 0.15. (‡) Extrapolated from the numbers reported for the same models pre-trained on JFT-300M without extra regularization. ( ) Numbers provided by authors of Dosovitskiy et al. [14] through personal communication. Rows are sorted by throughput.
  • 16. 3. Experiments: The role of the model scale 16
  • 17. 3. Experiments: Invariance to input permutations 17
  • 19. 4. Conclusion 19 - On the practical side, it may be useful to study the features learned by the model and identify the main differences (if any) from those learned by CNNs and Transformers. - Describe a very simple architecture for visio - Demonstrate that it is as good as existing state-of-the-art methods in terms of the trade-off between accuracy and computational resources required for training and inference - On the theoretical side, authors would like to understand the inductive biases hidden in these various features and eventually their role in generalization.
  • 21. 21 References - [1] Tolstikhin, Ilya, et al. "Mlp-mixer: An all-mlp architecture for vision." arXiv preprint arXiv:2105.01601 (2021). - [2] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020). - [3] Vaswani, Ashish, et al. "Scaling local self-attention for parameter efficient visual backbones." arXiv preprint arXiv:2103.12731 (2021). - [4] Kolesnikov, Alexander, et al. "Big transfer (bit): General visual representation learning." arXiv preprint arXiv:1912.11370 6.2 (2019): 8. - [5] Brock, Andrew, et al. "High-performance large-scale image recognition without normalization." arXiv preprint arXiv:2102.06171 (2021). - [6] Pham, Hieu, et al. "Meta pseudo labels." arXiv preprint arXiv:2003.10580 (2020). - [7] Jia, Chao, et al. "Scaling up visual and vision-language representation learning with noisy text supervision." arXiv preprint arXiv:2102.05918 (2021).