3. 1. Research Background
The conventional recipe: pre-train, fine-tune, then select the single best model on a held-out validation set and discard the rest.
•Limitations:
•The selected model may not achieve the best possible performance.
•For one, ensembling the outputs of many models can outperform the best single model, albeit at a high computational cost during inference.
•For another, fine-tuning a model on downstream tasks can sometimes reduce out-of-distribution performance.
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
4. 1. Research Background
Approach: average the weights of models fine-tuned independently.
•Averaging several of these models to form a model soup requires no additional training and adds no
cost at inference time.
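To make this concrete, forming a soup amounts to a key-wise mean over the models' weights. A minimal PyTorch sketch (the checkpoint file names are hypothetical):

```python
import torch

def average_checkpoints(paths):
    """Key-wise mean of several fine-tuned checkpoints (a 'model soup')."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Usage (checkpoint names are hypothetical); all models share one architecture:
# model.load_state_dict(average_checkpoints(["ft_seed0.pt", "ft_seed1.pt"]))
```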
5. 1. Research Background
Previous work
•Averaging model weights (interpolated models)
• Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) averages weights along a single optimization trajectory.
• Recent work (Neyshabur et al., 2021) observes that fine-tuned models optimized independently from the same initialization lie in the same basin of the error landscape, inspiring our method.
• Wortsman et al. (2022) average zero-shot and fine-tuned models, finding improvements both in- and out-of-distribution.
Izmailov, Pavel, et al. "Averaging weights leads to wider optima and better generalization." arXiv preprint arXiv:1803.05407 (2018).
Wortsman, Mitchell, et al. "Robust fine-tuning of zero-shot models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
6. 1. Research Background
Error landscape visualizations
[Figure: error landscape visualizations from a shared initialization θ0 ∈ ℝ^d. Left: two models fine-tuned with two different random seeds. Right: two models fine-tuned with two different learning rates.]
•These results suggest that interpolating the weights of two fine-tuned solutions can improve accuracy compared to the individual models.
7. 1. Research Background
Error landscape visualizations
[Figure: error landscape visualizations from a shared initialization θ0 ∈ ℝ^d. Left: two models fine-tuned with two different random seeds. Right: two models fine-tuned with two different learning rates.]
•These results suggest that pairs of solutions whose angle (with respect to the initialization) is closer to 90 degrees see larger accuracy gains along the linear interpolation path.
θ1, θ2: the two fine-tuned solutions. The plotted quantity is the interpolation gain
Acc((1/2)θ1 + (1/2)θ2) − (1/2)(Acc(θ1) + Acc(θ2)).
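As an illustration (not from the slides), both the interpolation gain above and the angle between the two solutions, measured at the initialization, can be computed directly from flattened weight vectors. A minimal NumPy sketch, where acc is an assumed held-out evaluation function:

```python
import numpy as np

def interpolation_gain(acc, theta1, theta2):
    """Accuracy of the midpoint minus the mean accuracy of the endpoints."""
    midpoint = 0.5 * (theta1 + theta2)
    return acc(midpoint) - 0.5 * (acc(theta1) + acc(theta2))

def angle_at_init(theta0, theta1, theta2):
    """Angle (degrees) between theta1 - theta0 and theta2 - theta0."""
    u, v = theta1 - theta0, theta2 - theta0
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```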
8. 1. Research Background
Previous work
•Pre-training and fine-tuning (weight aggregation)
• Shu et al. (ICML 2021) attempt to improve transfer learning by using multiple pretrained models with data-dependent gating.
• Shu, Yang, et al. "Zoo-tuning: Adaptive transfer from a zoo of models." International Conference on Machine Learning. PMLR, 2021.
•Ensembles
• Ovadia et al. (NeurIPS 2019) show that ensembles exhibit high accuracy under distribution shift.
• Gontijo-Lopes et al. (2021) conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to uncorrelated errors and better ensemble accuracy.
Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift." Advances in neural information processing systems 32 (2019).
Gontijo-Lopes, Raphael, Yann Dauphin, and Ekin D. Cubuk. "No one representation to rule them all: Overlapping features of training methods." arXiv preprint arXiv:2110.12899 (2021).
10. 2. Methods
Approach to making a model soup
θi = FineTune(θ0, hi), i ∈ 𝒮 = {1, ..., n}: each model is obtained by fine-tuning the shared initialization θ0 with hyperparameter configuration hi.
•Uniform soup
•Constructed by averaging all fine-tuned models: (1/|𝒮|) Σi∈𝒮 θi.
•As a result, it may include low-performing models produced by poor hyperparameter configurations.
•Learned soup
•Optimizes the model interpolation weights by gradient-based minibatch optimization.
•This procedure requires simultaneously loading all models in memory, which currently hinders its use with large networks.
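A sketch of the uniform and greedy recipes, following the paper's greedy algorithm (the greedy soup, used in the experiments below, keeps a model only if it improves the soup's held-out validation accuracy); evaluate is an assumed helper mapping a state dict to that accuracy:

```python
import torch

def souped(state_dicts):
    """Key-wise mean of a list of state dicts sharing one architecture."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def uniform_soup(state_dicts):
    # Average every fine-tuned model, including low performers.
    return souped(state_dicts)

def greedy_soup(state_dicts, evaluate):
    # Visit models in decreasing order of individual validation accuracy.
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_acc = evaluate(souped(ingredients))
    for sd in ranked[1:]:
        cand_acc = evaluate(souped(ingredients + [sd]))
        if cand_acc >= best_acc:  # keep the model only if the soup improves
            ingredients.append(sd)
            best_acc = cand_acc
    return souped(ingredients)
```

The learned soup would instead treat the interpolation weights as trainable parameters, which is why all models must sit in memory at once.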
12. 3. Experimental Results
•The greedy soup improves over the best model in the hyperparameter sweep by 0.7 percentage points.
•Pretraining: CLIP1) ViT-B/32
•Fine-tuning: hyperparameter sweep, fine-tuning each model on ImageNet.
1) Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
• Five distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
Model soups improve accuracy over the best individual fine-tuned model
13. 3. Experimental Results
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup outperforms the best individual model: with no extra training and no extra compute during inference, we were able to produce a better model.
14. 3. Experimental Results
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup requires fewer models to reach the same accuracy as selecting the best individual model on the held-out validation set.
15. 3. Experimental Results
•The greedy soup improves over the best model in the hyperparameter sweep by 0.5 percentage points.
•Pretraining: ALIGN1) EfficientNet-L2
•Fine-tuning: hyperparameter sweep, fine-tuning each model on ImageNet.
• AdamW with weight decay 0.1, at a resolution of 289 × 289, for 25 epochs
• Linear-probe initialization
• Grid search over learning rate (10⁻⁶, 2×10⁻⁶, 5×10⁻⁶, 10⁻⁵, 2×10⁻⁵), data augmentation, and mixup, obtaining 12 fine-tuned models
• The greedy soup selects 5 of the 12 models
1) Jia, Chao, et al. "Scaling up visual and vision-language representation learning with noisy text supervision." International Conference on Machine Learning. PMLR, 2021.
Model soups improve accuracy over the best individual fine-tuned model
16. 3. Experimental Results
•Pretraining: JFT-3B pre-trained ViT-G/14
•Fine-tuning: hyperparameter sweep, fine-tuning each model on ImageNet.
Model soups improve accuracy over the best individual fine-tuned model
The model soup surpasses the previous state of the art of 90.88% attained by the CoAtNet model (Dai et al., 2021), while requiring 25% fewer FLOPs at inference time.
17. 3. Experimental Results
ViT-G/14 model pre-trained on JFT-3B -> ImageNet fine-tuning
•58 models fine-tuned: we vary the learning rate, decay schedule, loss function, and minimum crop size in the data augmentation, and optionally apply RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), or CutMix (Yun et al., 2019). We also train four models with sharpness-aware minimization (SAM) (Foret et al., 2021).
•Our greedy soup procedure selects 14 of the 58 fine-tuned models.
Model selection using the test set.
• Five distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
18. 3. Experimental Results
Fine-tuning on text classification tasks (BERT, T51)
•Datasets and tasks used:
•MRPC (label: paraphrase or not)
•RTE (label: entailment or not) https://huggingface.co/datasets/SetFit/rte
•CoLA (label: grammatically acceptable or not)
•SST-2, the Stanford Sentiment Treebank (label: negative or positive, movie reviews)
We fine-tune 32 models for each dataset with a random hyper-parameter search over learning rate, batch size, number of epochs and random seed.
1) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020): 1-67.
19. 3. Experimental Results
Fine-tuning on text classification tasks
•Although the gains are less pronounced than in image classification, the greedy soup can still improve performance over the best individual model.
21. 4. Conclusions
• Main contribution
• Our results challenge the conventional procedure of selecting the best model on the
held-out validation set when fine-tuning.
• With no extra compute during inference, we are often able to produce a better
model by averaging the weights of multiple fine-tuned solutions.
• Limitations
• (1) Experiments only cover models pre-trained on large, heterogeneous datasets. Results for ImageNet-22K -> ImageNet fine-tuning are included, but the accuracy gains are weaker than for CLIP or ALIGN -> ImageNet.
• (2) Ensembles have been shown to improve model calibration, but model soups did not.
Thank you.