PR-411
Wortsman, Mitchell, et al. "Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time." International Conference on Machine Learning. PMLR, 2022.
주성훈 (Sunghoon Joo), VUNO Inc.
2022. 11. 13.
1. Research Background
1. Research Background 3
Pre-training, fine-tuning, selecting a single model and discarding the rest
•Limitations :
•The selected model may not achieve the best performance.
•In particular, ensembling outputs of many models can outperform the best single model, albeit at a
high computational cost during inference.
•Moreover, fine-tuning a model on downstream tasks can sometimes reduce out-of-distribution performance.
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
1. Research Background 4
Approach - average the weights of models fine-tuned independently
•Averaging several of these models to form a model soup requires no additional training and adds no
cost at inference time.
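As a minimal sketch of what averaging weights means in practice (PyTorch-style pseudocode, not the authors' implementation; the checkpoint paths and the `model` object are hypothetical):

import torch

def uniform_soup(state_dicts):
    """Element-wise average of several state dicts fine-tuned from the same initialization."""
    soup = {}
    for key in state_dicts[0]:
        # Integer buffers (e.g., num_batches_tracked) would need special handling;
        # here every tensor is simply cast to float and averaged.
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# Hypothetical usage with checkpoints from a fine-tuning sweep:
# paths = ["finetune_lr1e-5.pt", "finetune_lr3e-5.pt", "finetune_seed2.pt"]
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup)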
1. Research Background 5
Previous works
•Averaging model weights (interpolated models)
• Stochastic Weight Averaging (SWA) (Izmailov et al., 2018), which averages
weights along a single optimization trajectory
• Recent work (Neyshabur et al., 2021) observes that fine-tuned models optimized independently from the same initialization lie in the same basin of the error landscape, inspiring our method.
Wortsman, Mitchell, et al. "Robust fine-tuning of zero-shot models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Izmailov, Pavel, et al. "Averaging weights leads to wider optima and better generalization." arXiv preprint arXiv:1803.05407 (2018).
• Wortsman et al. average zero-shot and fine-tuned models, finding improvements in- and out-of-distribution.
1. Research Background 6
Error landscape visualizations
•Figure: error-landscape visualizations around the initialization θ0 ∈ ℝd, for two models fine-tuned with different random seeds and two models fine-tuned with different learning rates.
•These results suggest that interpolating the weights of two fine-tuned solutions can improve accuracy
compared to individual models
1. Research Background 7
Error landscape visualizations
•Figure: the same error-landscape visualizations (initialization θ0 ∈ ℝd; two models fine-tuned with different seeds, two with different learning rates).
•These results suggest that pairs of solutions forming an angle closer to 90 degrees (measured at the initialization θ0) tend to show a larger accuracy gain along the linear interpolation path.
•Accuracy gain of the midpoint between θ1 and θ2: Acc(½ θ1 + ½ θ2) − ½ (Acc(θ1) + Acc(θ2))
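To make this expression concrete, a small sketch assuming sd1 and sd2 are the state dicts of two models fine-tuned from the same θ0, and that an accuracy-evaluation helper exists (both are assumptions, not code from the paper):

import torch

def interpolate(sd1, sd2, alpha):
    # (1 - alpha) * theta_1 + alpha * theta_2, applied per parameter tensor.
    return {k: (1 - alpha) * sd1[k].float() + alpha * sd2[k].float() for k in sd1}

# Midpoint gain from the expression above (evaluate() is an assumed accuracy helper):
# gain = evaluate(interpolate(sd1, sd2, 0.5)) - 0.5 * (evaluate(sd1) + evaluate(sd2))
# Sweeping alpha over [0, 1] traces the linear interpolation path shown in the figures.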
1. Research Background 8
Previous works
•Pre-training and fine-tuning (weight aggregation)
• Shu et al. attempt to improve transfer learning by using multiple pretrained models with data-dependent gating (Shu et al., PMLR 2021).
• Shu, Yang, et al. "Zoo-tuning: Adaptive transfer from a zoo of models." International Conference on Machine Learning. PMLR, 2021.
•Ensembles
• Ovadia et al. (NeurIPS 2019) show that ensembles exhibit high accuracy under distribution shift.
• Gontijo-Lopes et al. conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to
uncorrelated errors and better ensemble accuracy.
Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift." Advances in neural information processing systems 32 (2019).
Gontijo-Lopes, Raphael, Yann Dauphin, and Ekin D. Cubuk. "No one representation to rule them all: Overlapping features of training methods." arXiv preprint arXiv:2110.12899 (2021).
2. Methods
2. Methods 10
Approach to making a model soup
•Fine-tuning produces θi = FineTune(θ0, hi): the model obtained by fine-tuning the shared initialization θ0 with hyperparameter configuration hi, for i ∈ 𝒮 = {1, ..., n}.
•Uniform soup: constructed by averaging all fine-tuned models, θ = (1/n) Σi∈𝒮 θi, so low-performing models from poor hyperparameter configurations may be included.
•Learned soup
•optimizes model interpolation weights by gradient-based minibatch optimization
•This procedure requires simultaneously loading all models in memory, which currently hinders its use with large networks.
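The experiments that follow mostly use the greedy soup, which the paper constructs by visiting models in order of decreasing held-out validation accuracy and keeping each one only if adding it to the running average does not hurt validation accuracy. A minimal sketch of that recipe (the evaluate_on_heldout helper and the checkpoint handling are assumptions, not the authors' code):

import torch

def average(state_dicts):
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def greedy_soup(state_dicts, evaluate_on_heldout):
    """state_dicts must be sorted by held-out validation accuracy, best first."""
    ingredients = [state_dicts[0]]
    best_acc = evaluate_on_heldout(average(ingredients))
    for candidate in state_dicts[1:]:
        acc = evaluate_on_heldout(average(ingredients + [candidate]))
        if acc >= best_acc:          # keep the candidate only if the soup does not get worse
            ingredients.append(candidate)
            best_acc = acc
    return average(ingredients)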
3. Experimental Results
3. Experimental Results 12
•The greedy soup improves over the best model in the hyperparameter sweep by 0.7 percentage points.
•Pretraining: CLIP1) ViT-B/32
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
1) Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
• 5 distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
Model soups improve accuracy over the best individual fine-tuned model
3. Experimental Results 13
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup outperforms the best individual model—with no extra training and no extra compute
during inference, we were able to produce a better model.
3. Experimental Results 14
Performance of the ‘Greedy soup’ for CLIP
•The greedy soup requires fewer fine-tuned models to reach the accuracy obtained by selecting the best individual model on the held-out validation set.
3. Experimental Results 15
•The greedy soup improves over the best model in the hyperparameter sweep by 0.5 percentage points.
•Pretraining: ALIGN1) EfficientNet-L2
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
• AdamW with weight decay of 0.1 at a resolution of 289 × 289 for 25
epochs
• Linear-probe initialization
• Grid search over the learning rate (1 × 10⁻⁶, 2 × 10⁻⁶, 5 × 10⁻⁶, 1 × 10⁻⁵, 2 × 10⁻⁵), data augmentation, and mixup, obtaining 12 fine-tuned models (a sketch of such a sweep follows this list)
• The greedy soup selects 5 of these models
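The slide does not spell out which grid combinations form the 12 fine-tuned models, so the sketch below only illustrates the shape of such a sweep (the configuration names and the fine_tune helper are hypothetical):

from itertools import product

learning_rates = [1e-6, 2e-6, 5e-6, 1e-5, 2e-5]
augmentations = ["baseline", "stronger"]   # placeholder names for the augmentation settings
mixup = [False, True]

grid = [{"lr": lr, "aug": a, "mixup": m}
        for lr, a, m in product(learning_rates, augmentations, mixup)]
# Only a subset of this grid (12 configurations in the paper) is actually fine-tuned:
# models = [fine_tune(align_linear_probe_init, cfg) for cfg in grid]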
1) Jia, Chao, et al. "Scaling up visual and vision-language representation learning with noisy text supervision." International Conference on Machine Learning. PMLR, 2021.
Model soups improve accuracy over the best individual fine-tuned model
3. Experimental Results 16
•Pretraining: JFT-3B pre-trained ViT-G/14
•Fine-tuning: hyperparameter sweep for fine-tuning each model on ImageNet.
Model soups improve accuracy over the best individual fine-tuned model
The resulting model soup surpasses the previous state of the art of 90.88% attained by the CoAtNet model (Dai et al., 2021), while requiring 25% fewer FLOPs at inference time.
3. Experimental Results 17
ViT-G/14 model pre-trained on JFT-3B -> ImageNet fine-tuning
•58 models fine-tuned: We vary the learning rate, decay schedule, loss function, and minimum crop size in the data
augmentation, and optionally apply RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), or CutMix (Yun et
al., 2019). We also train four models with sharpness-aware minimization (SAM) (Foret et al., 2021)
•Our greedy soup procedure selects 14 of the 58 models fine-tuned.
Model selection using test set
• 5 distribution shifts: evaluation on ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.
3. Experimental Results 18
Fine-tuning on text classification tasks (BERT, T51)
•Datasets and tasks used:
• MRPC: label is paraphrase or not
• RTE: label is entailment or not (https://huggingface.co/datasets/SetFit/rte)
• CoLA: label is grammatical acceptability
• SST-2 (Stanford Sentiment Treebank): label is positive or negative (movie reviews)
•We fine-tune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs, and random seed (see the sketch below).
1) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020): 1-67.
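A rough sketch of such a random sweep (the ranges and choices below are illustrative assumptions; the slide only lists which hyperparameters were varied, and fine_tune is hypothetical):

import random

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-6, -4),   # assumed range
        "batch_size": rng.choice([16, 32, 64]),        # assumed choices
        "epochs": rng.choice([3, 5, 10]),              # assumed choices
        "seed": rng.randint(0, 10_000),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(32)]      # 32 fine-tuned models per dataset
# models = [fine_tune(pretrained_checkpoint, cfg) for cfg in configs]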
3. Experimental Results 19
Fine-tuning on text classification tasks
•Although the gains are less pronounced than for image classification, the greedy soup can still improve over the best individual model.
We fine-tune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs, and random seed.
4. Conclusion
4. Conclusions 21
• Main contribution
• Our results challenge the conventional procedure of selecting the best model on the
held-out validation set when fine-tuning.
• With no extra compute during inference, we are often able to produce a better
model by averaging the weights of multiple fine-tuned solutions.
• Limitations
• (1) Experiments cover only models pre-trained on large, heterogeneous datasets. Results for ImageNet-22K -> ImageNet fine-tuning are reported, but the gains are weaker than for CLIP or ALIGN -> ImageNet.
• (2) Ensembling has been reported to improve model calibration, but model soups did not.
Thank you.