PR-422
Wenzel, Florian, et al. "Hyperparameter ensembles for robustness and uncertainty quantification." Advances
in Neural Information Processing Systems 33 (2020): 6514-6527.
Sunghoon Joo, VUNO Inc.
2023. 2. 19.
1. Research Background
1. Research Background 3
Ensembles of neural networks
•Neural networks can form ensembles of models that are diverse and perform well on held-out data.
•Diversity is induced by the multi-modal nature of the loss landscape and randomness in initialization and training.
•Many mechanisms exist to foster diversity, but this paper focuses on combining networks trained with different random initializations and different hyperparameters.
http://florianwenzel.com/files/neurips_poster_2020.pdf
1. Research Background 4
Approach
•Hyper-deep ensembles
• This approach uses a greedy algorithm to create neural network ensembles that leverage diverse hyperparameters and random initialization for improved performance.
•Hyper-batch ensembles
• The paper proposes a parameterization combining that of ‘batch ensemble’ and self-tuning networks, which enables both weight and hyperparameter diversity.
1. Research Background 5
Previous works
•Combining the outputs of several neural networks to improve over their individual performance
• Since the quality of an ensemble hinges on the diversity of its members, many mechanisms were developed to generate diverse ensemble members.
• Cyclical learning-rate schedules (R. Zhang, et al. “Cyclical stochastic gradient MCMC for Bayesian deep learning.” ICLR, 2020.)
• MC dropout (Y. Gal. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” ICML, 2016.)
• Random initialization (deep ensembles)
1. Research Background 6
Previous works
•Batch ensemble (Wen et al., ICLR, 2020)
• Since the quality of an ensemble hinges on the diversity of its members, many mechanisms were developed to generate diverse ensemble members.
• Batch ens: Y. Wen, D. Tran, and J. Ba. BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In ICLR, 2020.
•Not only does ‘batch ensemble’ lead to a memory saving, but it also allows for efficient minibatching, where each datapoint may use a different ensemble member.
$X\big[W \circ (r_k s_k^\top)\big] = \big[(X \circ r_k^\top)\,W\big] \circ s_k^\top$ (with $r_k$ and $s_k$ broadcast along the minibatch dimension)
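To see why this factorized form enables efficient minibatching without ever materializing a member's full weight matrix, here is a minimal NumPy check of the identity above; all names and shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Minimal sketch of the BatchEnsemble factorization (Wen et al., 2020).
rng = np.random.default_rng(0)
n, d_in, d_out = 8, 5, 3            # batch size, input dim, output dim

X = rng.normal(size=(n, d_in))       # minibatch
W = rng.normal(size=(d_in, d_out))   # shared ("slow") weight matrix
r_k = rng.normal(size=(d_in, 1))     # per-member ("fast") vectors for member k
s_k = rng.normal(size=(d_out, 1))

# Member k's full weight is W ∘ (r_k s_k^T): only r_k and s_k are stored per member.
W_k = W * (r_k @ s_k.T)

# Vectorized form: rescale the inputs by r_k and the outputs by s_k instead of
# building W_k explicitly, so a whole minibatch can be pushed through at once.
out_explicit = X @ W_k
out_factored = ((X * r_k.T) @ W) * s_k.T

print("identity holds:", np.allclose(out_explicit, out_factored))
```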
2. Methods
2. Methods 8
Hyper-deep ensembles
•Train κ models by random search (random weight init and random hparam). (line 1)
•Apply hyper_ens to extract K models out of the κ available ones, with K ≪ κ. (line 2)
•For each selected hparam (line 3), retrain with K different weight inits (stratification). (lines 4-8) A sketch of this procedure follows below.
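A hedged Python sketch of the three-step procedure, assuming hypothetical helpers `sample_hparams`, `train_model`, and `ensemble_nll` (validation NLL of the averaged ensemble prediction); the greedy selection is a paraphrase of the paper's hyper_ens step, not its exact implementation.

```python
import random

def hyper_deep_ensemble(sample_hparams, train_model, ensemble_nll, kappa=50, K=4):
    """Hedged sketch of hyper-deep ensembles (all helpers are hypothetical stand-ins).

    sample_hparams() -> a hyperparameter configuration (assumed hashable here)
    train_model(hparams, seed) -> a trained model
    ensemble_nll(models) -> validation NLL of the averaged ensemble prediction
    """
    def greedy_select(candidates, size):
        # hyper_ens-style greedy selection (with replacement): repeatedly add the
        # candidate that gives the lowest validation NLL for the grown ensemble.
        chosen = []
        while len(chosen) < size:
            best = min(candidates,
                       key=lambda cand: ensemble_nll([m for _, m in chosen] + [cand[1]]))
            chosen.append(best)
        return chosen

    # Line 1: random search over both weight initializations and hyperparameters.
    pool = []
    for _ in range(kappa):
        hparams = sample_hparams()
        pool.append((hparams, train_model(hparams, seed=random.randrange(10**9))))

    # Line 2: extract K models out of the kappa available ones (K << kappa).
    selected = greedy_select(pool, K)

    # Lines 3-8: stratification - for each distinct selected hyperparameter config,
    # retrain with K different random weight inits, then run the greedy step again.
    stratified = []
    for hparams in {hp for hp, _ in selected}:
        for _ in range(K):
            stratified.append((hparams, train_model(hparams, seed=random.randrange(10**9))))
    return [model for _, model in greedy_select(stratified, K)]
```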
2. Methods 9
Hyper-batch ensembles
•This combines ideas of batch ensembles (Wen et al., 2020) and self-tuning networks (STNs) (MacKay et al., 2019).
•Batch ens: Y. Wen, D. Tran, and J. Ba. BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In ICLR, 2020.
•Ensemble member k ∈ {1,…, K}
•Weight diversity: $r_k s_k^\top$, $u_k v_k^\top$
2. Methods 10
Hyper-batch ensembles
•Can capture multiple hyperparameters (STNs only cover one hparam).
•Ensemble member k ∈ {1,…, K}
•Weight diversity: $r_k s_k^\top$, $u_k v_k^\top$
M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In ICLR, 2019.
•Scalable local approximations of the best-response function
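As a rough illustration only: the sketch below builds a member-specific weight from the two rank-1 patterns named on this slide plus a hyperparameter-dependent term. How the embedding e(λ_k) of member k's hyperparameters enters (here, a per-output scaling of a second shared matrix Δ) is an assumption made for illustration, not the paper's exact parameterization.

```python
import numpy as np

def member_weight(W, Delta, r_k, s_k, u_k, v_k, e_lambda_k):
    # W, Delta: shared (d_in, d_out) matrices; r_k, u_k: (d_in,); s_k, v_k: (d_out,)
    # e_lambda_k: (d_out,) embedding of member k's hyperparameters (illustrative assumption).
    base = W * np.outer(r_k, s_k)                       # weight diversity, batch-ensemble style
    hyper = (Delta * np.outer(u_k, v_k)) * e_lambda_k   # hyperparameter-dependent shift (assumed form)
    return base + hyper

# Toy usage, just to show how the pieces fit together.
rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W, Delta = rng.normal(size=(d_in, d_out)), rng.normal(size=(d_in, d_out))
r_k, u_k = rng.normal(size=d_in), rng.normal(size=d_in)
s_k, v_k = rng.normal(size=d_out), rng.normal(size=d_out)
e_lambda_k = rng.normal(size=d_out)
print(member_weight(W, Delta, r_k, s_k, u_k, v_k, e_lambda_k).shape)  # (6, 4)
```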
2. Methods 11
Hyper-batch ensembles
•Model parameters are optimized on the training set using the average member cross entropy (= the usual loss for single models).
•Hyperparameters (more precisely, the hyperparameter distribution parameters ξ) are optimized on the validation set using the ensemble cross entropy. This directly encourages diversity between members.
•Training objective (equation shown on the slide; omitted here)
•Hyperparameter distribution for ensemble member k (equation shown on the slide; omitted here)
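To make the two training signals concrete, here is a minimal NumPy sketch of the two losses described above for K members producing class logits; the function names and toy arrays are illustrative, not the paper's implementation.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def average_member_cross_entropy(member_logits, labels):
    # Training-set loss for the model parameters: the usual cross entropy of each
    # member, averaged over the K members (no interaction between members).
    per_member = [-log_softmax(lg)[np.arange(len(labels)), labels].mean()
                  for lg in member_logits]
    return float(np.mean(per_member))

def ensemble_cross_entropy(member_logits, labels):
    # Validation-set loss for the hyperparameter distribution parameters ξ:
    # cross entropy of the AVERAGED predictive distribution, which is small only
    # when members make complementary errors, hence it encourages diversity.
    mean_probs = np.mean([np.exp(log_softmax(lg)) for lg in member_logits], axis=0)
    return float(-np.log(mean_probs[np.arange(len(labels)), labels]).mean())

# Toy usage: K = 3 members, batch of 5 examples, 4 classes.
rng = np.random.default_rng(0)
member_logits = [rng.normal(size=(5, 4)) for _ in range(3)]
labels = rng.integers(0, 4, size=5)
print(average_member_cross_entropy(member_logits, labels))
print(ensemble_cross_entropy(member_logits, labels))
```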
3. Experimental Results
3. Experimental Results 13
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
Table 1: Comparison over CIFAR-100 and Fashion MNIST with MLP and LeNet models. We report means ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings). “single” stands for the best between rand search and Bayes opt. “fixed init ens” is a shorthand for fixed init hyper ens, i.e., a “row” in Figure 2-(left). We separately compare the efficient methods (3 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles respectively (in Appendix C.7.2, we assess the statistical significance of those improvements with a Wilcoxon signed-rank test, paired along settings, datasets and model types).
• Mean ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings)
• Deep ens, single: take the best hyperparameter configuration found by the random search procedure
NLL $= -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i;\theta)$, ECE $= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{1}{B}\sum_{j\in B_i} a_j - \frac{1}{B}\sum_{j\in B_i} y_j\right|$
• $a_j$: the model's predicted probability, $y_j$: the true probability (label)
• $B$: bin size used for binning ($B_i$: the $i$-th bin)
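For reference, a small NumPy sketch of these two metrics; the equal-width confidence binning and the bin-weighted form of ECE are common implementation choices assumed here, since the slide does not pin them down.

```python
import numpy as np

def nll(probs, labels):
    # probs: (N, C) predicted class probabilities; labels: (N,) integer class labels.
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

def ece(probs, labels, n_bins=15):
    # Expected calibration error: bin predictions by confidence and compare the
    # mean confidence with the empirical accuracy in each bin, weighted by bin size.
    conf = probs.max(axis=1)                    # the model's top predicted probability
    correct = (probs.argmax(axis=1) == labels)  # whether the top prediction is right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            total += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return float(total)
```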
3. Experimental Results 14
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
(Table 1, its caption, the annotation bullets, and the NLL/ECE definitions are repeated from the previous slide.)
•Metrics that depend on the predictive uncertainty—negative log-likelihood (NLL) and expected calibration error (ECE)
3. Experimental Results 15
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
(Table 1, its caption, and the annotation bullets are repeated from slide 13.)
3. Experimental Results 16
Large-scale setting
Table 2: Performance of ResNet-20 (upper table) and Wide ResNet-28-10 (lower table) models on CIFAR-10/100. We separately compare the efficient methods (2 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles.
• Hyper-deep ens: 100 trials of random search
• Deep ens, single: take the best hyperparameter configuration found by the random search procedure
•Performance improves across the whole range of ensemble sizes
•Fixing the ensemble size to four:
3. Experimental Results 17
Large-scale setting
Average ensemble-member metrics:
CIFAR-100 (NLL, ACC)=(0.904, 0.788)
•The joint training in ‘hyper-batch ens’ leads to complementary ensemble members
3. Experimental Results 18
Training time and memory cost
•Both in terms of the number of parameters and training time, hyper-batch ens is about twice as costly as batch ens.
3. Experimental Results 19
Calibration on out of distribution data
•30 types of corruptions to the images of CIFAR-10-C
• D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019.
3. Experimental Results 20
Calibration on out of distribution data
•The mean accuracies are similar for all ensemble methods, whereas hyper-batch ens shows more robustness than batch ens, as it typically leads to smaller worst-case values.
4. Conclusion
4. Conclusions 22
• Main contribution
• Hyper-deep ensembles.
• We define a greedy algorithm to form ensembles of neural networks exploiting two sources of diversity:
varied hyperparameters and random initialization. It is a simple, strong baseline that we hope will be used in
future research.
• Hyper-batch ensembles.
• Both the ensemble members and their hyperparameters are learned end-to-end in a single training
procedure, directly maximizing the ensemble performance.
• It outperforms batch ensembles while keeping their original memory compactness and efficient
minibatching for parallel training and prediction.
• Future works
• Towards more compact parametrization
• Architecture diversity
Thank you.