1. PR-422
Wenzel, Florian, et al. "Hyperparameter ensembles for robustness and uncertainty quantification." Advances
in Neural Information Processing Systems 33 (2020): 6514-6527.
주성훈, VUNO Inc.
2023. 2. 19.
3. 1. Research Background
Ensembles of neural networks
•Neural networks can form ensembles of models that are diverse and perform well on held-out data.
•Diversity is induced by the multi-modal nature of the loss landscape and randomness in initialization
and training.
•Many mechanisms exist to foster diversity, but this paper focuses on combining networks trained with different random weight initializations and different hyperparameters.
http://florianwenzel.com/files/neurips_poster_2020.pdf
4. 1. Research Background
Approach
•Hyper-deep ensembles
• This approach utilizes a greedy algorithm to create neural network ensembles that leverage diverse hyperparameters and random
initialization for improved performance.
•Hyper-batch ensembles
• We propose a parameterization combining that of ‘batch ensemble’ and self-tuning networks, which enables both weight and hyperparameter diversity.
5. 1. Research Background
Previous works
•Combining the outputs of several neural networks to improve over the performance of any single network
• Since the quality of an ensemble hinges on the diversity of its members, many mechanisms were developed to generate diverse ensemble
members.
R. Zhang, et al. “Cyclical stochastic gradient MCMC for Bayesian deep learning.” ICLR, 2020.
• Cyclical learning-rate schedules
• MC dropout
Y. Gal and Z. Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” ICML, 2016.
• Random initialization (Deep ensemble)
6. 1. Research Background
Previous works
•Batch ensemble (Wen et al., ICLR, 2020)
•Batch ens
• Y. Wen, D. Tran, and J. Ba. BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. ICLR, 2020.
•Not only does ‘Batch ensemble’ lead to a memory saving,
but it also allows for efficient minibatching, where each
datapoint may use a different ensemble member.
$X\,[W \circ (r_k s_k^\top)] = [(X \circ r_k^\top)\,W] \circ s_k^\top$
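A minimal NumPy sketch of this identity (function and variable names are illustrative, not the authors' code): the member-specific weight W ∘ (r_k s_k^⊤) never has to be materialized, because scaling the inputs by r_k and the outputs by s_k gives the same result.

```python
import numpy as np

def batch_ensemble_dense(X, W, r_k, s_k):
    """Dense layer for ensemble member k in the batch ensemble parameterization.

    X   : (batch, d_in)  mini-batch routed to member k
    W   : (d_in, d_out)  weights shared by all members
    r_k : (d_in,)        member-specific rank-1 input factor
    s_k : (d_out,)       member-specific rank-1 output factor
    Computes X @ (W * outer(r_k, s_k)) without building the member weight.
    """
    return ((X * r_k) @ W) * s_k

# Quick check of the identity on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
W = rng.normal(size=(5, 3))
r_k, s_k = rng.normal(size=5), rng.normal(size=3)
assert np.allclose(batch_ensemble_dense(X, W, r_k, s_k),
                   X @ (W * np.outer(r_k, s_k)))
```

Because only the shared W plus two small vectors per member are stored, K members cost far less memory than K independent weight matrices, and different rows of a mini-batch can be routed to different members in a single pass.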
8. 2. Methods
Hyper-deep ensembles
•Train κ models by random search (random weight initialization and random hyperparameters). (line 1)
•Apply hyper_ens to select K models out of the κ available ones, with K ≪ κ. (line 2)
•For each selected hyperparameter configuration (line 3), retrain with K different weight initializations (stratification). (lines 4-8)
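A schematic Python sketch of the three stages above; train_model, ensemble_val_nll, and sample_hparams are assumed helper functions, and the code mirrors the structure of the algorithm rather than reproducing the authors' implementation.

```python
def hyper_deep_ensemble(train_model, ensemble_val_nll, sample_hparams,
                        kappa=50, K=4):
    """Schematic hyper-deep ensembles recipe (illustrative only).

    train_model(hparams, seed)        -> a trained model
    ensemble_val_nll(list_of_models)  -> validation NLL of the averaged prediction
    sample_hparams()                  -> one random hyperparameter configuration
    """
    # Line 1: random search -- kappa models, each with its own random
    # hyperparameters and its own random weight initialization (seed).
    candidates = []
    for seed in range(kappa):
        hp = sample_hparams()
        candidates.append((hp, train_model(hp, seed)))

    # Line 2: hyper_ens -- greedily grow an ensemble of K members (with
    # replacement), each step adding the candidate that lowers val NLL most.
    selected = []
    for _ in range(K):
        selected.append(min(
            candidates,
            key=lambda c: ensemble_val_nll([m for _, m in selected + [c]])))

    # Lines 3-8: stratification -- retrain every selected hyperparameter
    # configuration under K fresh weight initializations; a final hyper_ens
    # pass over this stratified pool (omitted here) picks the final members.
    return [(hp, train_model(hp, seed))
            for hp, _ in selected for seed in range(K)]
```

Keeping K ≪ κ means the final ensemble stays small while the random search explores a much larger pool of hyperparameter configurations.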
9. 2. Methods
Hyper-batch ensembles
•This combines ideas of batch ensembles (Wen et al., 2020) and self-tuning networks (STNs) (MacKay et al., 2019).
•Batch ens
• Y. Wen, D. Tran, and J. Ba. BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. ICLR, 2020.
•Ensemble member k ∈ {1,…, K}
•Weight diversity: $r_k s_k^\top$, $u_k v_k^\top$
10. 2. Methods
Hyper-batch ensembles
•Can capture multiple hyperparameters (STNs only cover one hparam).
•Ensemble member k ∈ {1,…, K}
•Weight diversity: $r_k s_k^\top$, $u_k v_k^\top$ (combined into member-specific weights in the sketch below)
M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. ICLR, 2019.
•STNs learn scalable local approximations of the best-response function (mapping hyperparameters to optimal weights).
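To make the parameterization concrete, here is a hedged NumPy sketch of how member k's weight matrix could be assembled from shared parameters and the rank-1 factors above; the shared matrix Delta, the embedding e(λ_k), and the column-wise broadcasting are assumptions for illustration, not the paper's exact equation.

```python
import numpy as np

def hyper_batch_member_weight(W, Delta, r_k, s_k, u_k, v_k, e_lambda_k):
    """Illustrative weight matrix for ensemble member k.

    W, Delta   : (d_in, d_out) parameters shared by all members (Delta assumed)
    r_k, u_k   : (d_in,)  member-specific rank-1 input factors
    s_k, v_k   : (d_out,) member-specific rank-1 output factors
    e_lambda_k : (d_out,) embedding of member k's hyperparameters (assumed shape)
    The first term gives weight diversity as in batch ensemble; the second term
    makes the weights a function of the member's own hyperparameters, as in STNs.
    (Column-wise scaling by e_lambda_k is an illustrative assumption.)
    """
    return (W * np.outer(r_k, s_k)
            + (Delta * np.outer(u_k, v_k)) * e_lambda_k)
```

Per the conclusions, hyper-batch ensembles keep batch ensemble's memory compactness and efficient minibatching, so in practice the member weights would not be materialized explicitly as done here for readability.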
11. 2. Methods
Hyper-batch ensembles
•Model parameters are optimized on the training set using the average member cross entropy (= the usual loss for single
models).
•Hyperparameters (more precisely the hyperparameter distribution parameters ξ) are optimized on the validation set
using the ensemble cross entropy. This directly encourages diversity between members.
•Training objective and hyperparameter distribution for ensemble member k (equations shown on the slide; the alternating optimization is sketched below)
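A self-contained NumPy sketch of the two losses just described (the toy data and helper names are illustrative): the average member cross entropy is the training objective for the model parameters, while the ensemble cross entropy, evaluated on validation data, is the objective for the hyperparameter distribution parameters ξ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def average_member_cross_entropy(member_probs, y):
    """Training objective for the model parameters: each member is fit like a
    usual single model, and the per-member losses are averaged."""
    return float(np.mean([-np.log(p[np.arange(len(y)), y]).mean()
                          for p in member_probs]))

def ensemble_cross_entropy(member_probs, y):
    """Validation objective for the hyperparameter distribution parameters xi:
    it scores the averaged prediction, so it improves when members make
    complementary errors (this is what encourages diversity)."""
    p_ens = np.mean(member_probs, axis=0)
    return float(-np.log(p_ens[np.arange(len(y)), y]).mean())

# Toy illustration: K = 2 members, 8 examples, 3 classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=8)
member_probs = [softmax(rng.normal(size=(8, 3))) for _ in range(2)]
print(average_member_cross_entropy(member_probs, y),
      ensemble_cross_entropy(member_probs, y))
```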
13. 3. Experimental Results
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
Table 1: Comparison over CIFAR-100 and Fashion MNIST with MLP and LeNet models. We report means ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings). “single” stands for the best between rand search and Bayes opt. “fixed init ens” is a shorthand for fixed init hyper ens, i.e., a “row” in Figure 2-(left). We separately compare the efficient methods (3 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles respectively (in Appendix C.7.2, we assess the statistical significance of those improvements with a Wilcoxon signed-rank test, paired along settings, datasets and model types).
• Mean ± standard errors (over the 3 random seeds and pooled over the 2 tuning settings)
• Deep ens, single: take the best hyperparameter configuration found by the random search
procedure
$\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i; \theta)$

$\mathrm{ECE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{1}{B}\sum_{j\in B_i} a_j - \frac{1}{B}\sum_{j\in B_i} y_j\right|$

• $a_j$: model's predicted probability, $y_j$: actual probability
• $B$: bin size used for binning ($B_i$: the $i$-th bin)
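A small NumPy sketch of the two metrics (illustrative, not the paper's evaluation code). It uses the common form of ECE that weights each bin by the fraction of examples falling into it, which coincides with the formula above when every bin holds the same number of examples.

```python
import numpy as np

def nll(probs, y):
    """Negative log-likelihood of the true labels under the predicted probabilities."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def ece(confidences, correct, n_bins=15):
    """Expected calibration error with equal-width confidence bins.

    confidences : (N,) predicted probability of the predicted class (a_j)
    correct     : (N,) 1.0 if the prediction was right, else 0.0 (y_j)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            err += in_bin.mean() * gap  # weight the bin by its share of the data
    return err

# Tiny usage example with made-up predictions over 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=10)
y = rng.integers(0, 3, size=10)
preds = probs.argmax(axis=1)
print(nll(probs, y), ece(probs.max(axis=1), (preds == y).astype(float)))
```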
14. 3. Experimental Results
Focus on small-scale models - MLP and LeNet on Fashion MNIST & CIFAR-100
•Metrics that depend on the predictive uncertainty—negative log-likelihood (NLL) and expected calibration error (ECE)
16. 3. Experimental Results
Large-scale setting
Table 2: Performance of ResNet-20 (upper table) and Wide ResNet-28-10 (lower table) models on CIFAR-10/100. We separately compare the efficient methods (2 rightmost columns) and we mark in bold the best results (within one standard error). Our two methods hyper-deep/hyper-batch ensembles improve upon deep/batch ensembles.
• Hyper-deep ens: 100 trials of random search
• Deep ens, single: take the best hyperparameter
configuration found by the random search procedure
•Performance improves across the whole range of ensemble sizes
•Fix the ensemble size to four:
17. 3. Experimental Results
Large-scale setting
Average ensemble-member metrics:
CIFAR-100 (NLL, ACC)=(0.904, 0.788)
•The joint training in ‘hyper-batch ens’ leads to complementary ensemble members
18. 3. Experimental Results
Training time and memory cost
•Both in terms of the number of parameters and training time, hyper-batch ens is about twice as costly as batch
ens.
19. 3. Experimental Results
Calibration on out of distribution data
•30 types of corruptions applied to the images of CIFAR-10 (CIFAR-10-C)
• D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019.
20. 3. Experimental Results
Calibration on out of distribution data
•The mean accuracies are similar for all ensemble methods, whereas hyper-batch ens is more robust than batch ens, as it typically leads to smaller worst-case values
22. 4. Conclusions
• Main contribution
• Hyper-deep ensembles.
• We define a greedy algorithm to form ensembles of neural networks exploiting two sources of diversity:
varied hyperparameters and random initialization. It is a simple, strong baseline that we hope will be used in
future research.
• Hyper-batch ensembles.
• Both the ensemble members and their hyperparameters are learned end-to-end in a single training
procedure, directly maximizing the ensemble performance.
• It outperforms batch ensembles while keeping their original memory compactness and efficient
minibatching for parallel training and prediction.
• Future works
• Towards more compact parametrization
• Architecture diversity
Thank you.