2. Contents
1. Introduction & Related Work
2. (Conclusion)
3. Empirical Evaluation
4. Pitfalls and Limitations
5. Conclusion
3. Introduction & Related Work
• What is ‘Model Calibration’?
• The degree to which the confidence scores produced by a model reflect its true predictive uncertainty
• Calibration is important in safety-critical applications such as autonomous driving, medical diagnosis, and forecasting
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
https://arxiv.org/abs/1706.04599
PR-075: On Calibration of Modern Neural Networks (2017)
4. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
Video by Geoff Pleiss
[Figure: a neural network maps an input x to logits z(x); the softmax e^{z_k(x)} / Σ_i e^{z_i(x)} turns them into a confidence p̂(x), which is assigned to one of five equal-width bins: [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1].]
5. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: seven predictions fall in the [0.8, 1) bin; six are correct (O) and one is wrong (X).]
Avg. conf : 0.86
Accuracy : 0.86
6. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: seven predictions fall in the [0.8, 1) bin; five are correct (O) and two are wrong (X).]
Avg. conf : 0.86
Accuracy : 0.71
Gap : 0.15
7. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: worked ECE example with 15 predictions spread over three occupied bins.]
Bin          n   Avg. conf   Accuracy   Gap
[0.4, 0.6)   2   0.55        0.50       0.05
[0.6, 0.8)   6   0.74        0.67       0.07
[0.8, 1.0)   7   0.86        0.71       0.15
Expected Calibration Error (ECE) : 0.11 (Naeini et al. 2015)
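This number is just the bin-size-weighted average of the per-bin gaps; a quick sanity check (the gaps shown are themselves rounded, so the exact 0.11 comes from the unrounded statistics):

```python
# (sample count, |accuracy - avg. conf| gap) for the three occupied bins above
bins = [(2, 0.05), (6, 0.07), (7, 0.15)]
n = sum(count for count, _ in bins)                # 15 predictions in total
ece = sum(count * gap for count, gap in bins) / n  # bin-size-weighted gap
print(f"ECE = {ece:.3f}")
```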
8. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Reliability Diagrams
acc(B_m) = (1/|B_m|) Σ_{i∈B_m} 1[ŷ_i = y_i],   conf(B_m) = (1/|B_m|) Σ_{i∈B_m} p̂_i
• Expected Calibration Error (ECE)
ECE = Σ_{m=1}^{M} (|B_m|/n) · |acc(B_m) − conf(B_m)|
• Negative log likelihood (NLL)
ℒ = − Σ_{i=1}^{n} log π̂(y_i | x_i)
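A sketch of how the ECE above can be computed in practice, using equal-width binning over the confidence range (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def expected_calibration_error(conf, pred, labels, n_bins=15):
    """ECE = sum_m |B_m|/n * |acc(B_m) - conf(B_m)| over equal-width bins."""
    correct = (pred == labels)
    # Assign each prediction to a bin by its confidence.
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    n = len(conf)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```

For example, four predictions at confidence 0.9 of which three are correct fall into one bin and give ECE = |0.75 − 0.9| = 0.15.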
9. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Temperature Scaling
q̂_i = max_k σ_SM(z_i / T)^(k)
A single scalar temperature T > 0 rescales the logits; it is fit by minimizing NLL on a validation set, and T > 1 softens overconfident softmax outputs without changing the predicted class.
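A minimal sketch of fitting the temperature: a grid search over T minimizing validation NLL (helper names are illustrative; in practice a proper optimizer is used rather than a grid):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Pick the scalar T that minimizes NLL on held-out (validation) logits."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 200)
    def nll(T):
        p = softmax(logits, T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)
```

For an overconfident model the fitted T comes out greater than 1, shrinking all confidences; since z_i / T preserves the argmax, accuracy is unchanged.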
10. Introduction & Related Work
• Why should ‘Model Calibration’ be considered in safety-critical applications?
• Autonomous driving : the model predicts “plastic bag” at 55% conf., while other sensors indicate “person” at 90% conf.
• Medical diagnosis : the model predicts “cyst” at 55% conf., while a clinician diagnoses HCC (hepatocellular carcinoma) at 90% conf.
11. Introduction & Related Work
• Some works suggest a trend for larger, more accurate models to be worse calibrated.
• Recent research has focused on improving accuracy rather than calibration.
• architecture size, amount of training data, and computing power
• Architecture : MLP-Mixer, ViT, …
• Training approaches : SimCLR, ResNext-WSL, CLIP, …
Q. Do past results on calibration extend to current state-of-the-art models?
14. (Conclusion)
• Modern image models are well calibrated across distribution shift despite being designed with a focus on
accuracy.
• Model size and pretraining amount do not fully account for the performance differences between families.
• Architecture is a major determinant of calibration!
• MLP-Mixer and ViT (non-conv) are among the best-calibrated models both in-distribution and OOD.
• Further work on the influence of architectural inductive biases on calibration and OOD robustness will be necessary to tell whether these results generalize.
15. Empirical Evaluation
• In-Distribution Calibration
• Several recent model families (MLP-Mixer, ViT, and BiT) are both highly accurate and well-calibrated.
• A zero-shot model, CLIP, is well-calibrated given its accuracy.
16. Empirical Evaluation
• In-Distribution Calibration
• Temperature scaling reveals consistent properties of model families.
• The poor calibration of past models can often be remedied by post-hoc recalibration. (temperature scaling)
→ Does a difference between models remain after recalibration?
→ The most recent architectures are better calibrated than past models even after temperature scaling.
• SimCLR and ResNeXt-WSL tend to be worse calibrated than BiT.
• MLP-Mixer and ViT (non-conv) can perform as well as BiT.
17. Empirical Evaluation
• In-Distribution Calibration
• Differences between families are not explained by model size or pretraining amount
• Disentangle how the differences between model families affect their calibration properties
• Model size
• Prior work : Larger neural networks are worse calibrated and have lower classification error.
• This study : ViT models are better calibrated than BiT models.
• Changing the size of a BiT model cannot move it into the Pareto set of ViT models.
18. Empirical Evaluation
• In-Distribution Calibration
• Differences between families are not explained by model size or pretraining amount
• Disentangle how the differences between model families affect their calibration properties
• Model pretraining
• BiT models pretrained on ImageNet (1.3M images), ImageNet-21K (12.8M), and JFT-300M (300M)
• Neither the amount of pretraining data nor the duration of pretraining has a consistent effect on calibration. (ViT, MLP-Mixer < BiT < SimCLR)
20. Empirical Evaluation
• Accuracy and Calibration Under Distribution Shift
• To what degree do the insights from in-distribution and ImageNet-C calibration transfer to natural OOD data?
• Prior work : Better in-distribution calibration and accuracy do not predict better calibration under distribution shift.
• This study : The performance on several natural OOD datasets is largely consistent with that on ImageNet.
21. Empirical Evaluation
• Relating Accuracy and Calibration Within Model Families
• How to compare individual models within a family, where one model is more accurate but worse
calibrated, and the other is less accurate but better calibrated.
→ Which model should a practitioner choose for a safety-critical application?
• The cost structure of the specific application
• The scenario of selective prediction
• One can choose to ignore the model prediction (“abstain”) at a fixed cost if the prediction confidence is low, rather
than risking a (more costly) prediction error.
22. Empirical Evaluation
• Relating Accuracy and Calibration Within Model Families
• Total cost : a linear combination of the misclassification cost and the abstention cost at a given cost ratio (x-axis) and abstention rate (y-axis).
• Blue regions : The higher-accuracy model (R152x4) achieves a lower cost than the better-calibrated model (R50x1).
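The total-cost computation behind such a plot can be sketched as follows (a hypothetical helper; thresholding on confidence is one simple abstention rule, and the cost units are illustrative):

```python
import numpy as np

def selective_total_cost(conf, correct, threshold, abstain_cost, error_cost=1.0):
    """Abstain (at a fixed cost) whenever confidence is below the threshold;
    otherwise predict and pay error_cost for every misclassification."""
    abstain = conf < threshold
    errors = ~abstain & ~correct
    return abstain.sum() * abstain_cost + errors.sum() * error_cost
```

Sweeping the threshold and the abstain/error cost ratio traces out cost surfaces like the one compared for R152x4 vs. R50x1.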
23. Pitfalls and Limitations
• There are two sources of bias
• From estimating ECE by binning : always negative
• From the finite sample size used to estimate the per-bin statistics : always positive
• A larger number of bins → fewer points per bin → higher variance in each bin → positive bias of ECE
24. Pitfalls and Limitations
• More formally,
• To estimate acc(B_i) precisely, we need a number of samples inversely proportional to the standard deviation √(p_i(1 − p_i) / |B_i|), where p_i is the expected accuracy in B_i.
• Bias : models with extreme average accuracies (i.e. close to 0 or 1) are less biased than models with an accuracy close to 0.5.
• If we estimate (𝔼[accuracy(B_i)] − 𝔼[confidence(B_i)])² for any bin i with n_i samples:
bias = (1/n_i) (𝕍[A] + 𝕍[C] − 2 Cov(C, A))
• where A = 1[Y ∈ arg max f(X)] and C = max f(X)
• Higher accuracy models have a lower bias not only due to higher accuracy (lower 𝕍 𝑨 ), but also because their
outputs correlate more with the correct label (higher covariance).
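The positive finite-sample bias can be demonstrated with a small simulation (a sketch, not from the paper): a synthetic predictor that is perfectly calibrated by construction, so its true ECE is 0, yet the binned estimate is positive and grows as bins get emptier.

```python
import numpy as np

def binned_ece(conf, correct, n_bins):
    """Standard binned ECE estimator: sum_m |B_m|/n * |acc(B_m) - conf(B_m)|."""
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    n = len(conf)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=500)
correct = rng.uniform(size=500) < conf  # P(correct) = conf: perfectly calibrated

# True ECE is 0, but both estimates are positive, and more bins means
# fewer samples per bin and hence a larger positive bias.
ece_10 = binned_ece(conf, correct, 10)
ece_100 = binned_ece(conf, correct, 100)
```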
25. Conclusion
• Modern image models are well calibrated across distribution shift despite being designed with a focus on
accuracy.
• Model size and pretraining amount do not fully account for the performance differences between families.
• Architecture is a major determinant of calibration!
• MLP-Mixer and ViT (non-conv) are among the best-calibrated models both in-distribution and OOD.
• Further work on the influence of architectural inductive biases on calibration and OOD robustness will be necessary to tell whether these results generalize.