2. Contents
1. Introduction & Related Work
2. (Conclusion)
3. Empirical Evaluation
4. Pitfalls and Limitations
5. Conclusion
3. Introduction & Related Work
• What is ‘Model Calibration’?
• The degree to which the confidence scores produced by a model reflect its true predictive uncertainty
• Calibration is important in safety-critical applications such as autonomous driving, medical diagnosis, and forecasting
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
https://arxiv.org/abs/1706.04599
PR-075: On Calibration of Modern Neural Networks (2017)
4. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
Video by Geoff Pleiss
[Figure: a neural network maps an input x to logits z(x); the softmax e^{z_k(x)} / Σ_i e^{z_i(x)} turns them into a confidence p̂(x), which is assigned to one of five equal-width bins: [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1].]
5. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: seven predictions fall in the [0.8, 1) bin; six are correct (O) and one is wrong (X).]
Avg. conf : 0.86
Accuracy : 0.86
6. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: seven predictions fall in the [0.8, 1) bin; five are correct (O) and two are wrong (X).]
Avg. conf : 0.86
Accuracy : 0.71
Gap : 0.15
7. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Expected Calibration Error (ECE)
[Figure: worked ECE example with 15 predictions spread over three occupied bins.]
Bin          n   Avg. conf   Accuracy   Gap
[0.4, 0.6)   2   0.55        0.50       0.05
[0.6, 0.8)   6   0.74        0.67       0.07
[0.8, 1.0)   7   0.86        0.71       0.15
Expected Calibration Error (ECE) : 0.11 (Naeini et al. 2015)
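This number is just the bin-size-weighted average of the per-bin gaps; a quick sanity check (the gaps shown are themselves rounded, so the exact 0.11 comes from the unrounded statistics):

```python
# (sample count, |accuracy - avg. conf| gap) for the three occupied bins above
bins = [(2, 0.05), (6, 0.07), (7, 0.15)]
n = sum(count for count, _ in bins)                # 15 predictions in total
ece = sum(count * gap for count, gap in bins) / n  # bin-size-weighted gap
print(f"ECE = {ece:.3f}")
```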
8. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Reliability Diagrams
acc(B_m) = (1/|B_m|) Σ_{i∈B_m} 1[ŷ_i = y_i],   conf(B_m) = (1/|B_m|) Σ_{i∈B_m} p̂_i
• Expected Calibration Error (ECE)
ECE = Σ_{m=1}^{M} (|B_m|/n) · |acc(B_m) − conf(B_m)|
• Negative log likelihood (NLL)
ℒ = − Σ_{i=1}^{n} log π̂(y_i | x_i)
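A sketch of how the ECE above can be computed in practice, using equal-width binning over the confidence range (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def expected_calibration_error(conf, pred, labels, n_bins=15):
    """ECE = sum_m |B_m|/n * |acc(B_m) - conf(B_m)| over equal-width bins."""
    correct = (pred == labels)
    # Assign each prediction to a bin by its confidence.
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    n = len(conf)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.sum() / n * gap
    return ece
```

For example, four predictions at confidence 0.9 of which three are correct fall into one bin and give ECE = |0.75 − 0.9| = 0.15.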
9. Introduction & Related Work
• On Calibration of Modern Neural Networks, ICML 2017. (PR-075)
• Temperature Scaling
q̂_i = max_k σ_SM(z_i / T)^(k)
A single scalar temperature T > 0 rescales the logits; it is fit by minimizing NLL on a validation set, and T > 1 softens overconfident softmax outputs without changing the predicted class.
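A minimal sketch of fitting the temperature: a grid search over T minimizing validation NLL (helper names are illustrative; in practice a proper optimizer is used rather than a grid):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Pick the scalar T that minimizes NLL on held-out (validation) logits."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 200)
    def nll(T):
        p = softmax(logits, T)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)
```

For an overconfident model the fitted T comes out greater than 1, shrinking all confidences; since z_i / T preserves the argmax, accuracy is unchanged.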
10. Introduction & Related Work
• Why should ‘Model Calibration’ be considered in safety-critical applications?
• Autonomous driving : the model predicts “plastic bag” at 55% conf., while other sensors indicate “person” at 90% conf.
• Medical diagnosis : the model predicts “cyst” at 55% conf., while a clinician diagnoses HCC (hepatocellular carcinoma) at 90% conf.
11. Introduction & Related Work
• Some works suggest a trend for larger, more accurate models to be worse calibrated.
• Recent research has focused on improving accuracy rather than calibration.
• architecture size, amount of training data, and computing power
• Architecture : MLP-Mixer, ViT, …
• Training approaches : SimCLR, ResNext-WSL, CLIP, …
Q. Do past results on calibration extend to current state-of-the-art models?
14. (Conclusion)
• Modern image models are well calibrated across distribution shift despite being designed with a focus on
accuracy.
• Model size and pretraining amount do not fully account for the performance differences between families.
• Architecture is a major determinant of calibration!
• MLP-Mixer and ViT (non-conv) are among the best-calibrated models both in-distribution and OOD.
• Further work on the influence of architectural inductive biases on calibration and OOD robustness will be necessary to tell whether these results generalize.
15. Empirical Evaluation
• In-Distribution Calibration
• Several recent model families (MLP-Mixer, ViT, and BiT) are both highly accurate and well-calibrated.
• A zero-shot model, CLIP, is well-calibrated given its accuracy.
16. Empirical Evaluation
• In-Distribution Calibration
• Temperature scaling reveals consistent properties of model families.
• The poor calibration of past models can often be remedied by post-hoc recalibration. (temperature scaling)
→ Does a difference between models remain after recalibration?
→ The most recent architectures are better calibrated than past models even after temperature scaling.
• SimCLR and ResNeXt-WSL tend to be worse calibrated than BiT.
• MLP-Mixer and ViT (non-conv) can perform as well as BiT.
17. Empirical Evaluation
• In-Distribution Calibration
• Differences between families are not explained by model size or pretraining amount
• Disentangle how the differences between model families affect their calibration properties
• Model size
• Prior work : Larger neural networks are worse calibrated and have lower classification error.
• This study : ViT models are better calibrated than BiT models.
• Changing the size of a BiT model cannot move it into the Pareto set of ViT models.
18. Empirical Evaluation
• In-Distribution Calibration
• Differences between families are not explained by model size or pretraining amount
• Disentangle how the differences between model families affect their calibration properties
• Model pretraining
• BiT models pretrained on ImageNet (1.3M images), ImageNet-21K (12.8M), and JFT-300M (300M)
• Neither the amount of pretraining data nor the duration of pretraining has a consistent effect on calibration. (ViT, MLP-Mixer < BiT < SimCLR)
20. Empirical Evaluation
• Accuracy and Calibration Under Distribution Shift
• To what degree do the insights from in-distribution and ImageNet-C calibration transfer to natural OOD data?
• Prior work : Better in-distribution calibration and accuracy do not predict better calibration under distribution shift.
• This study : The performance on several natural OOD datasets is largely consistent with that on ImageNet.
21. Empirical Evaluation
• Relating Accuracy and Calibration Within Model Families
• How to compare individual models within a family, where one model is more accurate but worse
calibrated, and the other is less accurate but better calibrated.
→ Which model should a practitioner choose for a safety-critical application?
• The cost structure of the specific application
• The scenario of selective prediction
• One can choose to ignore the model prediction (“abstain”) at a fixed cost if the prediction confidence is low, rather
than risking a (more costly) prediction error.
22. Empirical Evaluation
• Relating Accuracy and Calibration Within Model Families
• Total cost : a linear combination of the misclassification cost and the abstention cost at a given cost ratio (x-axis) and abstention rate (y-axis).
• Blue regions : The higher-accuracy model (R152x4) achieves a lower cost than the better-calibrated model (R50x1).
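The total-cost computation behind such a plot can be sketched as follows (a hypothetical helper; thresholding on confidence is one simple abstention rule, and the cost units are illustrative):

```python
import numpy as np

def selective_total_cost(conf, correct, threshold, abstain_cost, error_cost=1.0):
    """Abstain (at a fixed cost) whenever confidence is below the threshold;
    otherwise predict and pay error_cost for every misclassification."""
    abstain = conf < threshold
    errors = ~abstain & ~correct
    return abstain.sum() * abstain_cost + errors.sum() * error_cost
```

Sweeping the threshold and the abstain/error cost ratio traces out cost surfaces like the one compared for R152x4 vs. R50x1.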
23. Pitfalls and Limitations
• There are two sources of bias
• From estimating ECE by binning : always negative
• From the finite sample size used to estimate the per-bin statistics : always positive
• A larger number of bins → fewer points per bin → higher variance in each bin → positive bias of ECE
24. Pitfalls and Limitations
• More formally,
• To estimate acc(B_i) precisely, we need a number of samples inversely proportional to the standard deviation √(p_i(1 − p_i) / |B_i|), where p_i is the expected accuracy in B_i.
• Bias : models with extreme average accuracies (i.e. close to 0 or 1) are less biased than models with an accuracy close to 0.5.
• If we estimate (𝔼[accuracy(B_i)] − 𝔼[confidence(B_i)])² for any bin i with n_i samples:
bias = (1/n_i) (𝕍[A] + 𝕍[C] − 2 Cov(C, A))
• where A = 1[Y ∈ arg max f(X)] and C = max f(X)
• Higher accuracy models have a lower bias not only due to higher accuracy (lower 𝕍 𝑨 ), but also because their
outputs correlate more with the correct label (higher covariance).
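The positive finite-sample bias can be demonstrated with a small simulation (a sketch, not from the paper): a synthetic predictor that is perfectly calibrated by construction, so its true ECE is 0, yet the binned estimate is positive and grows as bins get emptier.

```python
import numpy as np

def binned_ece(conf, correct, n_bins):
    """Standard binned ECE estimator: sum_m |B_m|/n * |acc(B_m) - conf(B_m)|."""
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    n = len(conf)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=500)
correct = rng.uniform(size=500) < conf  # P(correct) = conf: perfectly calibrated

# True ECE is 0, but both estimates are positive, and more bins means
# fewer samples per bin and hence a larger positive bias.
ece_10 = binned_ece(conf, correct, 10)
ece_100 = binned_ece(conf, correct, 100)
```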
25. Conclusion
• Modern image models are well calibrated across distribution shift despite being designed with a focus on
accuracy.
• Model size and pretraining amount do not fully account for the performance differences between families.
• Architecture is a major determinant of calibration!
• MLP-Mixer and ViT (non-conv) are among the best-calibrated models both in-distribution and OOD.
• Further work on the influence of architectural inductive biases on calibration and OOD robustness will be necessary to tell whether these results generalize.