[Explained] "Partial Success in Closing the Gap between Human and Machine Vision" (Geirhos et al., 2021)
Cognitive Informatics Lab.
Dept. of Intelligence Science and Technology,
Graduate School of Informatics, Kyoto University
Sou Yoshihara (M2)
arXiv:2106.07411
Abstract: 3 findings
l The longstanding robustness gap between humans and CNNs is closing
l There is still a substantial image-level consistency gap, meaning that
humans make different errors than models
l In many cases, human-to-model consistency improves when training
dataset size is increased by one to three orders of magnitude
This evaluation is open-sourced as a benchmark to track future progress.
(https://github.com/bethgelab/model-vs-human/)
Introduction
l Currently, models are routinely matching and in many cases even outperforming humans
l At the same time, it is becoming increasingly clear that models systematically exploit shortcuts
shared between training and test data
l Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to
more challenging testing conditions (e.g. real-world scenarios) (Geirhos et al., 2020, arXiv:2004.07780)
Toy example of shortcut learning in neural networks: when trained on a simple dataset of stars and moons, a standard fully connected neural network learns a shortcut strategy, classifying based on the location of the objects (stars in the top right or bottom left; moons in the top left or bottom right) rather than their shape (Geirhos et al., 2020).
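As a concrete illustration, here is a minimal synthetic sketch of that location shortcut. Everything in it (dataset size, 16 × 16 images, a logistic-regression probe instead of the original fully connected network) is an illustrative assumption, not the authors' setup:

```python
# A location-confounded dataset: class membership is perfectly predictable
# from WHERE the object patch sits, even though the "objects" are pure noise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, quadrants):
    # Place a shapeless noise patch in one of the allowed quadrants.
    X = np.zeros((n, 16, 16))
    for img in X:
        q = quadrants[rng.integers(len(quadrants))]
        r0 = 0 if q in ("tl", "tr") else 8
        c0 = 0 if q in ("tl", "bl") else 8
        img[r0:r0 + 8, c0:c0 + 8] = rng.random((8, 8))
    return X.reshape(n, -1)

# "Stars": top-right or bottom-left; "moons": top-left or bottom-right.
X = np.vstack([sample(500, ("tr", "bl")), sample(500, ("tl", "br"))])
y = np.r_[np.zeros(500), np.ones(500)]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # ~1.0 despite shapeless "objects": a pure location shortcut
```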
Out-of-Distribution (OOD) data
l Out-of-Distribution (OOD) data:
Testing models on more challenging test cases where there is still a ground truth category, but
certain image statistics differ from the training distribution
l Previous works
○ ImageNet-C (Hendrycks et al., 2019)
○ ImageNet-Sketch (Wang et al., 2019)
○ Stylized-ImageNet (Geirhos et al., 2019)
They lack human comparison data
The authors tested human observers in a lab on OOD datasets
(85K psychophysical trials across 90 participants)
(Figure: example corruptions from ImageNet-C)
17 OOD datasets
(Figure: example images for the parametric distortions, applied to ImageNet images)
○ colour vs. grayscale
○ low contrast
○ high-pass
○ low-pass/blurring
○ phase noise
○ true power spectrum vs. power equalisation
○ true vs. opponent colour (false colour)
○ rotation
○ Eidolon I, II, III
○ uniform noise
3 axes of models
l Objective function
(supervised vs. self-supervised, adversarially trained, and CLIP’s joint
language-image training)
l Architecture
(convolutional vs. vision transformer)
l Training dataset size
(ranging from 1M to 1B images)
Psychophysical experiments
l Psychophysical experiments in a lab
○ 90 observers were tested in a darkened chamber
○ 16 categories (e.g. chair, dog, airplane)
○ 22” monitor with 1920 × 1200 pixels resolution (refresh rate: 120 Hz)
○ Viewing distance: 107 cm
○ Target images at the center subtended 3 × 3 degrees of visual angle (see the size computation below)
○ Each image was presented for 200 ms, followed by a 1/f noise backward mask
Measures against COVID-19 risks were taken.
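For reference, the on-screen size implied by these numbers can be checked with a little trigonometry (flat-screen geometry assumed; this computation is not from the paper):

```python
# Physical size of a 3-degree stimulus viewed from 107 cm.
import math

distance_cm = 107.0
angle_deg = 3.0
size_cm = 2 * distance_cm * math.tan(math.radians(angle_deg / 2))
print(f"{size_cm:.1f} cm")  # ~5.6 cm on screen
```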
Models
l 16 categories:
The 1000-class decision vector was mapped to those 16 classes using the WordNet hierarchy (Miller, 1995); a sketch of such a mapping follows the list below.
l 52 models
○ 24 standard ImageNet-trained CNNs
○ 8 self-supervised models
○ 6 Big Transfer models
○ 5 adversarially trained models
○ 5 vision transformers
○ 2 semi-weakly supervised models
○ Noisy Student
○ CLIP
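A minimal sketch of such a WordNet-based mapping is below. It assumes NLTK (≥ 3.3) with the WordNet corpus installed; taking the first noun synset per category name and averaging the probability mass are simplifying assumptions here, and the authors' model-vs-human repository implements its own mapping.

```python
# Map ImageNet synsets (wnids) onto the 16 entry-level categories via
# WordNet hypernyms, then aggregate a 1000-way probability vector.
import numpy as np
from nltk.corpus import wordnet as wn

CATEGORIES = ["airplane", "bear", "bicycle", "bird", "boat", "bottle",
              "car", "cat", "chair", "clock", "dog", "elephant",
              "keyboard", "knife", "oven", "truck"]
# Simplification: take the first noun synset for each category name.
CAT_SYNSETS = {c: wn.synsets(c, pos="n")[0] for c in CATEGORIES}

def category_of(wnid):
    """16-class label whose synset is an ancestor of the ImageNet synset
    `wnid` (e.g. 'n02084071' = dog), or None if it maps to no category."""
    syn = wn.synset_from_pos_and_offset("n", int(wnid[1:]))
    ancestors = set(syn.closure(lambda s: s.hypernyms())) | {syn}
    for cat, cat_syn in CAT_SYNSETS.items():
        if cat_syn in ancestors:
            return cat
    return None

def map_decision(probs_1000, wnids):
    """Aggregate 1000-way probabilities per category (mean here; an
    assumption) and return the winning 16-class label."""
    scores = {}
    for cat in CATEGORIES:
        idx = [i for i, w in enumerate(wnids) if category_of(w) == cat]
        if idx:
            scores[cat] = float(np.mean(probs_1000[idx]))
    return max(scores, key=scores.get)
```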
Metrics
l OOD accuracy (averaged across conditions and datasets)
l Accuracy difference A(m): mean absolute difference between human and model accuracy
l Observed consistency O(m): average of the binary agreement o_{m,h}(s_{d,c}), which is 1 if humans and a model are either both right or both wrong on an image, 0 otherwise
l Error consistency E(m): tracks whether there is above-chance consistency (next slide)
(Notation: d: dataset, c: condition, h: human, m: model, s: sample, i.e. an image)
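A minimal sketch of the first two metrics, assuming binary per-image correctness vectors and per-condition accuracies (the data layout is an assumption; this is not the authors' code):

```python
import numpy as np

def accuracy_difference(acc_human, acc_model):
    """A(m): mean absolute accuracy gap, averaged across conditions."""
    return float(np.mean(np.abs(np.asarray(acc_human) - np.asarray(acc_model))))

def observed_consistency(correct_h, correct_m):
    """O(m): mean of o_{m,h}(s) -- 1 when both are right or both wrong."""
    h, m = np.asarray(correct_h, bool), np.asarray(correct_m, bool)
    return float(np.mean(h == m))
```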
Error consistency E(m)
l Error consistency E(m) is Cohen's kappa: it indicates whether the observed consistency is larger than what could have been expected by chance

E(m) = (c_obs − c_exp) / (1 − c_exp)

c_exp: the consistency expected for two independent binomial decision makers with matched accuracies, i.e. consistency arising purely by chance
Cohen's kappa agreement levels:
< 0: no agreement
0–0.20: slight
0.21–0.40: fair
0.41–0.60: moderate
0.61–0.80: substantial
0.81–1: almost perfect
(Cohen's kappa, Wikipedia)
(Eq. from Geirhos et al., 2020, arXiv:2006.16736)
c_obs: observed consistency; c_exp: expected consistency
Why error consistency?
e.g. two independent decision makers with accuracy p = 95% each will have at least 90% observed consistency, since c_exp = p² + (1 − p)² = 0.95² + 0.05² ≈ 0.905 (intuitively, both get most images correct, so the observed overlap is high even without any systematic similarity)
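A minimal sketch of error consistency on simulated data, checking the 95%-accuracy example above:

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    c_obs = np.mean(a == b)                    # observed consistency
    p_a, p_b = a.mean(), b.mean()              # accuracies
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)  # expected by chance
    return (c_obs - c_exp) / (1 - c_exp)       # Cohen's kappa

# Two independent 95%-accurate decision makers: c_exp = 0.905, so high
# observed consistency alone is expected; kappa should be near 0.
rng = np.random.default_rng(0)
a = rng.random(100_000) < 0.95
b = rng.random(100_000) < 0.95
print(round(error_consistency(a, b), 3))  # ~0.0
```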
Results on OOD datasets
The OOD robustness gap between human and machine vision is closing (top),
but an image-level consistency gap remains (bottom, especially (d)).
(Figure legend: humans; standard supervised CNNs; self-supervised models; adversarially trained models; vision transformers; Noisy Student; BiT; SWSL; CLIP. ↓ marks models trained on large-scale datasets.)
Robustness across models
Self-supervised models
SimCLR variants (SimCLR-x1, SimCLR-x2,
SimCLR-x4) show strong generalisation
improvements on uniform noise, low contrast,
and high-pass images
This is quite remarkable given that SimCLR
models were trained on a different set of
augmentations (random crop with flip and
resize, colour distortion, and Gaussian blur)
→ Is the defining factor the objective function or the choice of augmentations?
https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html
An illustration of SimCLR
Objective function vs. the choice of augmentations?
(Figure legend: triangles: self-supervised models; stars: supervised baselines; blue: ResNet x1; green: ResNet x4; red diamonds: humans)
Self-supervised models vs.
Augmentation-matched supervised baseline models
Augmentations:
random crop with flip and resize
colour distortion
Gaussian blur
The augmentation scheme (rather than the
self-supervised objective) indeed made the
crucial difference:
augmentation-matched supervised baselines
show just the same generalisation behaviour.
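For concreteness, here is a sketch of that augmentation stack in torchvision (≥ 0.8); the parameter values follow commonly used SimCLR settings and are assumptions here, not taken from the paper:

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop + resize
    transforms.RandomHorizontalFlip(),      # flip
    # Colour distortion: strong jitter applied most of the time,
    # plus occasional conversion to grayscale.
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```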
Robustness across models
Adversarially trained models
The stronger the model is trained
adversarially (darker shades of blue), the
more susceptible it becomes to (random)
image degradations.
A simple rotation by 90 degrees leads to a
50% drop in classification accuracy.
Adversarial robustness seems to come
at the cost of increased vulnerability to
large-scale perturbations.
Adversarial examples (Goodfellow et al., 2014)
Adversarial training increases shape bias
There is a relationship between shape bias and the degree of adversarial training
shape bias = (correct shape decisions) / (correct shape decisions + correct texture decisions)
Stimuli: texture–shape cue-conflict images (each combines the shape of one class with the texture of another)
Made by Style Transfer (Gatys et al., 2016)
1,200 images
(16 classes, 75 images per shape label)
URL: https://github.com/rgeirhos/texture-vs-shape/tree/master/stimuli/style-transfer-preprocessed-512 (Geirhos et al., 2019)
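A minimal sketch of this formula, assuming one recorded decision per cue-conflict image together with its shape and texture labels (the data layout is an assumption):

```python
def shape_bias(decisions, shape_labels, texture_labels):
    """Fraction of shape decisions among all shape-or-texture decisions;
    images classified as neither label are ignored, as in the formula."""
    n_shape = sum(d == s for d, s in zip(decisions, shape_labels))
    n_texture = sum(d == t for d, t in zip(decisions, texture_labels))
    return n_shape / (n_shape + n_texture)
```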
Robustness across models
Vision transformers
The best vision transformer (ViT-L
trained on 14M images) even exceeds
human OOD accuracy
Vision transformers trained on 1M
images (light green) are already better
than standard convolutional models
Higher shape bias (results not shown here; cf. Tuli et al., 2021; Naseer et al., 2021)
(Dosovitskiy et al., 2020, arXiv:2010.11929)
Robustness across models
Standard models trained on
more data: BiT-M, SWSL, Noisy
Student
The biggest effect on OOD
robustness simply comes from
training on larger datasets, not from
advanced architectures
BiT: Big Transfer (Kolesnikov et al., 2019, arXiv:1912.11370)
○ ResNet152x4: every layer is 4× wider
○ BiT-M: trained on ImageNet-21k (14M images)
SWSL: Semi-Weakly Supervised Learning (Yalniz et al., 2019, arXiv:1905.00546)
○ trained on 940M images
Noisy Student (Xie et al., 2020, arXiv:1911.04252)
○ trained on 300M images
Robustness across models
CLIP is "special" in three ways:
- more data: trained on 400M image–text pairs
- novel objective: joint language-image supervision
- non-standard architecture: a vision transformer backbone
It is the most human-like model across all metrics.
(Radford et al., 2021, arXiv:2103.00020)
Error consistency between models (“sketch” dataset)
Do they make errors on the same individual images?
Whether a standard supervised model, a self-supervised model, an adversarially trained model, or a vision transformer: all of these models make highly systematic errors, i.e. errors that are highly consistent with one another.
Humans show a very different pattern of errors.
The boundary between humans and some data-rich models, especially CLIP (400M images) and SWSL (940M images), is blurry: these models make more human-like errors than standard models.
Error consistency analysis on a single dataset, "sketch" (for other datasets see the original paper, Figures 9, 11, 12, 13, 14)
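Such a matrix can be produced by computing Cohen's kappa for every pair of decision makers; a minimal sketch, reusing error_consistency() from the earlier sketch (correctness data assumed):

```python
import numpy as np

def consistency_matrix(correct_by_agent):
    """correct_by_agent: dict mapping agent name -> binary correctness
    vector on the same images. Returns names and a kappa matrix."""
    names = list(correct_by_agent)
    kappa = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            kappa[i, j] = error_consistency(correct_by_agent[a],
                                            correct_by_agent[b])
    return names, kappa
```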
Error consistency between models (17 OOD datasets)
Data-rich models approach human-to-human observed consistency, but not error consistency. Observed consistency is not a good measure of image-level consistency, since it does not take consistency by chance into account; error consistency tracks whether there is consistency beyond chance. A substantial image-level consistency gap between human and machine vision thus remains. However, several models improve over vanilla CNNs, especially BiT-M (trained on 14M images) and CLIP (400M images).
Error consistency aggregated over multiple datasets
(Marker legend: ○ convolutional; ▽ vision transformer; ◇ human; colours as in the earlier figures)
For one group of datasets (sketch, silhouette, edge, cue conflict, low-pass):
OOD accuracy is a near-perfect predictor of image-level consistency; data-rich models (e.g. CLIP, SWSL, BiT) in particular narrow the consistency gap to humans. Training on large-scale datasets leads to considerable improvements for both architectures, convolutional and vision transformer.
For the remaining datasets (stylized, colour/greyscale, contrast, high-pass, phase-scrambling, power equalisation, false colour, rotation, Eidolon I, II and III, as well as uniform noise):
The human-machine gap is large; here, more robust models do not show improved error consistency.
Error consistency
l It remains an open question why the training dataset appears to have the most important impact
on a model’s decision boundary as measured by error consistency (as opposed to other
aspects of a model’s inductive bias).
l Datasets contain various shortcut opportunities (Geirhos et al., 2020), and if two different models are trained on similar data, they might converge to a similar solution simply by exploiting the same shortcuts. Making models more flexible (such as transformers, a generalisation of CNNs) would not change much in this regard.
l What affects error consistency? Dataset vs. architecture, flexibility vs. constraints → next slide
Error consistency
Dataset vs. architecture:
Error consistency between two identical models trained on very different datasets, such as ImageNet vs. Stylized-ImageNet (Geirhos et al., 2019), is much lower than error consistency between very different models (ResNet-50 vs. VGG-16) trained on the same dataset.
→ The dataset is more important.

Flexibility vs. constraints:
Error consistency between ResNet-50 and a highly flexible model (e.g., a vision transformer) is much higher than error consistency between ResNet-50 and a highly constrained model like BagNet-9 (Brendel and Bethge, 2019, arXiv:1904.00760).
→ Flexibility is more important.

Notes:
○ BagNet: models that extract features only from small image patches
○ Stylized-ImageNet (SIN): a "texture-less" dataset
Summary
l While self-supervised and adversarially trained models lack OOD robustness,
models based on vision transformers and/or trained on large-scale datasets now
match or exceed human performance on most datasets
l The OOD robustness gap between human and machine vision is closing, as the best models now match or exceed human accuracies.
l At the same time, an image-level consistency gap remains; however, this gap is, at least in some cases, narrowing for models trained on large-scale datasets.
This evaluation is open-sourced as a benchmark to track future progress.
(https://github.com/bethgelab/model-vs-human/)
(Supplementary) Eidolon
https://github.com/gestaltrevision/Eidolon
Geirhos, R., Temme, C. R. M., Rauber, J., Schütt, H. H., Bethge, M., & Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. http://arxiv.org/abs/1808.08750
