[Explained] "Partial Success in Closing the Gap between Human and Machine Vision" (Geirhos et al., 2021)
Cognitive Informatics Lab.
Dept. of Intelligence Science and Technology,
Graduate School of Informatics, Kyoto University
Sou Yoshihara (M2)
arXiv:2106.07411
Abstract: 3 findings
l The longstanding robustness gap between humans and CNNs is closing
l There is still a substantial image-level consistency gap, meaning that
humans make different errors than models
l In many cases, human-to-model consistency improves when training
dataset size is increased by one to three orders of magnitude
This evaluation is open-sourced as a benchmark to track future progress.
(https://github.com/bethgelab/model-vs-human/)
Introduction
l Currently, models are routinely matching and in many cases even outperforming humans
l At the same time, it is becoming increasingly clear that models systematically exploit shortcuts
shared between training and test data
l Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to
more challenging testing conditions (e.g. real-world scenarios) (Geirhos et al., 2020, arXiv:2004.07780)
Toy example of shortcut learning in neural networks: when trained on a simple dataset of stars and moons, a standard fully connected neural network learns a shortcut strategy, classifying based on the location of the objects (stars in the top right or bottom left; moons in the top left or bottom right) rather than their shape (Geirhos et al., 2020).
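As a concrete illustration, here is a minimal synthetic sketch of that location shortcut. Everything in it (dataset size, 16 × 16 images, a logistic-regression probe instead of the original fully connected network) is an illustrative assumption, not the authors' setup:

```python
# A location-confounded dataset: class membership is perfectly predictable
# from WHERE the object patch sits, even though the "objects" are pure noise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, quadrants):
    # Place a shapeless noise patch in one of the allowed quadrants.
    X = np.zeros((n, 16, 16))
    for img in X:
        q = quadrants[rng.integers(len(quadrants))]
        r0 = 0 if q in ("tl", "tr") else 8
        c0 = 0 if q in ("tl", "bl") else 8
        img[r0:r0 + 8, c0:c0 + 8] = rng.random((8, 8))
    return X.reshape(n, -1)

# "Stars": top-right or bottom-left; "moons": top-left or bottom-right.
X = np.vstack([sample(500, ("tr", "bl")), sample(500, ("tl", "br"))])
y = np.r_[np.zeros(500), np.ones(500)]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # ~1.0 despite shapeless "objects": a pure location shortcut
```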
Out-of-Distribution (OOD) data
l Out-of-Distribution (OOD) data:
Testing models on more challenging test cases where there is still a ground truth category, but
certain image statistics differ from the training distribution
l Previous works
○ ImageNet-C (Hendrycks et al., 2019)
○ ImageNet-Sketch (Wang et al., 2019)
○ Stylized-ImageNet (Geirhos et al., 2019)
They lack human comparison data
The authors tested human observers in a lab on OOD datasets
(85K psychophysical trials across 90 participants)
(Figure: example corruptions from ImageNet-C)
17 OOD datasets
(Figure: example images for the parametric distortions, applied to ImageNet images)
○ colour vs. grayscale
○ low contrast
○ high-pass
○ low-pass/blurring
○ phase noise
○ true power spectrum vs. power equalisation
○ true vs. opponent colour (false colour)
○ rotation
○ Eidolon I, II, III
○ uniform noise
3 axes of models
l Objective function
(supervised vs. self-supervised, adversarially trained, and CLIP’s joint
language-image training)
l Architecture
(convolutional vs. vision transformer)
l Training dataset size
(ranging from 1M to 1B images)
Psychophysical experiments
l Psychophysical experiments in a lab
○ 90 observers were tested in a darkened chamber
○ 16 categories (e.g. chair, dog, airplane)
○ 22” monitor with 1920 × 1200 pixels resolution (refresh rate: 120 Hz)
○ Viewing distance: 107 cm
○ Target images at the center subtended 3 × 3 degrees of visual angle (see the size computation below)
○ Each image was presented for 200 ms, followed by a 1/f noise backward mask
Measures against COVID-19 risks were taken.
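For reference, the on-screen size implied by these numbers can be checked with a little trigonometry (flat-screen geometry assumed; this computation is not from the paper):

```python
# Physical size of a 3-degree stimulus viewed from 107 cm.
import math

distance_cm = 107.0
angle_deg = 3.0
size_cm = 2 * distance_cm * math.tan(math.radians(angle_deg / 2))
print(f"{size_cm:.1f} cm")  # ~5.6 cm on screen
```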
Models
l 16 categories:
The 1000-class decision vector was mapped to those 16 classes using the WordNet hierarchy (Miller, 1995); a sketch of such a mapping follows the list below.
l 52 models
○ 24 standard ImageNet-trained CNNs
○ 8 self-supervised models
○ 6 Big Transfer models
○ 5 adversarially trained models
○ 5 vision transformers
○ 2 semi-weakly supervised models
○ Noisy Student
○ CLIP
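A minimal sketch of such a WordNet-based mapping is below. It assumes NLTK (≥ 3.3) with the WordNet corpus installed; taking the first noun synset per category name and averaging the probability mass are simplifying assumptions here, and the authors' model-vs-human repository implements its own mapping.

```python
# Map ImageNet synsets (wnids) onto the 16 entry-level categories via
# WordNet hypernyms, then aggregate a 1000-way probability vector.
import numpy as np
from nltk.corpus import wordnet as wn

CATEGORIES = ["airplane", "bear", "bicycle", "bird", "boat", "bottle",
              "car", "cat", "chair", "clock", "dog", "elephant",
              "keyboard", "knife", "oven", "truck"]
# Simplification: take the first noun synset for each category name.
CAT_SYNSETS = {c: wn.synsets(c, pos="n")[0] for c in CATEGORIES}

def category_of(wnid):
    """16-class label whose synset is an ancestor of the ImageNet synset
    `wnid` (e.g. 'n02084071' = dog), or None if it maps to no category."""
    syn = wn.synset_from_pos_and_offset("n", int(wnid[1:]))
    ancestors = set(syn.closure(lambda s: s.hypernyms())) | {syn}
    for cat, cat_syn in CAT_SYNSETS.items():
        if cat_syn in ancestors:
            return cat
    return None

def map_decision(probs_1000, wnids):
    """Aggregate 1000-way probabilities per category (mean here; an
    assumption) and return the winning 16-class label."""
    scores = {}
    for cat in CATEGORIES:
        idx = [i for i, w in enumerate(wnids) if category_of(w) == cat]
        if idx:
            scores[cat] = float(np.mean(probs_1000[idx]))
    return max(scores, key=scores.get)
```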
Metrics
l OOD accuracy (averaged across conditions and datasets)
l Accuracy difference A(m): mean absolute difference between human and model accuracy
l Observed consistency O(m): average of the binary agreement o_{m,h}(s_{d,c}), which is 1 if humans and a model are either both right or both wrong on an image, 0 otherwise
l Error consistency E(m): tracks whether there is above-chance consistency (next slide)
(Notation: d: dataset, c: condition, h: human, m: model, s: sample, i.e. an image)
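A minimal sketch of the first two metrics, assuming binary per-image correctness vectors and per-condition accuracies (the data layout is an assumption; this is not the authors' code):

```python
import numpy as np

def accuracy_difference(acc_human, acc_model):
    """A(m): mean absolute accuracy gap, averaged across conditions."""
    return float(np.mean(np.abs(np.asarray(acc_human) - np.asarray(acc_model))))

def observed_consistency(correct_h, correct_m):
    """O(m): mean of o_{m,h}(s) -- 1 when both are right or both wrong."""
    h, m = np.asarray(correct_h, bool), np.asarray(correct_m, bool)
    return float(np.mean(h == m))
```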
Error consistency E(m)
l Error consistency E(m) is Cohen's kappa: it indicates whether the observed consistency is larger than what could have been expected by chance

E(m) = (c_obs − c_exp) / (1 − c_exp)

c_exp: the consistency expected for two independent binomial decision makers with matched accuracies, i.e. consistency arising purely by chance
Cohen's kappa agreement levels:
< 0: no agreement
0–0.20: slight
0.21–0.40: fair
0.41–0.60: moderate
0.61–0.80: substantial
0.81–1: almost perfect
(Cohen's kappa, Wikipedia)
(Eq. from Geirhos et al., 2020, arXiv:2006.16736)
c_obs: observed consistency; c_exp: expected consistency
Why error consistency?
e.g. two independent decision makers with accuracy p = 95% each will have at least 90% observed consistency, since c_exp = p² + (1 − p)² = 0.95² + 0.05² ≈ 0.905 (intuitively, both get most images correct, so the observed overlap is high even without any systematic similarity)
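A minimal sketch of error consistency on simulated data, checking the 95%-accuracy example above:

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    c_obs = np.mean(a == b)                    # observed consistency
    p_a, p_b = a.mean(), b.mean()              # accuracies
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)  # expected by chance
    return (c_obs - c_exp) / (1 - c_exp)       # Cohen's kappa

# Two independent 95%-accurate decision makers: c_exp = 0.905, so high
# observed consistency alone is expected; kappa should be near 0.
rng = np.random.default_rng(0)
a = rng.random(100_000) < 0.95
b = rng.random(100_000) < 0.95
print(round(error_consistency(a, b), 3))  # ~0.0
```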
Results on OOD datasets
The OOD robustness gap between human and machine vision is closing (top),
but an image-level consistency gap remains (bottom, especially (d)).
(Figure legend: humans; standard supervised CNNs; self-supervised models; adversarially trained models; vision transformers; Noisy Student; BiT; SWSL; CLIP. ↓ marks models trained on large-scale datasets.)
Robustness across models
Self-supervised models
SimCLR variants (SimCLR-x1, SimCLR-x2,
SimCLR-x4) show strong generalisation
improvements on uniform noise, low contrast,
and high-pass images
This is quite remarkable given that SimCLR
models were trained on a different set of
augmentations (random crop with flip and
resize, colour distortion, and Gaussian blur)
→ Is the defining factor the objective function or the choice of augmentations?
https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html
An illustration of SimCLR
Objective function vs. the choice of augmentations?
(Figure legend: triangles: self-supervised models; stars: supervised baselines; blue: ResNet x1; green: ResNet x4; red diamonds: humans)
Self-supervised models vs.
Augmentation-matched supervised baseline models
Augmentations:
random crop with flip and resize
colour distortion
Gaussian blur
The augmentation scheme (rather than the
self-supervised objective) indeed made the
crucial difference:
augmentation-matched supervised baselines
show just the same generalisation behaviour.
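For concreteness, here is a sketch of that augmentation stack in torchvision (≥ 0.8); the parameter values follow commonly used SimCLR settings and are assumptions here, not taken from the paper:

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop + resize
    transforms.RandomHorizontalFlip(),      # flip
    # Colour distortion: strong jitter applied most of the time,
    # plus occasional conversion to grayscale.
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```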
Robustness across models
Adversarially trained models
The stronger the model is trained
adversarially (darker shades of blue), the
more susceptible it becomes to (random)
image degradations.
A simple rotation by 90 degrees leads to a
50% drop in classification accuracy.
Adversarial robustness seems to come
at the cost of increased vulnerability to
large-scale perturbations.
Adversarial examples (Goodfellow et al., 2014)
Adversarial training increases shape bias
There is a relationship between shape bias and the degree of adversarial training
shape bias = (correct shape decisions) / (correct shape decisions + correct texture decisions)
Stimuli: texture–shape cue-conflict images (each combines the shape of one class with the texture of another)
Made by Style Transfer (Gatys et al., 2016)
1,200 images
(16 classes, 75 images per shape label)
URL: https://github.com/rgeirhos/texture-vs-shape/tree/master/stimuli/style-transfer-preprocessed-512 (Geirhos et al., 2019)
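A minimal sketch of this formula, assuming one recorded decision per cue-conflict image together with its shape and texture labels (the data layout is an assumption):

```python
def shape_bias(decisions, shape_labels, texture_labels):
    """Fraction of shape decisions among all shape-or-texture decisions;
    images classified as neither label are ignored, as in the formula."""
    n_shape = sum(d == s for d, s in zip(decisions, shape_labels))
    n_texture = sum(d == t for d, t in zip(decisions, texture_labels))
    return n_shape / (n_shape + n_texture)
```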
Robustness across models
Vision transformers
The best vision transformer (ViT-L
trained on 14M images) even exceeds
human OOD accuracy
Vision transformers trained on 1M
images (light green) are already better
than standard convolutional models
Higher shape bias (results not shown here; cf. Tuli et al., 2021; Naseer et al., 2021)
(Dosovitskiy et al., 2020, arXiv:2010.11929)
Robustness across models
Standard models trained on
more data: BiT-M, SWSL, Noisy
Student
The biggest effect on OOD
robustness simply comes from
training on larger datasets, not from
advanced architectures
BiT: Big Transfer (Kolesnikov et al., 2019, arXiv:1912.11370)
○ ResNet152x4: every layer is 4× wider
○ BiT-M: trained on ImageNet-21k (14M images)
SWSL: Semi-Weakly Supervised Learning (Yalniz et al., 2019, arXiv:1905.00546)
○ trained on 940M images
Noisy Student (Xie et al., 2020, arXiv:1911.04252)
○ trained on 300M images
Robustness across models
CLIP is "special" in three ways:
- more data: trained on 400M image–text pairs
- novel objective: joint language-image supervision
- non-standard architecture: a vision transformer backbone
It is the most human-like model across all metrics.
(Radford et al., 2021, arXiv:2103.00020)
Error consistency between models (“sketch” dataset)
Do they make errors on the same individual images?
Whether a standard supervised model, a self-supervised model, an adversarially trained model, or a vision transformer: all of these models make highly systematic errors, i.e. errors that are highly consistent with one another.
Humans show a very different pattern of errors.
The boundary between humans and some data-rich models, especially CLIP (400M images) and SWSL (940M images), is blurry: these models make more human-like errors than standard models.
Error consistency analysis on a single dataset, "sketch" (for other datasets see the original paper, Figures 9, 11, 12, 13, 14)
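Such a matrix can be produced by computing Cohen's kappa for every pair of decision makers; a minimal sketch, reusing error_consistency() from the earlier sketch (correctness data assumed):

```python
import numpy as np

def consistency_matrix(correct_by_agent):
    """correct_by_agent: dict mapping agent name -> binary correctness
    vector on the same images. Returns names and a kappa matrix."""
    names = list(correct_by_agent)
    kappa = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            kappa[i, j] = error_consistency(correct_by_agent[a],
                                            correct_by_agent[b])
    return names, kappa
```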
Error consistency between models (17 OOD datasets)
Data-rich models approach human-to-human observed consistency, but not error consistency. Observed consistency is not a good measure of image-level consistency, since it does not take consistency by chance into account; error consistency tracks whether there is consistency beyond chance. A substantial image-level consistency gap between human and machine vision thus remains. However, several models improve over vanilla CNNs, especially BiT-M (trained on 14M images) and CLIP (400M images).
Error consistency aggregated over multiple datasets
(Marker legend: ○ convolutional; ▽ vision transformer; ◇ human; colours as in the earlier figures)
For one group of datasets (sketch, silhouette, edge, cue conflict, low-pass):
OOD accuracy is a near-perfect predictor of image-level consistency; data-rich models (e.g. CLIP, SWSL, BiT) in particular narrow the consistency gap to humans. Training on large-scale datasets leads to considerable improvements for both architectures, convolutional and vision transformer.
For the remaining datasets (stylized, colour/greyscale, contrast, high-pass, phase-scrambling, power equalisation, false colour, rotation, Eidolon I, II and III, as well as uniform noise):
The human-machine gap is large; here, more robust models do not show improved error consistency.
Error consistency
l It remains an open question why the training dataset appears to have the most important impact
on a model’s decision boundary as measured by error consistency (as opposed to other
aspects of a model’s inductive bias).
l Datasets contain various shortcut opportunities (Geirhos et al., 2020), and if two different models are trained on similar data, they might converge to a similar solution simply by exploiting the same shortcuts. Making models more flexible (such as transformers, a generalisation of CNNs) would not change much in this regard.
l What affects error consistency? Dataset vs. architecture, flexibility vs. constraints → next slide
Error consistency
Dataset vs. architecture:
Error consistency between two identical models trained on very different datasets, such as ImageNet vs. Stylized-ImageNet (Geirhos et al., 2019), is much lower than error consistency between very different models (ResNet-50 vs. VGG-16) trained on the same dataset.
→ The dataset is more important.

Flexibility vs. constraints:
Error consistency between ResNet-50 and a highly flexible model (e.g., a vision transformer) is much higher than error consistency between ResNet-50 and a highly constrained model like BagNet-9 (Brendel and Bethge, 2019, arXiv:1904.00760).
→ Flexibility is more important.

Notes:
○ BagNet: models that extract features only from small image patches
○ Stylized-ImageNet (SIN): a "texture-less" dataset
Summary
l While self-supervised and adversarially trained models lack OOD robustness,
models based on vision transformers and/or trained on large-scale datasets now
match or exceed human performance on most datasets
l The OOD robustness gap between human and machine vision is closing, as the best models now match or exceed human accuracies.
l At the same time, an image-level consistency gap remains; however, this gap is, at least in some cases, narrowing for models trained on large-scale datasets.
This evaluation is open-sourced as a benchmark to track future progress.
(https://github.com/bethgelab/model-vs-human/)
(Supplementary) Eidolon
https://github.com/gestaltrevision/Eidolon
Geirhos, R., Temme, C. R. M., Rauber, J., Schütt, H. H., Bethge, M., & Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. http://arxiv.org/abs/1808.08750
