Abstract: While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on ImageNet classification have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new dataset of human perceptual similarity judgments. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins on our dataset. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
Despite their strong transfer performance, deep convolutional representations surprisingly lack a basic low-level property -- shift-invariance, as small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling, strided-convolution, and average-pooling, ignore the sampling theorem. The well-known signal processing fix is anti-aliasing by low-pass filtering before downsampling. However, simply inserting this module into deep networks degrades performance; as a result, it is seldom used today. We show that when integrated correctly, it is compatible with existing architectural components, such as max-pooling and strided-convolution. We observe increased accuracy in ImageNet classification, across several commonly-used architectures, such as ResNet, DenseNet, and MobileNet, indicating effective regularization. Furthermore, we observe better generalization, in terms of stability and robustness to input corruptions. Our results demonstrate that this classical signal processing technique has been undeservingly overlooked in modern deep networks.
Goal: Set up a pre-training scheme to induce a "useful" representation

- Egomotion: Agrawal et al. ICCV 2015; Jayaraman et al. ICCV 2015.
- Context: Doersch et al. ICCV 2015; Noroozi and Favaro. ECCV 2016; Pathak et al. CVPR 2016.
- Autoencoders: Hinton & Salakhutdinov. Science 2006. Denoising autoencoders: Vincent et al. ICML 2008.
- Video: Wang et al. ICCV 2015; Misra et al. ECCV 2016; Pathak et al. CVPR 2017.
- Audio: de Sa. NIPS 1994; Owens et al. ECCV 2016; Arandjelovic & Zisserman. ICCV 2017.
- Generative modeling: Donahue et al. ICLR 2017; Dumoulin et al. ICLR 2017.
Cross-Channel Encoder
Split raw data X into channel subsets X1 and X2; induce abstraction through prediction (predict one subset from the other).
c.f. Larsson et al. Colorization as a Proxy Task for Visual Understanding. In CVPR, 2017.
Zhang, Isola, Efros. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.
c.f. LeCun, Denker, Solla. Optimal Brain Damage. In NIPS, 1989.
Split-Brain Autoencoder
Two cross-channel encoders in opposite directions: predict X2 from X1 and X1 from X2; the two learned representations are concatenated to process the full input X.
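A minimal NumPy sketch of the split-brain idea. The linear "encoders" below are hypothetical stand-ins for the deep sub-networks, and training (each half predicting the other) is omitted; only the channel split and the concatenated representation are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    # Stand-in for a deep sub-network: linear map over channels + ReLU.
    return np.maximum(0.0, np.tensordot(w, x, axes=([1], [0])))

# Raw data X: a 3-channel "image", split into disjoint channel groups,
# e.g. lightness (L) vs. color (ab) in Lab space.
X = rng.standard_normal((3, 8, 8))
X1, X2 = X[:1], X[1:]

# Two cross-channel encoders trained in opposite directions:
# one would be trained to predict X2 from X1, the other X1 from X2.
W1 = rng.standard_normal((4, 1))  # weights of the X1 -> X2 encoder
W2 = rng.standard_normal((4, 2))  # weights of the X2 -> X1 encoder

# The split-brain representation concatenates both halves' features.
F = np.concatenate([encoder(X1, W1), encoder(X2, W2)], axis=0)
print(F.shape)  # (8, 8, 8): 4 + 4 feature channels over an 8x8 grid
```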
Split-Brain Autoencoder on Images
[Diagram: input image X, predicted image X.]
Task & Dataset Generalization
Does the feature representation transfer to other tasks and datasets?

Berkeley-Adobe Perceptual Patch Similarity Dataset
Zhang, Isola, Efros, Shechtman, Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
[Figure: Classification (label: "rock beauty") vs. Perceptual Judgments (a distance D(·, ·) between two patches).]
How different are these patches? Which patch is more similar to the middle?
Judged by: Humans · L2/PSNR · SSIM/FSIMc · Deep Networks?
"Perceptual Losses"
Gatys et al. In CVPR, 2016.
Johnson et al. In ECCV, 2016.
Dosovitskiy and Brox. In NIPS, 2016.
Chen and Koltun. In ICCV, 2017.
Deep Networks as a Perceptual Metric
[Diagram: images x and x0 each pass through network F; per-layer features are normalized and subtracted; the L2 norm is taken and spatially averaged; layer distances are averaged (Avg) into a single distance d0.]
(1) How well do “perceptual losses” describe perception?
(2) Does it have to be VGG trained on classification?
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
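The distance computation (normalize, subtract, L2 norm, spatial average, average across layers) can be sketched in NumPy. This is an illustrative sketch, not the released implementation: feature extraction by the network F is assumed to have already happened, so the function takes precomputed per-layer feature stacks.

```python
import numpy as np

def lpips_distance(feats_x, feats_x0, weights=None):
    """Perceptual distance between two images from per-layer deep features.

    feats_x, feats_x0: lists of arrays, one per layer, each of shape (C, H, W).
    weights: optional list of per-layer channel weights (the learned "w");
             None corresponds to the "off-the-shelf" setting (w = 1).
    """
    layer_dists = []
    for l, (f, f0) in enumerate(zip(feats_x, feats_x0)):
        # Unit-normalize each spatial feature vector along the channel axis.
        fn = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-10)
        f0n = f0 / (np.linalg.norm(f0, axis=0, keepdims=True) + 1e-10)
        diff = (fn - f0n) ** 2                       # subtract
        if weights is not None:
            diff = weights[l][:, None, None] * diff  # learned channel weights
        layer_dists.append(diff.sum(axis=0).mean())  # L2 over channels, spatial avg
    return float(np.mean(layer_dists))               # average across layers
```

Identical inputs give distance 0; any feature difference gives a positive distance.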
% agreement with human judges
[Bar chart. Groups: Low-level (68.9); AlexNet (Random); AlexNet (Unsupervised); AlexNet (Self-supervised); Nets (Supervised - ImageNet classification); Human (82.6). Deep-network scores range roughly from 69.7 to 78.0.]
- Bigger/Deeper ≠ Better
- Networks perform strongly across supervisory signals and architectures
- Fitting some data is important
- VGG on classification ("perceptual loss") correlates well
Can we train on perceptual judgments?
Training a Perceptual Metric
[Same setup as before: images x and x0 pass through network F; features are normalized and subtracted, L2-normed, spatially averaged, and averaged across layers into a distance d0, now with a learned per-channel weight w on the feature differences.]
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
Variants:
- Off-the-shelf (w = 1)
- Frozen (backbone fixed, w learned)
- Tuned (backbone and w learned)
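Learning w on a frozen network amounts to a small logistic regression: fit weights on precomputed distances so the weighted metric agrees with human two-alternative forced choice (2AFC) judgments. A minimal sketch, with per-layer (rather than per-channel) weights for brevity; the function name and training details are illustrative, not taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_layer_weights(d0, d1, h, lr=0.5, steps=2000):
    """Learn non-negative per-layer weights w on a frozen network's distances.

    d0, d1: (N, L) per-layer distances of patches 0 and 1 to the reference.
    h:      (N,) fraction of humans judging patch 1 as more similar.
    Model:  P(pick patch 1) = sigmoid(w·d0 - w·d1)  (closer patch preferred).
    """
    w = np.ones(d0.shape[1])
    for _ in range(steps):
        p = sigmoid((d0 - d1) @ w)               # predicted preference
        grad = (d0 - d1).T @ (p - h) / len(h)    # cross-entropy gradient
        w = np.maximum(0.0, w - lr * grad)       # keep weights non-negative
    return w
```

On synthetic judgments generated from known weights, the fit recovers which layers matter: a layer that drives the judgments ends up with a larger weight than an irrelevant one.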
Off-the-shelf networks already perform well.
Training a linear layer on top yields a small performance boost.
Fine-tuning through the representation leads to overfitting.

Training a Perceptual Metric (% agreement with human judges):
                        Off-the-shelf  Frozen  Tuned  Human
Training distribution            76.8    78.7   80.6   82.6
Held-out distribution            65.0    65.3   64.3   69.5
“LPIPS” metric: richzhang.github.io/PerceptualSimilarity
Additionally
- Ensembled-LPIPS. Kettunen, Härkönen, Lehtinen. ArXiv 2019.
- Audio domain. Manocha, Finkelstein, Jin, Bryan, Zhang, Mysore. In progress.
Deep Networks are not Shift-Invariant
[Plots: P(correct class) fluctuates drastically as the input image is shifted.]
Azulay and Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? In arXiv, 2018.
Engstrom, Tsipras, Schmidt, Madry. Exploring the Landscape of Spatial Robustness. In ICML, 2019.
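The underlying failure is ordinary aliasing. A one-dimensional toy sketch (the networks above operate on images, but the mechanism is the same): a high-frequency signal subsampled with stride 2 flips completely under a one-sample shift, while blurring first removes the sensitivity.

```python
import numpy as np

# A high-frequency signal: alternating 0/1, above the post-subsampling Nyquist rate.
x = np.tile([0.0, 1.0], 8)

down = x[::2]                    # naive stride-2 subsampling
down_shift = np.roll(x, 1)[::2]  # the same signal shifted by one sample

print(down)        # all zeros
print(down_shift)  # all ones: a 1-sample shift flips the output entirely

# Low-pass blurring before subsampling (anti-aliasing) removes the sensitivity.
k = np.array([1.0, 2.0, 1.0]) / 4.0
aa = np.convolve(x, k, mode='same')[::2]
aa_shift = np.convolve(np.roll(x, 1), k, mode='same')[::2]
# Away from the boundary, both are ~0.5 regardless of the shift.
```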
Shift-equivariance, per layer
[Plot: deviation from shift-equivariance at each layer (pixels, conv1, pool1, conv2, pool2, conv3, pool3, conv4, pool4, conv5, pool5, classifier, softmax), ranging from perfect shift-equivariance to large deviation.]
Every pooling increases periodicity.
Alternative downsampling methods
• Blur+subsample: antialiasing in signal processing; image processing; graphics
• Max-pooling: performs better in deep learning applications [Scherer 2010]
Reconcile antialiasing with max-pooling
Anti-aliased pooling (MaxBlurPool)
- Strided-MaxPool: max taken with stride 2. Shift-equivariance lost; heavy aliasing.
- Equivalent interpretation: max taken densely (preserves shift-equivariance), followed by subsampling (shift-eq. lost; heavy aliasing).
- MaxBlurPool: max taken densely (preserves shift-eq.), then blur, then subsampling. Shift-eq. lost, but reduced aliasing.
- The blur and subsampling steps are evaluated together as "BlurPool".
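A one-dimensional NumPy sketch of the two pooling variants (an illustration of the decomposition, not the paper's 2-D implementation):

```python
import numpy as np

def max_pool_strided(x):
    # Standard strided max-pool (kernel 2, stride 2): max and subsample at once.
    return np.maximum(x[:-1:2], x[1::2])

def max_blur_pool(x):
    # MaxBlurPool: (1) max densely (stride 1) -> preserves shift-equivariance,
    #              (2) blur with a low-pass filter -> anti-aliasing,
    #              (3) subsample by 2.
    m = np.maximum(x[:-1], x[1:])               # dense max, window 2
    k = np.array([1.0, 2.0, 1.0]) / 4.0         # binomial blur kernel
    return np.convolve(m, k, mode='same')[::2]  # blur, then subsample
```

On a periodic test signal, shifting the input by one sample changes the strided max-pool output far more than the MaxBlurPool output, which is the point of inserting the blur.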
Striding aliases (stride = 2). Adding an antialiasing filter gives:
+ shift-equivariance
+ accuracy
Additionally:
+ stability to other perturbations
+ robustness to corruptions
Discussion
Discriminative Learning (supervision: semantic labels, e.g. "rock beauty")
+ Solve discriminative tasks
+ Learn about the visual world: textures? edges? parts? perceptual similarity?

Image Synthesis (supervision: raw unlabeled data)
+ Solve graphics tasks
+ Force the network to learn about the visual world for free: textures? edges? parts? perceptual similarity?
+ Engineering in inductive biases (shift-invariance) still valuable
Shift-Invariance vs Classification Accuracy
[Scatter plots: training without vs. with data augmentation.]
Data augmentation increases both accuracy and shift-invariance.
Engineering in shift-invariance is "free" data augmentation: it boosts shift-invariance while maintaining accuracy.
Our Distortions vs. Real Algorithm Outputs
[Bar charts. Legend: Human; Low-level (L2/PSNR, SSIM, FSIMc); Net (Random: Gaussian); Net (Unsupervised: K-Means); Net (Self-supervised: Watch Obj., Split-Brain, Puzzle, BiGAN); Net (Supervised: SqueezeNet, AlexNet, VGG); Percep-Trained (Frozen); Percep-Trained (Tuned).]
- Near human-level performance within our distortions
- Linear "calibration" on our distortions transfers successfully
- Training on the direct task is not the complete solution
Automatic Results with Deep Networks
Grayscale input, colorized by: Larsson et al. In ECCV, 2016. Iizuka et al. In SIGGRAPH, 2016. Zhang, Isola, Efros. In ECCV, 2016.
Dorothea Lange. Migrant Mother, 1936. Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C.
Zhang*, Zhu*, Isola, Geng, Lin, Yu, Efros. Real-Time User-Guided Image Colorization with Learned Deep Priors. In SIGGRAPH, 2017.
Dorothea Lange. Migrant Mother, 1936. Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C.