Modeling Perceptual Similarity and
Shift-Invariance in Deep Networks
Richard Zhang
Adobe Research, San Francisco
(formerly UC Berkeley EECS PhD)
Discriminative Deep Networks
“Rock beauty”
Textures? Edges? Parts?
Perceptual similarity?
Discriminative Deep Networks
Raw, Unlabeled Pixels
Textures? Edges? Parts?
Perceptual similarity?
Generative Deep Networks
Raw, Unlabeled Pixels
Textures? Edges? Parts?
Perceptual similarity?
Grayscale image: L channel → Color information: ab channels
Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016. c.f. Larsson, Maire, Shakhnarovich. In ECCV, 2016.
Concatenate (L,ab) channels
Grayscale image: L channel
Initial colorization attempt
Loss Function: Colors in ab space (continuous)
Training data
…
• Regression with L2 loss is inadequate: averaging a multimodal color distribution yields desaturated results
Better Loss Function
Colors in ab space (discrete)
• Regression with L2 loss is inadequate
• Use per-pixel multinomial classification
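A minimal sketch of this loss in PyTorch (illustrative, not the released training code): the ab plane is quantized into Q = 313 in-gamut bins, each pixel is classified, and class-rebalancing weights keep rare, saturated colors from being drowned out by common desaturated ones.

```python
# Sketch of per-pixel multinomial classification over quantized ab bins
# (assumes targets have already been quantized; illustrative only).
import torch
import torch.nn.functional as F

def colorization_loss(logits, ab_bin_targets, class_weights):
    """logits: (B, 313, H, W) predicted scores over quantized ab bins.
    ab_bin_targets: (B, H, W) ground-truth ab bin index per pixel.
    class_weights: (313,) rebalancing weights boosting rare, saturated bins."""
    return F.cross_entropy(logits, ab_bin_targets, weight=class_weights)
```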
[Figure columns: Input · Classification · L2 Regression · Ground Truth]
Accounting for multimodality in the
output space improves visual realism
Ansel Adams. Yosemite Valley Bridge.
Dr. David Fleay, Thylacine (extinct in 1936).
Amateur Family Photo, 1956.
Automatic Colorization
Automatic Colorization (Informative Mistake)
Cross-Channel Encoder
[Architecture diagram: 256×256 lightness (L) input → conv1–conv5, fc6–fc8 → per-pixel distribution over 313 quantized ab color bins; predicted ab is recombined with the input L channel]
Hidden Unit Activations
Zhou et al. In ICLR, 2015.
c.f. Larsson et al. Colorization as a Proxy Task for Visual Understanding. In CVPR, 2017.
Hidden Unit (conv5) Activations
[Figure: individual conv5 units activate on sky, trees, water, faces, dog, flowers]
Goal: Set up a pre-training scheme to induce a “useful” representation
• Autoencoders: Hinton & Salakhutdinov. Science 2006.
• Denoising autoencoders: Vincent et al. ICML 2008.
• Egomotion: Agrawal et al. ICCV 2015; Jayaraman et al. ICCV 2015.
• Context: Doersch et al. ICCV 2015; Pathak et al. CVPR 2016; Noroozi and Favaro. ECCV 2016.
• Video: Wang et al. ICCV 2015; Misra et al. ECCV 2016; Pathak et al. CVPR 2017.
• Audio: de Sa. NIPS 1994; Owens et al. ECCV 2016; Arandjelovic & Zisserman. ICCV 2017.
• Generative modeling: Donahue et al. ICLR 2017; Dumoulin et al. ICLR 2017.
Traditional Autoencoder
Raw data X → X̂: induce abstraction through compression
Denoising Autoencoder
Corrupted raw data → X̂ (reconstructing raw data X): induce abstraction through reconstruction
Cross-Channel Encoder
Raw data X split into channel subsets (X1, X2); predict X̂2 from X1: induce abstraction through prediction
c.f. Larsson et al. Colorization as a Proxy Task for Visual Understanding. In CVPR, 2017.
Zhang, Isola, Efros. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.
c.f. LeCun, Denker, Solla. Optimal Brain Damage. In NIPS, 1989.
Split-Brain Autoencoder
Input image X is split into (X1, X2); each sub-network predicts the opposite half (X1 → X̂2, X2 → X̂1), and the concatenated predictions form the predicted image X̂
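A minimal sketch of the split-brain forward pass, assuming two sub-networks (names here are illustrative) that each see one subset of the input channels and predict the complementary subset:

```python
# Sketch of a split-brain autoencoder: X is split channel-wise into
# (X1, X2); each sub-network predicts the half it does not see.
import torch.nn as nn

class SplitBrainAutoencoder(nn.Module):
    def __init__(self, net_a: nn.Module, net_b: nn.Module, split: int = 1):
        super().__init__()
        self.net_a, self.net_b = net_a, net_b  # disjoint halves of one architecture
        self.split = split                     # channel index where X is split (e.g., L | ab)

    def forward(self, x):
        x1, x2 = x[:, :self.split], x[:, self.split:]  # X -> (X1, X2)
        x2_hat = self.net_a(x1)  # X1 -> X̂2 (e.g., predict ab from L)
        x1_hat = self.net_b(x2)  # X2 -> X̂1 (e.g., predict L from ab)
        # training compares x1_hat vs x1 and x2_hat vs x2; features from both
        # sub-networks are concatenated to form the full representation
        return x1_hat, x2_hat
```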
Split-Brain Autoencoder on Images
Split-Brain Autoencoder on RGB-D
Input RGB-HHA image: the RGB channels predict the HHA depth channels, and the HHA depth channels predict the RGB channels; together they form the predicted RGB-HHA image
Pre-trained model with class supervision: how semantically organized is the feature space?
[Plot: Top-1 classification accuracy of linear classifiers on each layer (conv1, pool1, conv2, pool2, conv3, conv4, conv5, pool5) of the AlexNet-style network; Random initialization vs. Supervised pre-training. Krizhevsky et al. In NIPS, 2012.]
Performance Comparison
[Plot: Top-1 classification accuracy per layer (conv1–pool5) for: Random initialization; Supervised pre-training (Krizhevsky et al. In NIPS, 2012); prior self-supervised methods (Doersch et al. In ICCV, 2015; Pathak et al. In CVPR, 2016; Krähenbühl et al. In ICLR, 2016; Donahue et al. In ICLR, 2017); and our variants: Split-Brain Autoencoder, Colorization, no split, corrupted input (regression)]
Accounting for multimodality
improves representation
Task & Dataset Generalization
Does the feature representation transfer
to other tasks and datasets?
Berkeley-Adobe Perceptual
Patch Similarity Dataset
Zhang, Isola, Efros, Shechtman, Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
Classification: “Rock beauty” · Perceptual judgments: D( · , · )
How different are these patches?
Which patch is more similar to the middle?
Humans
L2/PSNR
SSIM/FSIMc
Deep Networks?
“Perceptual Losses”
Gatys et al. In CVPR, 2016.
Johnson et al. In ECCV, 2016.
Dosovitskiy and Brox. In NIPS, 2016.
Chen and Koltun. In ICCV, 2017.
Deep Networks as a Perceptual Metric
[Diagram: deep features F(x) and F(x0) are computed per layer, unit-normalized in the channel dimension, subtracted, L2-normed per spatial location, spatially averaged, and averaged across layers to produce the distance d0]
(1) How well do “perceptual losses” describe perception?
(2) Does it have to be VGG trained on classification?
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
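A minimal sketch of this distance (illustrative shapes and normalization, corresponding to the unweighted "perceptual loss" variant):

```python
# Sketch of the deep perceptual distance: per-layer features are
# unit-normalized over channels, differenced, squared, summed over
# channels, spatially averaged, then averaged across layers.
import torch

def perceptual_distance(feats_x, feats_x0):
    """feats_*: lists of (B, C_l, H_l, W_l) activations for x and x0."""
    d = 0.0
    for f, f0 in zip(feats_x, feats_x0):
        f = f / (f.norm(dim=1, keepdim=True) + 1e-10)    # unit-normalize channel columns
        f0 = f0 / (f0.norm(dim=1, keepdim=True) + 1e-10)
        d = d + ((f - f0) ** 2).sum(dim=1).mean(dim=(1, 2))  # L2 over channels, avg over space
    return d / len(feats_x)                              # average across layers -> d0
```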
(1) Traditional Distortions
Original patch → distorted patches: noise, photometric, spatial warps, compression, blur
(2) CNN-Based Distortions
Goal: Imitate CNN artifacts from real algorithms
Original patch → generator G → distorted patches. G trained with random:
- Input corruption
- Generator architecture
- Discriminator architecture
- Loss function
Which patch is more similar to the middle?
% agreement with human judges
[Bar chart; scores: 68.9, 69.7, 70.0, 70.6, 74.8, 75.5, 75.7, 76.4, 76.8, 76.8, 78.0; human: 82.6. Legend: Low-level, AlexNet (Random), AlexNet (Unsupervised), AlexNet (Self-supervised), Nets (Supervised - ImageNet classification), Human]
Bigger/Deeper ≠ Better
Networks perform strongly across supervisory signals and architectures
Fitting some data is important
VGG on classification
(“perceptual loss”) correlates well
Can we train on perceptual judgments?
Training distribution:
• Traditional distortion types: hand-defined functions (original patch → distorted patches)
• CNN-based distortions: family of generators (original patch → distorted patches)
Held-out test distribution:
• Real algorithm outputs, for the tasks: superresolution, colorization, frame interpolation, video deblurring (e.g., downsample + superresolution)
Training a Perceptual Metric
[Diagram: same distance computation as before, with learned linear weights w applied to the channel-wise feature differences between x and x0 before spatial averaging, producing d0]
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
Off-the-shelf (w=1)
Frozen (w learned)
Tuned (w learned)
Off-the-shelf networks already perform well.
Training a linear layer on top yields a small performance boost.
Fine-tuning through the representation leads to overfitting.
% agreement with human judges:
                         Off-the-shelf   Frozen   Tuned   Human
Training distribution         76.8        78.7     80.6    82.6
Held-out distribution         65.0        65.3     64.3    69.5
“LPIPS” metric: richzhang.github.io/PerceptualSimilarity
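A usage sketch of the released package, following the repository's documented API (pip install lpips):

```python
# Usage sketch of the LPIPS package; inputs are RGB tensors in [-1, +1].
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')        # AlexNet features + learned linear weights w
img0 = torch.rand(1, 3, 64, 64) * 2 - 1  # random example images
img1 = torch.rand(1, 3, 64, 64) * 2 - 1
d = loss_fn(img0, img1)                  # smaller distance = more perceptually similar
```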
Additionally
- Ensembled-LPIPS. Kettunen, Härkönen, Lehtinen. ArXiv 2019.
- Audio domain. Manocha, Finkelstein, Jin, Bryan, Zhang, Mysore. In progress.
Deep Networks are not Shift-Invariant
[Example classifications: P(correct class) oscillates as the input image is shifted by a few pixels]
Azulay and Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? In ArXiv, 2018.
Engstrom, Tsipras, Schmidt, Madry. Exploring the Landscape of Spatial Robustness. In ICML, 2019.
Making Convolutional Networks
Shift-Invariant Again
Richard Zhang
In ICML, 2019.
Why is shift-invariance lost?
“Convolutions are shift-equivariant”
“Pooling builds up shift-invariance”
…but striding ignores the Nyquist sampling theorem and aliases
Re-examining Max-Pooling
[Animation: a max window slides across a 1-D signal. Evaluated densely (stride 1), the max output simply shifts along with the input; with stride 2, a one-pixel input shift changes which elements each window selects, so the output changes rather than shifting.]
Max-pooling breaks shift-equivariance
Shift-equivariance Testbed
• CIFAR
• VGG network
• 5 max-pools
• Test shift-equivariance condition: Shift(F(X)) = F(Shift(X))
• Circular convolution/shift
[Diagram: 32×32 input → conv1/pool1 → conv2/pool2 → conv3/pool3 → conv4/pool4 → conv5/pool5 → 1×1 classifier → softmax; each feature map is scored from perfect shift-equivariance to large deviation from shift-equivariance]
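A minimal sketch of this check under circular shifts (not the paper's exact evaluation code): features of a shifted input should equal the shifted features of the original input.

```python
# Sketch: measure deviation from Shift(F(X)) = F(Shift(X)) with
# circular shifts; assumes `features` returns a map at the same
# spatial resolution as its input (e.g., stride-1 convs with
# circular padding).
import torch

def equivariance_deviation(features, x, dx, dy):
    shifted_x = torch.roll(x, shifts=(dy, dx), dims=(2, 3))               # Shift(X)
    f_then_shift = torch.roll(features(x), shifts=(dy, dx), dims=(2, 3))  # Shift(F(X))
    shift_then_f = features(shifted_x)                                    # F(Shift(X))
    return (f_then_shift - shift_then_f).abs().mean().item()  # 0 => perfectly shift-equivariant
```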
Shift-equivariance, per layer
[Plots: deviation from shift-equivariance at each layer (conv1, pool1, …, conv5, pool5, classifier, softmax) as a function of input shift in pixels, from perfect shift-equivariance to large deviation]
Convolution is shift-equivariant
Pooling breaks shift-equivariance
Every pooling increases periodicity
Alternative downsampling methods
• Blur+subsample
• Antialiasing in signal processing; image processing; graphics
• Max-pooling
• Performs better in deep learning applications [Scherer 2010]
Reconcile antialiasing with max-pooling
• Strided MaxPool: max over each window with stride 2 → shift-equivariance lost; heavy aliasing
• Equivalent interpretation: max evaluated densely (stride 1), which preserves shift-equivariance, followed by subsampling, which loses it
• Anti-aliased pooling (MaxBlurPool): max evaluated densely (stride 1) → blur (preserves shift-eq.) → subsample: shift-equivariance still lost at subsampling, but with reduced aliasing
• Blur + subsample are evaluated together as “BlurPool”
[Animation: (1) max evaluated densely, then (2) blur + subsample]
Shift-equivariance is better preserved
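A minimal PyTorch sketch of MaxBlurPool, assuming a fixed [1 2 1] binomial blur filter (the released code at richzhang.github.io/antialiased-cnns supports several filter sizes):

```python
# Sketch of MaxBlurPool: (1) max evaluated densely (stride 1),
# then (2) BlurPool = depthwise blur + subsample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxBlurPool2d(nn.Module):
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        k = torch.tensor([1., 2., 1.])
        k = torch.outer(k, k)              # 2-D binomial filter
        k = k / k.sum()                    # normalize so blurring preserves intensity
        # one copy of the filter per channel, applied depthwise
        self.register_buffer('kernel', k.expand(channels, 1, 3, 3).contiguous())

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)  # (1) dense max: shift-equivariant
        x = F.pad(x, (1, 1, 1, 1), mode='reflect')    # keep size before the 3x3 blur
        return F.conv2d(x, self.kernel, stride=self.stride,
                        groups=x.shape[1])            # (2) blur + subsample

# e.g., replace nn.MaxPool2d(kernel_size=2, stride=2) with MaxBlurPool2d(channels=64)
```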
[Plots: per-layer deviation from shift-equivariance for the Baseline (MaxPool) vs. Anti-aliased (MaxBlurPool) network; the anti-aliased network stays much closer to perfect shift-equivariance]
Antialiasing any downsampling layer
Max pooling (VGG, AlexNet):                MaxPool (stride 2)      →  Max (stride 1) + BlurPool (stride 2)
Strided convolution (ResNet, MobileNetv2): Conv (stride 2) + ReLU  →  Conv (stride 1) + ReLU + BlurPool (stride 2)
Average pooling (DenseNet):                AvgPool (stride 2)      →  BlurPool (stride 2)
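The released antialiased-cnns package provides these swaps as drop-in layers and pre-trained models; a usage sketch following its README (treat the exact signatures as assumptions):

```python
# Sketch of the swap rules above using the antialiased-cnns package
# (pip install antialiased-cnns).
import torch.nn as nn
import antialiased_cnns

model = antialiased_cnns.resnet50(pretrained=True)  # anti-aliased ResNet-50

C = 64  # feature channels
# MaxPool(stride 2) -> Max(stride 1) + BlurPool(stride 2)
aa_maxpool = nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1),
                           antialiased_cnns.BlurPool(C, stride=2))
# Conv(stride 2) + ReLU -> Conv(stride 1) + ReLU + BlurPool(stride 2)
aa_conv = nn.Sequential(nn.Conv2d(C, C, kernel_size=3, stride=1, padding=1),
                        nn.ReLU(),
                        antialiased_cnns.BlurPool(C, stride=2))
```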
[Scatter plots: ImageNet classification accuracy vs. shift-invariance for baseline and anti-aliased networks; anti-aliased models improve both]
Antialiasing also improves accuracy
Qualitative examples
Image-to-Image Translation
[Figure columns: Input · Baseline · Antialias [1 1] · Antialias [1 2 1]]
Isola et al. Pix2pix. CVPR 2017.
Striding (stride=2) aliases
Add antialiasing filter
+ shift-equivariance
+ accuracy
Additionally
+ stability to other perturbations
+ robustness to corruptions
Discussion
Discriminative Learning
Semantic labels (“Rock beauty”) → Textures? Edges? Parts? Perceptual similarity?
+ Solve discriminative tasks
+ Learn about the visual world

Image Synthesis
Raw, unlabeled data → Textures? Edges? Parts? Perceptual similarity?
+ Solve graphics tasks
+ Force the network to learn about the visual world for free
+ Engineering in inductive biases (shift-invariance) still valuable
Computer Vision · Graphics · Deep Learning · Machine Learning
Human Computer Interaction · Natural Language Processing
San Jose · San Francisco · Seattle
Thank You!
Anti-aliasing CNNs
Zhang. In ICML, 2019.
richzhang.github.io/antialiased-cnns
Perceptual Similarity
Zhang, Isola, Efros, Shechtman, Wang.
In CVPR, 2018.
richzhang.github.io/PerceptualSimilarity
Split-Brain Autoencoder
Zhang, Isola, Efros. In CVPR, 2017.
richzhang.github.io/splitbrainauto
Colorization
Zhang, Isola, Efros. In ECCV, 2016.
Zhang*, Zhu*, et al. In SIGGRAPH, 2017.
richzhang.github.io/colorization
richzhang.github.io/ideepcolor
BicycleGAN
Zhu, Zhang, Pathak, Darrell, Efros,
Wang, Shechtman. In NIPS, 2017.
junyanz.github.io/BicycleGAN
Detecting Photoshop.
Wang, Wang, Owens, Zhang, Efros.
In ICCV, 2019.
peterwang512.github.io/FALdetector
Backup
Baseline vs Anti-aliased
Shift-Invariance vs Classification Accuracy
[Panels: Train without Data Augmentation · Train with Data Augmentation]
Data augmentation increases both
accuracy and shift-invariance
Engineering in shift-invariance is
“free” data augmentation
Boosts shift-invariance while
maintaining accuracy
[Bar charts: % agreement with human judges on Our Distortions vs. Real Algorithm Outputs. Bars: Human, L2/PSNR, SSIM, FSIMc, Gaussian, K-Means, Watch Obj., Split-Brain, Puzzle, BiGAN, SqueezeNet, AlexNet, VGG; grouped as Low-level, Net (Random), Net (Unsupervised), Net (Self-supervised), Net (Supervised), Percep-Trained (Frozen), Percep-Trained (Tuned), Human]
Near human-level performance
within our distortions
Linear “calibration” on our
distortions transfers successfully
Training on the direct task is not the complete solution
Failure Cases
– Method can make mistakes
Dorothea Lange. Migrant Mother, 1936.
Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C
Automatic Results with Deep Networks
[Figure columns: Grayscale input · Larsson et al. In ECCV, 2016 · Iizuka et al. In SIGGRAPH, 2016 · Zhang, Isola, Efros. In ECCV, 2016]
Dorothea Lange. Migrant Mother, 1936.
Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C
Zhang*, Zhu*, Isola, Geng, Lin, Yu, Efros. Real-Time User-Guided Image Colorization with Learned Deep Priors. In SIGGRAPH, 2017.