Abstract: While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on ImageNet classification have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new dataset of human perceptual similarity judgments. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins on our dataset. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
Despite their strong transfer performance, deep convolutional representations surprisingly lack a basic low-level property -- shift-invariance, as small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling, strided-convolution, and average-pooling, ignore the sampling theorem. The well-known signal processing fix is anti-aliasing by low-pass filtering before downsampling. However, simply inserting this module into deep networks degrades performance; as a result, it is seldom used today. We show that when integrated correctly, it is compatible with existing architectural components, such as max-pooling and strided-convolution. We observe increased accuracy in ImageNet classification, across several commonly-used architectures, such as ResNet, DenseNet, and MobileNet, indicating effective regularization. Furthermore, we observe better generalization, in terms of stability and robustness to input corruptions. Our results demonstrate that this classical signal processing technique has been undeservingly overlooked in modern deep networks.
Goal: Set up a pre-training scheme to induce a "useful" representation

- Egomotion: Agrawal et al. ICCV 2015; Jayaraman et al. ICCV 2015.
- Context: Doersch et al. ICCV 2015; Noroozi and Favaro. ECCV 2016; Pathak et al. CVPR 2016.
- Autoencoders: Hinton & Salakhutdinov. Science 2006. Denoising autoencoders: Vincent et al. ICML 2008.
- Video: Wang et al. ICCV 2015; Misra et al. ECCV 2016; Pathak et al. CVPR 2017.
- Audio: de Sa. NIPS 1994; Owens et al. ECCV 2016; Arandjelovic & Zisserman. ICCV 2017.
- Generative modeling: Donahue et al. ICLR 2017; Dumoulin et al. ICLR 2017.
Cross-Channel Encoder
Split raw data X into channel subsets X1 and X2; induce abstraction through prediction (predict one subset from the other).
c.f. Larsson et al. Colorization as a Proxy Task for Visual Understanding. In CVPR, 2017.
Zhang, Isola, Efros. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In CVPR, 2017.
c.f. LeCun, Denker, Solla. Optimal Brain Damage. In NIPS, 1989.
Split-Brain Autoencoder
Two cross-channel encoders in opposite directions: predict X2 from X1 and X1 from X2; the two learned representations are concatenated to process the full input X.
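A minimal NumPy sketch of the split-brain idea. The linear "encoders" below are hypothetical stand-ins for the deep sub-networks, and training (each half predicting the other) is omitted; only the channel split and the concatenated representation are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    # Stand-in for a deep sub-network: linear map over channels + ReLU.
    return np.maximum(0.0, np.tensordot(w, x, axes=([1], [0])))

# Raw data X: a 3-channel "image", split into disjoint channel groups,
# e.g. lightness (L) vs. color (ab) in Lab space.
X = rng.standard_normal((3, 8, 8))
X1, X2 = X[:1], X[1:]

# Two cross-channel encoders trained in opposite directions:
# one would be trained to predict X2 from X1, the other X1 from X2.
W1 = rng.standard_normal((4, 1))  # weights of the X1 -> X2 encoder
W2 = rng.standard_normal((4, 2))  # weights of the X2 -> X1 encoder

# The split-brain representation concatenates both halves' features.
F = np.concatenate([encoder(X1, W1), encoder(X2, W2)], axis=0)
print(F.shape)  # (8, 8, 8): 4 + 4 feature channels over an 8x8 grid
```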
Split-Brain Autoencoder on Images
[Diagram: input image X, predicted image X.]
Task & Dataset Generalization
Does the feature representation transfer to other tasks and datasets?

Berkeley-Adobe Perceptual Patch Similarity Dataset
Zhang, Isola, Efros, Shechtman, Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
[Figure: Classification (label: "rock beauty") vs. Perceptual Judgments (a distance D(·, ·) between two patches).]
How different are these patches? Which patch is more similar to the middle?
Judged by: Humans · L2/PSNR · SSIM/FSIMc · Deep Networks?
"Perceptual Losses"
Gatys et al. In CVPR, 2016.
Johnson et al. In ECCV, 2016.
Dosovitskiy and Brox. In NIPS, 2016.
Chen and Koltun. In ICCV, 2017.
Deep Networks as a Perceptual Metric
[Diagram: images x and x0 each pass through network F; per-layer features are normalized and subtracted; the L2 norm is taken and spatially averaged; layer distances are averaged (Avg) into a single distance d0.]
(1) How well do “perceptual losses” describe perception?
(2) Does it have to be VGG trained on classification?
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
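The distance computation (normalize, subtract, L2 norm, spatial average, average across layers) can be sketched in NumPy. This is an illustrative sketch, not the released implementation: feature extraction by the network F is assumed to have already happened, so the function takes precomputed per-layer feature stacks.

```python
import numpy as np

def lpips_distance(feats_x, feats_x0, weights=None):
    """Perceptual distance between two images from per-layer deep features.

    feats_x, feats_x0: lists of arrays, one per layer, each of shape (C, H, W).
    weights: optional list of per-layer channel weights (the learned "w");
             None corresponds to the "off-the-shelf" setting (w = 1).
    """
    layer_dists = []
    for l, (f, f0) in enumerate(zip(feats_x, feats_x0)):
        # Unit-normalize each spatial feature vector along the channel axis.
        fn = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-10)
        f0n = f0 / (np.linalg.norm(f0, axis=0, keepdims=True) + 1e-10)
        diff = (fn - f0n) ** 2                       # subtract
        if weights is not None:
            diff = weights[l][:, None, None] * diff  # learned channel weights
        layer_dists.append(diff.sum(axis=0).mean())  # L2 over channels, spatial avg
    return float(np.mean(layer_dists))               # average across layers
```

Identical inputs give distance 0; any feature difference gives a positive distance.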
% agreement with human judges
[Bar chart. Groups: Low-level (68.9); AlexNet (Random); AlexNet (Unsupervised); AlexNet (Self-supervised); Nets (Supervised - ImageNet classification); Human (82.6). Deep-network scores range roughly from 69.7 to 78.0.]
- Bigger/Deeper ≠ Better
- Networks perform strongly across supervisory signals and architectures
- Fitting some data is important
- VGG on classification ("perceptual loss") correlates well
Can we train on perceptual judgments?
Training a Perceptual Metric
[Same setup as before: images x and x0 pass through network F; features are normalized and subtracted, L2-normed, spatially averaged, and averaged across layers into a distance d0, now with a learned per-channel weight w on the feature differences.]
c.f. Gatys et al. CVPR 2016. Johnson et al. ECCV 2016. Dosovitskiy and Brox. NIPS 2016.
Variants:
- Off-the-shelf (w = 1)
- Frozen (backbone fixed, w learned)
- Tuned (backbone and w learned)
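Learning w on a frozen network amounts to a small logistic regression: fit weights on precomputed distances so the weighted metric agrees with human two-alternative forced choice (2AFC) judgments. A minimal sketch, with per-layer (rather than per-channel) weights for brevity; the function name and training details are illustrative, not taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_layer_weights(d0, d1, h, lr=0.5, steps=2000):
    """Learn non-negative per-layer weights w on a frozen network's distances.

    d0, d1: (N, L) per-layer distances of patches 0 and 1 to the reference.
    h:      (N,) fraction of humans judging patch 1 as more similar.
    Model:  P(pick patch 1) = sigmoid(w·d0 - w·d1)  (closer patch preferred).
    """
    w = np.ones(d0.shape[1])
    for _ in range(steps):
        p = sigmoid((d0 - d1) @ w)               # predicted preference
        grad = (d0 - d1).T @ (p - h) / len(h)    # cross-entropy gradient
        w = np.maximum(0.0, w - lr * grad)       # keep weights non-negative
    return w
```

On synthetic judgments generated from known weights, the fit recovers which layers matter: a layer that drives the judgments ends up with a larger weight than an irrelevant one.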
Off-the-shelf networks already perform well.
Training a linear layer on top yields a small performance boost.
Fine-tuning through the representation leads to overfitting.

Training a Perceptual Metric (% agreement with human judges):
                        Off-the-shelf  Frozen  Tuned  Human
Training distribution            76.8    78.7   80.6   82.6
Held-out distribution            65.0    65.3   64.3   69.5
“LPIPS” metric: richzhang.github.io/PerceptualSimilarity
Additionally
- Ensembled-LPIPS. Kettunen, Härkönen, Lehtinen. ArXiv 2019.
- Audio domain. Manocha, Finkelstein, Jin, Bryan, Zhang, Mysore. In progress.
Deep Networks are not Shift-Invariant
[Plots: P(correct class) fluctuates drastically as the input image is shifted.]
Azulay and Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? In arXiv, 2018.
Engstrom, Tsipras, Schmidt, Madry. Exploring the Landscape of Spatial Robustness. In ICML, 2019.
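The underlying failure is ordinary aliasing. A one-dimensional toy sketch (the networks above operate on images, but the mechanism is the same): a high-frequency signal subsampled with stride 2 flips completely under a one-sample shift, while blurring first removes the sensitivity.

```python
import numpy as np

# A high-frequency signal: alternating 0/1, above the post-subsampling Nyquist rate.
x = np.tile([0.0, 1.0], 8)

down = x[::2]                    # naive stride-2 subsampling
down_shift = np.roll(x, 1)[::2]  # the same signal shifted by one sample

print(down)        # all zeros
print(down_shift)  # all ones: a 1-sample shift flips the output entirely

# Low-pass blurring before subsampling (anti-aliasing) removes the sensitivity.
k = np.array([1.0, 2.0, 1.0]) / 4.0
aa = np.convolve(x, k, mode='same')[::2]
aa_shift = np.convolve(np.roll(x, 1), k, mode='same')[::2]
# Away from the boundary, both are ~0.5 regardless of the shift.
```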
Shift-equivariance, per layer
[Plot: deviation from shift-equivariance at each layer (pixels, conv1, pool1, conv2, pool2, conv3, pool3, conv4, pool4, conv5, pool5, classifier, softmax), ranging from perfect shift-equivariance to large deviation.]
Every pooling increases periodicity.
Alternative downsampling methods
• Blur+subsample: antialiasing in signal processing; image processing; graphics
• Max-pooling: performs better in deep learning applications [Scherer 2010]
Reconcile antialiasing with max-pooling
Anti-aliased pooling (MaxBlurPool)
- Strided-MaxPool: max taken with stride 2. Shift-equivariance lost; heavy aliasing.
- Equivalent interpretation: max taken densely (preserves shift-equivariance), followed by subsampling (shift-eq. lost; heavy aliasing).
- MaxBlurPool: max taken densely (preserves shift-eq.), then blur, then subsampling. Shift-eq. lost, but reduced aliasing.
- The blur and subsampling steps are evaluated together as "BlurPool".
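A one-dimensional NumPy sketch of the two pooling variants (an illustration of the decomposition, not the paper's 2-D implementation):

```python
import numpy as np

def max_pool_strided(x):
    # Standard strided max-pool (kernel 2, stride 2): max and subsample at once.
    return np.maximum(x[:-1:2], x[1::2])

def max_blur_pool(x):
    # MaxBlurPool: (1) max densely (stride 1) -> preserves shift-equivariance,
    #              (2) blur with a low-pass filter -> anti-aliasing,
    #              (3) subsample by 2.
    m = np.maximum(x[:-1], x[1:])               # dense max, window 2
    k = np.array([1.0, 2.0, 1.0]) / 4.0         # binomial blur kernel
    return np.convolve(m, k, mode='same')[::2]  # blur, then subsample
```

On a periodic test signal, shifting the input by one sample changes the strided max-pool output far more than the MaxBlurPool output, which is the point of inserting the blur.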
Striding aliases (stride = 2). Adding an antialiasing filter gives:
+ shift-equivariance
+ accuracy
Additionally:
+ stability to other perturbations
+ robustness to corruptions
Discussion
Discriminative Learning (supervision: semantic labels, e.g. "rock beauty")
+ Solve discriminative tasks
+ Learn about the visual world: textures? edges? parts? perceptual similarity?

Image Synthesis (supervision: raw unlabeled data)
+ Solve graphics tasks
+ Force the network to learn about the visual world for free: textures? edges? parts? perceptual similarity?
+ Engineering in inductive biases (shift-invariance) still valuable
Shift-Invariance vs Classification Accuracy
[Scatter plots: training without vs. with data augmentation.]
Data augmentation increases both accuracy and shift-invariance.
Engineering in shift-invariance is "free" data augmentation: it boosts shift-invariance while maintaining accuracy.
Our Distortions vs. Real Algorithm Outputs
[Bar charts. Legend: Human; Low-level (L2/PSNR, SSIM, FSIMc); Net (Random: Gaussian); Net (Unsupervised: K-Means); Net (Self-supervised: Watch Obj., Split-Brain, Puzzle, BiGAN); Net (Supervised: SqueezeNet, AlexNet, VGG); Percep-Trained (Frozen); Percep-Trained (Tuned).]
- Near human-level performance within our distortions
- Linear "calibration" on our distortions transfers successfully
- Training on the direct task is not the complete solution
Automatic Results with Deep Networks
Grayscale input, colorized by: Larsson et al. In ECCV, 2016. Iizuka et al. In SIGGRAPH, 2016. Zhang, Isola, Efros. In ECCV, 2016.
Dorothea Lange. Migrant Mother, 1936. Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C.
Zhang*, Zhu*, Isola, Geng, Lin, Yu, Efros. Real-Time User-Guided Image Colorization with Learned Deep Priors. In SIGGRAPH, 2017.
Dorothea Lange. Migrant Mother, 1936. Library of Congress, Prints & Photographs Division, FSA/OWI Collection, reproduction number: LC-USF34-9058-C.