1
The neural tangent link between CNN
denoisers and non-local filters
Julián Tachella
School of Engineering
University of Edinburgh
J. Tachella, J. Tang and Mike Davies, arXiv:2006.02379, 2020.
Joint work with J. Tang and M. Davies
The deep learning era
2
CNNs offer state-of-the-art image denoising, e.g.:
[figure: restored image]
K. Zhang, W. Zuo, Y. Chen, D. Meng and L. Zhang, "Beyond a Gaussian Denoiser:
Residual Learning of Deep CNN for Image Denoising," IEEE Trans Image Proc, 2017.
What is a CNN?
3
An $L$-layer vanilla CNN can be defined via the recursion
$a_i^1 = W_i^1 y$
$a_i^\ell = \sum_{j=1}^{c} W_{i,j}^\ell \, \phi(a_j^{\ell-1})$
$z = \sum_{j=1}^{c} W_j^L \, \phi(a_j^{L-1})$
The weights $w$ are adapted over clean training data $u$ to minimize
$\arg\min_w \| z(w) - u \|_2^2$
[figure: restored image]
Learning a very high-dimensional function $z: \mathbb{R}^d \to \mathbb{R}^d$
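As a concrete illustration of this recursion, here is a minimal PyTorch sketch of such a vanilla denoising CNN. It is a hypothetical example, not the network used in the paper; the depth L, width c and kernel size are placeholder choices.

```python
import torch
import torch.nn as nn

class VanillaCNN(nn.Module):
    """L-layer vanilla CNN: a^1 = W^1 y, a^l = W^l phi(a^{l-1}), z = W^L phi(a^{L-1})."""
    def __init__(self, L=5, c=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        convs = [nn.Conv2d(1, c, kernel_size, padding=pad, bias=False)]   # a^1 = W^1 y
        convs += [nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)
                  for _ in range(L - 2)]                                   # hidden layers
        self.hidden = nn.ModuleList(convs)
        self.last = nn.Conv2d(c, 1, kernel_size, padding=pad, bias=False)  # z = W^L phi(a^{L-1})
        self.phi = nn.ReLU()

    def forward(self, y):
        a = self.hidden[0](y)
        for conv in self.hidden[1:]:
            a = conv(self.phi(a))
        return self.last(self.phi(a))

# supervised training on clean targets u: argmin_w ||z(w) - u||_2^2
net = VanillaCNN()
y = torch.randn(1, 1, 64, 64)   # stand-in noisy input
u = torch.randn(1, 1, 64, 64)   # stand-in clean target
loss = ((net(y) - u) ** 2).sum()
loss.backward()
```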
4
How do CNNs work?
Do we really know?
If not, how can we design novel, better,
faster solutions?
Today:
What can be trained using a practical
network and practical learning
algorithms?
5
The Deep Image Prior
Deep image prior [Ulyanov et al., 2018]: denoising without training data (cf. Noise2Self [Batson, 2019])
[diagram: white noise input → CNN (autoencoder architecture) → corrupted target]
Minimize the "self-supervised" loss: $\| z(w) - y \|_2^2$
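A minimal sketch of the DIP training loop just described, assuming a network `net` (e.g. an autoencoder or U-Net) has already been built; the optimizer, learning rate and iteration budget are illustrative, not the original DIP settings.

```python
import torch

def dip_denoise(net, y, n_iters=2000, lr=1e-3):
    """Minimal DIP-style loop: fixed white-noise input, corrupted image y as target."""
    x = torch.randn_like(y)                         # fixed iid noise input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        z = net(x)
        loss = ((z - y) ** 2).sum()                 # self-supervised loss ||z(w) - y||^2
        loss.backward()
        opt.step()
    # early stopping is essential in practice: stop well before the loop fits the noise
    return net(x).detach()
```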
The big mystery
7
Deep image prior [Ulyanov et al., 2018]
Network: 2M parameters
Single target image: 49k pixels
"Self-supervised" loss: $\| z(w) - y \|_2^2$
The set of global minima has roughly 1.95M dimensions…
Yet early-stopping consistently provides SOTA?!
[Liu et al., 2020]
Rethinking CNN Denoising
8
Classical algorithms (BM3D, NLM) also rely on a single corrupted image
They provide similar results to the DIP…
Is there any link between them?
[figure: BM3D (32.8 dB) vs. DIP (31.5 dB)]
Patch-based methods
9
Patch-based methods as global (kernel) filters [Milanfar, 2012, 2014]
Corrupted image $y \in \mathbb{R}^d$
Filter matrix $W = \mathrm{diag}\!\left(\tfrac{1}{1^T K}\right) K$ with $K_{i,j} = k(y_i, y_j)$
e.g., non-local means:
$k_{\mathrm{NLM}}(y_i, y_j) = \exp\!\left(-\tfrac{1}{2\sigma^2}\, \|y_{P_i} - y_{P_j}\|_2^2\right)$
where $y_{P_j}$ denotes the patch centred at $y_j$.
[figure: patches $y_{P_i}$ and $y_{P_j}$ in the image]
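A minimal NumPy sketch of this construction on a 1D signal (my simplification; an image works the same way after vectorization). The `patch_radius` and `sigma` values are illustrative, and since K is symmetric the row normalization below coincides with the slide's diag(1/(1ᵀK))K.

```python
import numpy as np

def nlm_filter_matrix(y, patch_radius=3, sigma=0.1):
    """Global NLM filter W = diag(1/(1^T K)) K with K_ij = exp(-||y_Pi - y_Pj||^2 / (2 sigma^2))."""
    d = len(y)
    pad = np.pad(y, patch_radius, mode="reflect")
    # one patch of length 2*patch_radius+1 centred at every sample
    patches = np.stack([pad[i:i + 2 * patch_radius + 1] for i in range(d)])
    # pairwise squared patch distances (d x d)
    sq_dists = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)   # normalized affinities

y = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)
W = nlm_filter_matrix(y)
z = W @ y   # standard filtering z = W y
```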
10
• Standard filtering: $z = Wy$
• We can also iterate the solution (twicing):
$z_{t+1} = z_t + W(y - z_t)$
Eigendecomposition $W = V \Sigma V^T$ gives
$z_t = \sum_{i=1}^{d} \left(1 - (1 - \lambda_i)^t\right)(v_i^T y)\, v_i$
$\mathrm{MSE} = \sum_{i=1}^{d} \underbrace{(1 - \lambda_i)^{2t}}_{\text{bias}^2} + \underbrace{\left(1 - (1 - \lambda_i)^t\right)^2}_{\text{variance}}$
Early-stop for the best trade-off
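A small sketch of the twicing iteration, reusing a filter matrix such as the NLM `W` built above (an assumption of this example); the number of iterations is the early-stopping knob.

```python
import numpy as np

def twicing(W, y, n_iters=20):
    """Iterated filtering z_{t+1} = z_t + W (y - z_t), starting from z_0 = 0."""
    z = np.zeros_like(y)
    iterates = []
    for _ in range(n_iters):
        z = z + W @ (y - z)
        iterates.append(z.copy())   # keep every iterate so the best stopping time can be picked
    return iterates

# In the eigenbasis W = V diag(lambda) V^T the iterates are
# z_t = sum_i (1 - (1 - lambda_i)^t) (v_i^T y) v_i,
# so directions with large eigenvalues are recovered first and the noisy
# directions (small lambda_i) only for large t, hence the early stopping.
iterates = twicing(W, y)   # W, y from the NLM sketch above
```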
Patch-based methods
11
[figure: eigenvalue spectra of patch-based filter matrices, from Milanfar, 2012]
Best denoising is obtained when the eigenvalue decay is sharp
Training dynamics
12
CNN defined as $z(x, w): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^d$
Loss $\mathcal{L}(w) = \| z(w) - y \|_2^2$ and GD training $w_{t+1} = w_t - \eta \frac{\partial \mathcal{L}}{\partial w}$
Taylor expansion around the initial weights: $z(w) \approx \bar{z}(w) = z(w_0) + \frac{\partial z}{\partial w}(w - w_0)$
$z_{k+1} = z_k + \eta \frac{\partial z}{\partial w} \frac{\partial z}{\partial w}^T (y - z_k)$
The NTK $\eta\Theta \in \mathrm{PSD}_d$ plays the role of the filter matrix $W$ in twicing!
$z_{k+1} = z_k + W(y - z_k)$
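For small examples the kernel ηΘ can be formed explicitly with autograd. A minimal sketch (feasible only for tiny networks and images, since the Jacobian has d × p entries):

```python
import torch

def empirical_ntk(net, x, eta=1.0):
    """Empirical NTK (eta*Theta)_{ij} = eta * <dz_i/dw, dz_j/dw> for a single input x."""
    z = net(x).reshape(-1)
    rows = []
    for i in range(z.numel()):
        grads = torch.autograd.grad(z[i], list(net.parameters()), retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)      # d x p Jacobian dz/dw at the current weights
    return eta * (J @ J.T)     # d x d PSD matrix: the filter of the linearized dynamics

# The linearized GD dynamics then read z_{k+1} = z_k + (eta*Theta)(y - z_k),
# i.e. twicing with W = eta*Theta.
```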
A closer look at the neural tangent kernel
13
$\eta\Theta_{i,j} = \eta \, \frac{\partial z_i}{\partial w} \frac{\partial z_j}{\partial w}^T = \sum_{\ell=1}^{L} \sum_{k=1}^{c} \eta \, \frac{\partial z_i}{\partial w_k^\ell} \frac{\partial z_j}{\partial w_k^\ell}$
Assume:
1. Overparameterization: the number of channels of the hidden layers $c \to \infty$
2. Standard iid initialization of the weights with variance $\propto c^{-1}$ (He, LeCun or Glorot initialization)
3. Correct learning rate $\eta \propto c^{-1}$ to avoid divergent dynamics
Then the kernel concentrates around its mean as $O(c^{-0.5})$.
Neural Tangent Kernel
14
1. Each individual weight changes very slightly, even less in the hidden layers:
$\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \mathcal{O}(c^{-1})$ if $\ell = L$ (last layer), $\mathcal{O}(c^{-3/2})$ otherwise
2. The total change is vanishingly small: $\sup_t \|w^t - w^0\|_2 = \mathcal{O}(c^{-0.5})$ (hence the Taylor expansion)
3. The preactivations at each layer $a^\ell \sim \mathcal{N}(0, \Sigma_{a^\ell})$ do not change significantly during training
4. Filters are random
[diagram: forward pass $x \to a^1 \to a^2 \to \dots \to z$]
Neural Tangent Kernel
15
NTK theory
1. NTK 𝜂Θ is fixed throughout training
2. It is fully characterized by the architecture, random initialization and input statistics
3. Linear dynamics describe well the evolution of the network [Lee et al., 2019]
4. NTK can be computed in closed form!
CNN kernel
16
Each CNN block can be associated with an operator $\mathrm{PSD}_d \to \mathrm{PSD}_d$:
1. Input: $W = xx^T$
2. Convolution layer: $\mathcal{A}(W)_{i,j} = \sum_{i' \in P_i,\, j' \in P_j} W_{i',j'}$ (patch size = convolution kernel size)
3. Upsampling and downsampling: $W' = UWU^T$, where $U$ is the up(down)sampling matrix
4. Non-linearity: $V(W) = \mathbb{E}_{x\sim\mathcal{N}(0,W)}\{\phi(x)\phi(x)^T\}$ and $V'(W) = \mathbb{E}_{x\sim\mathcal{N}(0,W)}\{\phi'(x)\phi'(x)^T\}$
e.g., for ReLUs: $V(W)_{i,j} = \frac{\sqrt{W_{i,i} W_{j,j}}}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right)$, $V'(W)_{i,j} = 1 - \frac{\varphi}{\pi}$, with $\varphi = \arccos\!\left(\frac{W_{i,j}}{\sqrt{W_{i,i}W_{j,j}}}\right)$
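A minimal NumPy sketch of the ReLU maps V and V'. It follows the slide's normalization (which absorbs the 1/2 of the textbook Gaussian arc-cosine expectation into a He-style initialization convention), so treat the constants as an assumption rather than a definitive implementation.

```python
import numpy as np

def relu_dual_maps(W, eps=1e-12):
    """ReLU maps V(W) and V'(W) entering the CNN kernel recursion.

    V'(W)_ij = 1 - phi/pi and
    V(W)_ij = sqrt(W_ii W_jj)/pi * (sin phi + (pi - phi) cos phi)  (slide's convention).
    """
    diag = np.sqrt(np.clip(np.diag(W), eps, None))
    norm = np.outer(diag, diag)
    cos_phi = np.clip(W / norm, -1.0, 1.0)
    phi = np.arccos(cos_phi)
    V = norm / np.pi * (np.sin(phi) + (np.pi - phi) * cos_phi)
    V_prime = 1.0 - phi / np.pi
    return V, V_prime
```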
CNN kernel
17
A simple example: a 1-hidden-layer CNN with $r \times r$ convolutions. Its NTK is a non-local $r \times r$ patch-based filter:
$\eta\Theta_{i,j} = k_{\mathrm{CNN}}(x_i, x_j) = \frac{\|x_{P_i}\| \, \|x_{P_j}\|}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right)$
where $\varphi$ is the angle between the patches, and $x_{P_i}$ is the patch centred at $x_i$.
[figure: $k_{\mathrm{CNN}}$ vs. $k_{\mathrm{NLM}}$, evaluated for $\|x_{P_i}\| = \|x_{P_j}\| = 1$ and $\sigma = 1$ in the NLM kernel]
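A small sketch evaluating this kernel between two patches and comparing it with the NLM kernel under the normalization stated above (unit-norm patches, σ = 1); purely illustrative.

```python
import numpy as np

def k_cnn(p_i, p_j):
    """NTK of a 1-hidden-layer ReLU CNN between two image patches (slide's formula)."""
    ni, nj = np.linalg.norm(p_i), np.linalg.norm(p_j)
    cos_phi = np.clip(p_i @ p_j / (ni * nj), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return ni * nj / np.pi * (np.sin(phi) + (np.pi - phi) * np.cos(phi))

def k_nlm(p_i, p_j, sigma=1.0):
    return np.exp(-np.sum((p_i - p_j) ** 2) / (2 * sigma ** 2))

# For unit-norm patches both kernels are decreasing functions of the angle between them.
p = np.random.randn(9); p /= np.linalg.norm(p)   # 3x3 patch, flattened
q = np.random.randn(9); q /= np.linalg.norm(q)
print(k_cnn(p, q), k_nlm(p, q))
```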
CNN kernel
18
CNN NTK with a noisy house image as input:
[figure: input image and the eigenvalue spectrum of the CNN NTK]
The kernel matrix exhibits very fast eigenvalue decay = few effective degrees of freedom of the linear smoother
Nystrom denoising
19
We can directly compute $\eta\Theta$ to do the filtering!
However, $\eta\Theta$ is of size $d \times d$: prohibitive complexity for large images!
• Low-rank Nystrom approximation of $\eta\Theta$ using $m \ll d$ columns [Milanfar, 2014]
• Computing correlations with only 1% of the patches is enough!
[figure: CNN trained with GD, execution time 800 s, vs. Nystrom with 1% of pixels, execution time 3 s]
Mystery solved, we can go to the beach now
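A hedged NumPy sketch of Nystrom filtering. The `kernel_fn` interface (returning the requested block of ηΘ) is a hypothetical convenience, and the single application of a row-normalized approximation is a simplification of the scheme used in the paper.

```python
import numpy as np

def nystrom_filter(y, kernel_fn, m_frac=0.01, seed=0):
    """Low-rank Nystrom approximation of the d x d kernel using m << d sampled pixels.

    kernel_fn(rows, cols) is a hypothetical helper returning the kernel block
    with the requested row/column indices, so only d x m entries are computed.
    """
    rng = np.random.default_rng(seed)
    d = len(y)
    m = max(1, int(m_frac * d))
    idx = rng.choice(d, size=m, replace=False)      # ~1% of the pixels/patches
    C = kernel_fn(np.arange(d), idx)                # d x m sampled columns
    Wm = C[idx]                                     # m x m block
    Wm_pinv = np.linalg.pinv(Wm)
    # Nystrom: K ~= C Wm^+ C^T; apply the row-normalized approximation to y
    Ky = C @ (Wm_pinv @ (C.T @ y))
    K1 = C @ (Wm_pinv @ (C.T @ np.ones(d)))
    return Ky / K1
```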
Noise input
21
The DIP inputs iid noise $x \sim \mathcal{N}(0, 1)$, not the corrupted image!
⇒ the resulting filter does not depend on the image in any way…
Even worse, for a vanilla CNN we get
$\eta\Theta_{i,j} = \frac{1}{d}\begin{cases} 1 & \text{if } i = j \\ 0.25 & \text{otherwise} \end{cases}$
This cannot be obtained via low-pass filtering!
The DIP does not use GD, but Adam.
[Heckel, 2020]: for a U-Net CNN the NTK is low-pass
[figure: DIP smoothing kernels, from Cheng, 2019]
Adam optimizer
22
The Adam optimizer belongs to the family of adaptive gradient methods (Adagrad, RMSProp, etc.):
$w_{t+1} = w_t - \eta H_t \frac{\partial \mathcal{L}}{\partial w}$, where $H_t$ is built from running averages of the squared gradient.
Without the running averages this reduces to sign gradient descent: $w_{t+1} = w_t - \eta \, \mathrm{sign}\!\left(\frac{\partial \mathcal{L}}{\partial w}\right)$
1. The metric depends on the step size [Gunasekar et al., 2018]
2. Hidden layers have a larger change than in GD: $\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \mathcal{O}(c^{-1})$ for all $\ell$, and $\sup_t \|w^t - w^0\|_2 = \mathcal{O}(1)$
3. Dynamics are not well described by a Taylor expansion around initialization
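A minimal sketch of one sign-gradient-descent step (the running-average-free caricature of Adam used in this analysis); `net`, `x` and `y` are assumed to be defined as in the earlier DIP sketch, and the step size is illustrative.

```python
import torch

def sign_gd_step(net, x, y, lr=1e-4):
    """One step of sign gradient descent on the self-supervised loss ||net(x) - y||^2."""
    loss = ((net(x) - y) ** 2).sum()
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= lr * torch.sign(p.grad)   # w <- w - eta * sign(dL/dw)
    return loss.item()
```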
Adam optimizer
23
Adaptive filtering: the NTK is not fixed throughout training
$z_{t+1} = z_t + \eta\Theta_t(y - z_t)$, with $\Theta_t = \frac{\partial z}{\partial w} H_t \frac{\partial z}{\partial w}^T$
At initialization, the matrix adapts using non-local information about the image residual
$\frac{\partial \mathcal{L}}{\partial a^L} = \delta^L = y - z_t$, propagated through back propagation
⇒ the pre-activations $a^\ell$ no longer remain constant: they change proportionally to the $\delta^\ell$, which adapt through back-propagated nonlocal filters
[diagram: forward pass $x \to a^1 \to a^2 \to z$ and backward errors $\delta^L \to \delta^2 \to \delta^1$]
Experiments
24
We evaluate:
• U-Net architecture (8 hidden layers)
• Autoencoder (no skip connections)
• Single-hidden-layer CNN
Case 1 (DIP): white noise input → CNN → corrupted target
Case 2 (cf. Noise2Self): input image → CNN → corrupted target
25
Peak signal-to-noise ratio (PSNR) results on a standard denoising dataset
Experiments
26
Experiments
27
Autoencoder, noise at the input: Adam and GD training vs. number of channels
[plot annotation: overparameterized regime]
Experiments
28
Autoencoder, noise at the input: weight changes under Adam and GD training vs. number of channels
[plot annotations: scaling curves $O(1)$, $O(c^{-0.5})$, $O(c^{-1})$, $O(c^{-1.5})$]
Hence each weight makes a similar small contribution (in contrast to the convolutional sparse coding model)
29
Experiments
Leading eigenvectors of the last hidden preactivations
Conclusions
30
• CNNs can be seen as exploiting some form of nonlocal filter structure
• Hence CNNs have a very strong bias towards clean images
• Effective degrees of freedom ≪ parameters in the network
• Use Nystrom to avoid training 2M parameters
• Optimizer plays a key role
Future work
• Fast approximations of other CNNs?
• Learn better image models from CNNs?
• Extend to more general imaging inverse problems
• Understanding Adam training dynamics (and hence the DIP)
Thanks for your attention!
Tachella.github.io
• Codes
• Presentations
• … and more
31
Editor's Notes
  1. State-of-the-art image denoisers are CNNs. They seem to require many weights (500k–2M), lots of training data (so are they susceptible to domain shift?) and lots of training time. Is this correct? What exactly do they learn?
  2. State-of-the-art image denoisers are CNNs. They seem to require many weights (500k–2M), lots of training data (so are they susceptible to domain shift?) and lots of training time. Is this correct? What exactly do they learn?
  3. So how do CNNs work / what are they learning? – I like this analogy. There is no lack of descriptions of what a CNN is doing and how it learns what it does: CNNs mimic the brain, convolutional sparse coding, similarity to the human visual system, hierarchically learning deep abstractions, kernel methods… The answer, I think, is that nobody really knows. In which case (to that extent this talk will be no different) I will be looking at the relationship between CNNs and nonlocal filters.
  4. Recent paper by Ulyanov raised a number of questions here. DIP – essentially a U-net with very few skip connections with random noise as the input Trained on a single image that is simultaneously the training and testing image. The loss is simply the self-supervised loss (L2 error between the output and the noisy image) … so why would we want to do this???
  5. That is the big mystery. The CNN has 2M parameters compared to the image's 50K pixels. That means the cost function typically has a zero-error global minima set of ~1.95M dimensions… lots of solutions, all of which just output the original noisy image. So how come early stopping of training consistently provides SOTA performance, and not just in denoising but in other image processing problems too, such as inpainting?
  6. That is the big mystery. The CNN has 2M parameters compared to the image's 50K pixels. That means the cost function typically has a zero-error global minima set of ~1.95M dimensions… lots of solutions, all of which just output the original noisy image. So how come early stopping of training consistently provides SOTA performance, and not just in denoising but in other image processing problems too, such as inpainting?
  7. So let's try to rethink CNN denoising. Note: BM3D also works on a single image and "learns" the structure from the corrupted image itself. DIP and BM3D produce similar results/performance. Idea: maybe there's a link.
  8. The key ideas are: Patch based methods can be thought of as applying a global linear operation using the appropriate matrix W Composed of a normalized kernel affinity matrix measuring patch similarity, e.g. for NLM This is not really a linear transform since the filter operation W is itself data dependent. Standard filtering applies W to the noisy image (average over similar patches) Alternatively we can iterate the estimate as in Twicing (Tukey) which trades bias for variance. Blurry estimate converges towards the noisy target as t-> inf As with the DIP, the procedure is early-stopped to avoid overfitting the noise. So can we relate this to the DIP estimation?
  9. Digging a bit deeper we can see that: individual weights have negligible change, and the L2 norm of the difference is vanishingly small (hence the Taylor expansion). Similarly, the pre-activations in the net don't change significantly and the filters are random. It is currently an open question when and how well the NTK model actually describes real DNNs… We will come back to this.
  10. Here we can use a really nice result from a recent paper by Jacot and co-workers.
  11. As a simple example let’s consider a CNN with a single hidden layer. The rxr convolutions and nonlinearity calculate a non-local patch based affinity matrix with the following kernel While this patch based similarity metric is different to that of say NLM, if we normalise the patches we can see that it has a very similar form.. (though I have chosen parameters judiciously to maximise the similarity) Ultimately this means we can directly compute the NTK and do filtering
  12. The NTK matrix for a simple CNN with noisy Baboon as input has the following eigenspectrum
  13. Although NTK matrix is prohibitively large – we can exploit convolutional structure Also no need to iterate: we can solve directly using Nystrom low rank approx. Equiv to only computing correlations with 1% of patches Denoises without training
  14. What we have just described is closer to the Noise2Self model… a sort of autoencoder, which can also denoise a single image without additional training data. The noisy image is placed as both the input and the target. In Batson's paper they use a carefully constructed self-supervised loss that avoids learning the identity mapping (which is credited with the success of the method). However, what we have seen is that with early stopping such models work even without this.
  15. Our analysis here is very preliminary… so I will only give a brief overview. Adam adapts the gradient using running averages… an analytical approximation for this (removing the running averages) is signed gradient descent (not perfect but highlights what we want to show) With signed GD individual weights still only undergo small changes… but larger than before (Order c^{-3/2} for L_infinity, Order c^{-0.5} for L2) But L2 change is large – hence Taylor expansion not good approximation
  16. Thus the NTK is NOT constant throughout training. Following a mean field theory for CNNs we conjecture that the matrix adapts using nonlocal information calculated about the image residual through BP of errors This results in non constant pre-activations with covariances that can be calculated recursively using similar matrix operators as in the NTK
  17. So let’s examine this ideas in some experiments. We looked at two CNNs – the U-net as used in DIP, and our vanilla CNN Here we consider two set ups – the DIP set up (noise as input) and the Noise2Self set up (input image as input) Note that in each scenario the expressive power of the U-net networks is sufficient to achieve zero training error (i.e. we can fully predict the noisy image in each case)
  18. We also looked at training with Adam and with GD and evaluated these methods on a standard denoising dataset (Gaussian noise sigma = 25, PSNR = 20.18) What we see broadly fits our theory: With image as input both CNNs perform reasonable denoising using either GD or Adam; Adam is better for the U-net; With Noise as an input we now see a significant difference between GD and Adam training. GD provides poor denoising, while Adam seems to have been able to adapt to the target image data We also calculated the Nystrom version of the vanilla net and got similar (even slightly better) performance to GD and Adam with image as input (Nystrom provides additional low rank regularisation that may explain the improvement over the CNN here)
  19. If we now look at what the images looked like we in particular see that GD training with noise as input acts as a crude LPF (particularly for the U-net) In the other cases: noise+Adam or image+GD/Adam we have a good estimate and certainly have not experienced LPF – all images preserve detail structure (Sigma = 25, PSNR = 20.18, CBM3D = 33.03)
  20. So in our next experiment we looked at U-net+noise input and the difference between Adam and GD training versus the number of channels Observations: Adam has saturated as the number of channels grows – suggests we are not just seeing finite width effects
  21. Similarly, when we look at the change in weights we see broadly what we have predicted… First, the L2 change in weights for Adam is order 1, hence not in the NTK regime. In contrast, for GD the weight change decays roughly as c^{-0.5}. Next, looking at the l_infinity norm of the weights, we see that individually all weights have a change that decays with the number of channels, suggesting that each weight provides a similar small contribution to the solution (in contrast with convolutional sparse coding arguments where only a few weights contribute significantly).
  22. Finally it is instructive to look at the eigenvectors of the last hidden preactivations According to NTK theory GD does not modify the distribution of the preactivations during training, hence they remain non-informative (low-pass) with noise at the input (left) However, they carry non-local information when the image is placed at the input. (right) The centre pictures show the eigenvectors of Adam trained network with noise at the input and we see similar information to the GD with an image input
  23. So what have we learnt? We know that CNNs have a very strong bias towards clean images. We have argued that this is due to a natural nonlocal model existing in CNNs We have also seen that the choice of optimiser plays a key role – this is currently a hot topic in mean field modelling of NNs The effective degrees of freedom of the network are very different to that of the number of parameters in the net