1. The role of overparameterization and optimization in CNN denoisers
Julián Tachella, Junqi Tang and Mike Davies
School of Engineering
University of Edinburgh
TOPML Conference
https://arxiv.org/abs/2006.02379
2. The deep learning era
CNNs offer state-of-the-art image denoising
• Not well understood…
• Can we use the neural tangent kernel [Jacot et al.] insight to improve our understanding?
[Figure: restored image, DnCNN [Zhang et al., 2017]]
3. The big mystery
Deep image prior [Ulyanov et al., 2018]
Network: 2M parameters
Single target image: 49k pixels
“Self-supervised” loss: $\|z(w) - y\|_2^2$
Early-stopping consistently provides SOTA
No explicit regularization, highly overparameterized!
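As a concrete illustration, here is a minimal PyTorch sketch of a DIP-style early-stopped loop; the tiny network, optimizer settings, and stopping iteration are illustrative stand-ins, not the settings of Ulyanov et al.:

```python
# Minimal DIP-style self-supervised denoising loop (sketch; hyperparameters
# and the tiny network are illustrative, not the paper's ~2M-parameter U-net).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
y = torch.rand(1, 3, 128, 128)             # noisy target image y
z = torch.randn(1, 3, 128, 128)            # fixed network input (noise)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for t in range(500):                       # early stopping: do NOT run to
    opt.zero_grad()                        # convergence
    loss = ((net(z) - y) ** 2).sum()       # self-supervised loss ||z(w) - y||_2^2
    loss.backward()
    opt.step()
restored = net(z).detach()                 # take the early-stopped output
```

Running the loop to convergence would drive the loss to zero and simply reproduce the noisy y; stopping early is the only regularization.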
5. Image input
Assuming overparameterization + GD training
• NTK = non-local filter (e.g. non-local means)
• Patch similarity function in closed form
• Training = iterative image denoising (twicing)
• Leading eigenvectors capture ‘clean’ signal
• Efficient low-rank Nyström approximation of the NTK filter $\eta_\Theta$ (see the sketch below)
GD-trained CNN: 800 s vs. closed-form Nyström NTK: 3 s
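A minimal sketch of where that speedup comes from, assuming a placeholder kernel(rows, cols) function that returns the requested sub-matrix of the closed-form NTK patch similarity (the name and signature are hypothetical):

```python
# Nyström low-rank approximation of the n-by-n kernel filter (sketch).
# `kernel(rows, cols)` is a placeholder returning the sub-matrix K[rows][:, cols];
# the paper's closed-form NTK patch similarity would go there.
import numpy as np

def nystrom_filter(y, kernel, m=200, seed=0):
    """Filter a flattened image y of shape (n,) with K ~= C @ pinv(W) @ C.T."""
    n = y.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # m landmark pixels
    C = kernel(np.arange(n), idx)                # (n, m) kernel slice
    W = C[idx]                                   # (m, m) landmark block
    Winv = np.linalg.pinv(W)
    x = C @ (Winv @ (C.T @ y))                   # filtered image
    norm = C @ (Winv @ (C.T @ np.ones(n)))       # row sums of the approx. kernel
    return x / np.maximum(norm, 1e-12)           # normalise filter rows to sum to 1
```

The point of the design is that only the (n, m) slice C and the (m, m) block W are ever formed, so the cost scales with the number of landmarks m rather than with n².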
7. Noise input
NTK theory: the filter does not depend on the image in any way…
• NTK is low-pass [Heckel, 2020]
• DIP smoothing kernels [Cheng, 2019]
Vanilla CNN:
Autoencoder: filter = $1/d$
This cannot be obtained via low-pass filtering!
The DIP does not use GD, but Adam
Not well described by the NTK, as $\sup_t \|w_t - w_0\|_2 = \mathcal{O}(1)$
10. Thanks for your attention!
Tachella.github.io
Codes
Presentations
To appear at CVPR 2021
https://arxiv.org/abs/2006.02379
Editor's Notes
State-of-the-art image denoisers are CNNs.
They seem to require many weights (500k–2M),
lots of training data (so are they susceptible to domain shift?),
and lots of training time.
Is this correct? What exactly do they learn?
That is the big mystery
The CNN has 2M parameters compared to the image's 49k pixels.
That means the cost function typically has a set of zero-error global minima of dimension ~1.95M… lots of solutions,
…all of which just output the original noisy image.
… So how come early stopping of training consistently provides SOTA performance,
… and not just in denoising but in other image processing problems too, such as inpainting?
As a simple example, let's consider a CNN with a single hidden layer.
The r×r convolutions and nonlinearity compute a non-local patch-based affinity matrix with the following kernel.
While this patch-based similarity metric is different to that of, say, NLM, if we normalise the patches we can see that it has a very similar form
(though I have chosen parameters judiciously to maximise the similarity).
Ultimately this means we can directly compute the NTK and do filtering
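To make this concrete, here is a minimal sketch using the classical NLM-style Gaussian patch kernel as a stand-in for the paper's closed-form NTK similarity (patch size r, bandwidth h, and step size eta are illustrative), together with the "training = iterative denoising (twicing)" iteration from the slides:

```python
# Non-local patch-affinity filtering with twicing iterations (sketch).
# The Gaussian patch kernel below is the classical NLM form, used here as a
# stand-in for the closed-form NTK patch similarity of the paper.
import numpy as np

def patch_affinity(img, r=3, h=0.1):
    """Dense affinity between all r x r patches of a (small) grayscale image."""
    H, W = img.shape
    pad = r // 2
    padded = np.pad(img, pad, mode="reflect")
    patches = np.stack([padded[i:i + r, j:j + r].ravel()
                        for i in range(H) for j in range(W)])   # (n, r*r)
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * h ** 2))
    return K / K.sum(axis=1, keepdims=True)      # row-normalised non-local filter

def twicing(y, K, steps=3, eta=0.5):
    """Iterative filtering x_{k+1} = x_k + eta * K (y - x_k): 'training as
    iterative image denoising' in the NTK picture (eta is illustrative)."""
    x = np.zeros_like(y)
    for _ in range(steps):
        x = x + eta * (K @ (y - x))
    return x

y = np.random.rand(32, 32)       # toy noisy image (keep small: O(n^2) kernel)
K = patch_affinity(y)
x = twicing(y.ravel(), K).reshape(32, 32)
```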
If we now look at what the images look like, we see in particular that GD training with noise as input acts as a crude low-pass filter (particularly for the U-net).
In the other cases (noise + Adam, or image + GD/Adam) we have a good estimate and certainly no low-pass filtering: all images preserve detailed structure.
(σ = 25, PSNR = 20.18 dB, CBM3D = 33.03 dB)
Similarly, when we look at the change in weights we see broadly what we predicted…
First, the ℓ2 change in weights for Adam is $\mathcal{O}(1)$, hence not in the NTK regime.
In contrast, for GD the weight change decays roughly as $C^{-1/2}$.
Next, looking at the ℓ∞ norm of the weight change, we see that individually all weights have a change that decays with the number of channels, suggesting that each weight provides a similar small contribution to the solution (in contrast with convolutional sparse coding arguments, where only a few weights contribute significantly).
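A minimal sketch of the diagnostic behind these notes, on a hypothetical toy setup: track $\sup_t \|w_t - w_0\|_2$ and the largest single-weight change during self-supervised training, then swap SGD for Adam (or vary the channel count C) to compare regimes:

```python
# Sketch: diagnosing the lazy/NTK regime by tracking weight movement.
# The toy CNN, image size, and optimizer settings are all illustrative.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 1, 3, padding=1))      # toy CNN, C = 64 channels
y = torch.rand(1, 1, 64, 64)                             # noisy target
z = torch.randn(1, 1, 64, 64)                            # fixed noise input
opt = torch.optim.SGD(net.parameters(), lr=1e-2)         # swap in Adam to compare

w0 = torch.cat([p.detach().flatten().clone() for p in net.parameters()])
sup_l2 = sup_linf = 0.0
for t in range(1000):
    opt.zero_grad()
    ((net(z) - y) ** 2).sum().backward()
    opt.step()
    w = torch.cat([p.detach().flatten() for p in net.parameters()])
    sup_l2 = max(sup_l2, (w - w0).norm().item())          # sup_t ||w_t - w_0||_2
    sup_linf = max(sup_linf, (w - w0).abs().max().item()) # largest per-weight change

# An l2 movement that stays O(1) as width grows (as with Adam in the notes)
# suggests training has left the NTK regime; decay with C matches the GD case.
print(f"sup l2 change = {sup_l2:.3f}, sup linf change = {sup_linf:.3g}")
```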