1
The neural tangent link between CNN
denoisers and non-local filters
Julián Tachella
School of Engineering
University of Edinburgh
J. Tachella, J. Tang and Mike Davies, arXiv:2006.02379, 2020.
Joint work with J. Tang and M. Davies
The deep learning era
2
CNNs offer state-of-the-art image denoising, e.g.:
[figure: restored image]
K. Zhang, W. Zuo, Y. Chen, D. Meng and L. Zhang, "Beyond a Gaussian Denoiser:
Residual Learning of Deep CNN for Image Denoising," IEEE Trans Image Proc, 2017.
What is a CNN?
3
An $L$-layer vanilla CNN can be defined via the recursion
$a_i^1 = W_i^1 y$
$a_i^\ell = \sum_{j=1}^{c} W_{i,j}^\ell \, \phi(a_j^{\ell-1})$
$z = \sum_{j=1}^{c} W_j^L \, \phi(a_j^{L-1})$
The weights $w$ are adapted over clean training data $u$ to minimize
$\arg\min_w \| z(w) - u \|_2^2$
[figure: restored image]
Learning a very high-dimensional function $z: \mathbb{R}^d \to \mathbb{R}^d$
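As a concrete illustration of this recursion, here is a minimal PyTorch sketch of such a vanilla denoising CNN. It is a hypothetical example, not the network used in the paper; the depth L, width c and kernel size are placeholder choices.

```python
import torch
import torch.nn as nn

class VanillaCNN(nn.Module):
    """L-layer vanilla CNN: a^1 = W^1 y, a^l = W^l phi(a^{l-1}), z = W^L phi(a^{L-1})."""
    def __init__(self, L=5, c=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        convs = [nn.Conv2d(1, c, kernel_size, padding=pad, bias=False)]   # a^1 = W^1 y
        convs += [nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)
                  for _ in range(L - 2)]                                   # hidden layers
        self.hidden = nn.ModuleList(convs)
        self.last = nn.Conv2d(c, 1, kernel_size, padding=pad, bias=False)  # z = W^L phi(a^{L-1})
        self.phi = nn.ReLU()

    def forward(self, y):
        a = self.hidden[0](y)
        for conv in self.hidden[1:]:
            a = conv(self.phi(a))
        return self.last(self.phi(a))

# supervised training on clean targets u: argmin_w ||z(w) - u||_2^2
net = VanillaCNN()
y = torch.randn(1, 1, 64, 64)   # stand-in noisy input
u = torch.randn(1, 1, 64, 64)   # stand-in clean target
loss = ((net(y) - u) ** 2).sum()
loss.backward()
```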
4
How do CNNs work?
Do we really know?
If not, how can we design novel, better,
faster solutions?
Today:
What can be trained using a practical
network and practical learning
algorithms?
5
The Deep Image Prior
Deep image prior [Ulyanov et al., 2018]: denoising without training data (cf. Noise2Self [Batson, 2019])
[diagram: white noise input → CNN (autoencoder architecture) → corrupted target]
Minimize the "self-supervised" loss: $\| z(w) - y \|_2^2$
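A minimal sketch of the DIP training loop just described, assuming a network `net` (e.g. an autoencoder or U-Net) has already been built; the optimizer, learning rate and iteration budget are illustrative, not the original DIP settings.

```python
import torch

def dip_denoise(net, y, n_iters=2000, lr=1e-3):
    """Minimal DIP-style loop: fixed white-noise input, corrupted image y as target."""
    x = torch.randn_like(y)                         # fixed iid noise input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        z = net(x)
        loss = ((z - y) ** 2).sum()                 # self-supervised loss ||z(w) - y||^2
        loss.backward()
        opt.step()
    # early stopping is essential in practice: stop well before the loop fits the noise
    return net(x).detach()
```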
The big mystery
7
Deep image prior [Ulyanov et al., 2018]
Network: 2M parameters
Single target image: 49k pixels
"Self-supervised" loss: $\| z(w) - y \|_2^2$
The set of global minima has roughly 1.95M dimensions…
Yet early-stopping consistently provides SOTA?!
[Liu et al., 2020]
Rethinking CNN Denoising
8
Classical algorithms (BM3D, NLM) also rely on a single corrupted image
They provide similar results to the DIP…
Is there any link between them?
[figure: BM3D (32.8 dB) vs. DIP (31.5 dB)]
Patch-based methods
9
Patch-based methods as global (kernel) filters [Milanfar, 2012, 2014]
Corrupted image $y \in \mathbb{R}^d$
Filter matrix $W = \mathrm{diag}\!\left(\tfrac{1}{1^T K}\right) K$ with $K_{i,j} = k(y_i, y_j)$
e.g., non-local means:
$k_{\mathrm{NLM}}(y_i, y_j) = \exp\!\left(-\tfrac{1}{2\sigma^2}\, \|y_{P_i} - y_{P_j}\|_2^2\right)$
where $y_{P_j}$ denotes the patch centred at $y_j$.
[figure: patches $y_{P_i}$ and $y_{P_j}$ in the image]
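A minimal NumPy sketch of this construction on a 1D signal (my simplification; an image works the same way after vectorization). The `patch_radius` and `sigma` values are illustrative, and since K is symmetric the row normalization below coincides with the slide's diag(1/(1ᵀK))K.

```python
import numpy as np

def nlm_filter_matrix(y, patch_radius=3, sigma=0.1):
    """Global NLM filter W = diag(1/(1^T K)) K with K_ij = exp(-||y_Pi - y_Pj||^2 / (2 sigma^2))."""
    d = len(y)
    pad = np.pad(y, patch_radius, mode="reflect")
    # one patch of length 2*patch_radius+1 centred at every sample
    patches = np.stack([pad[i:i + 2 * patch_radius + 1] for i in range(d)])
    # pairwise squared patch distances (d x d)
    sq_dists = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)   # normalized affinities

y = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)
W = nlm_filter_matrix(y)
z = W @ y   # standard filtering z = W y
```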
10
• Standard filtering: $z = Wy$
• We can also iterate the solution (twicing):
$z_{t+1} = z_t + W(y - z_t)$
Eigendecomposition $W = V \Sigma V^T$ gives
$z_t = \sum_{i=1}^{d} \left(1 - (1 - \lambda_i)^t\right)(v_i^T y)\, v_i$
$\mathrm{MSE} = \sum_{i=1}^{d} \underbrace{(1 - \lambda_i)^{2t}}_{\text{bias}^2} + \underbrace{\left(1 - (1 - \lambda_i)^t\right)^2}_{\text{variance}}$
Early-stop for the best trade-off
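A small sketch of the twicing iteration, reusing a filter matrix such as the NLM `W` built above (an assumption of this example); the number of iterations is the early-stopping knob.

```python
import numpy as np

def twicing(W, y, n_iters=20):
    """Iterated filtering z_{t+1} = z_t + W (y - z_t), starting from z_0 = 0."""
    z = np.zeros_like(y)
    iterates = []
    for _ in range(n_iters):
        z = z + W @ (y - z)
        iterates.append(z.copy())   # keep every iterate so the best stopping time can be picked
    return iterates

# In the eigenbasis W = V diag(lambda) V^T the iterates are
# z_t = sum_i (1 - (1 - lambda_i)^t) (v_i^T y) v_i,
# so directions with large eigenvalues are recovered first and the noisy
# directions (small lambda_i) only for large t, hence the early stopping.
iterates = twicing(W, y)   # W, y from the NLM sketch above
```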
Patch-based methods
11
[figure: eigenvalue spectra of patch-based filter matrices, from Milanfar, 2012]
Best denoising is obtained when the eigenvalue decay is sharp
Training dynamics
12
CNN defined as $z(x, w): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^d$
Loss $\mathcal{L}(w) = \| z(w) - y \|_2^2$ and GD training $w_{t+1} = w_t - \eta \frac{\partial \mathcal{L}}{\partial w}$
Taylor expansion around the initial weights: $z(w) \approx \bar{z}(w) = z(w_0) + \frac{\partial z}{\partial w}(w - w_0)$
$z_{k+1} = z_k + \eta \frac{\partial z}{\partial w} \frac{\partial z}{\partial w}^T (y - z_k)$
The NTK $\eta\Theta \in \mathrm{PSD}_d$ plays the role of the filter matrix $W$ in twicing!
$z_{k+1} = z_k + W(y - z_k)$
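For small examples the kernel ηΘ can be formed explicitly with autograd. A minimal sketch (feasible only for tiny networks and images, since the Jacobian has d × p entries):

```python
import torch

def empirical_ntk(net, x, eta=1.0):
    """Empirical NTK (eta*Theta)_{ij} = eta * <dz_i/dw, dz_j/dw> for a single input x."""
    z = net(x).reshape(-1)
    rows = []
    for i in range(z.numel()):
        grads = torch.autograd.grad(z[i], list(net.parameters()), retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)      # d x p Jacobian dz/dw at the current weights
    return eta * (J @ J.T)     # d x d PSD matrix: the filter of the linearized dynamics

# The linearized GD dynamics then read z_{k+1} = z_k + (eta*Theta)(y - z_k),
# i.e. twicing with W = eta*Theta.
```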
A closer look at the neural tangent kernel
13
$\eta\Theta_{i,j} = \eta \, \frac{\partial z_i}{\partial w} \frac{\partial z_j}{\partial w}^T = \sum_{\ell=1}^{L} \sum_{k=1}^{c} \eta \, \frac{\partial z_i}{\partial w_k^\ell} \frac{\partial z_j}{\partial w_k^\ell}$
Assume:
1. Overparameterization: the number of channels of the hidden layers $c \to \infty$
2. Standard iid initialization of the weights with variance $\propto c^{-1}$ (He, LeCun or Glorot initialization)
3. Correct learning rate $\eta \propto c^{-1}$ to avoid divergent dynamics
Then the kernel concentrates around its mean as $O(c^{-0.5})$.
Neural Tangent Kernel
14
1. Each individual weight changes very slightly, even less in the hidden layers:
$\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \mathcal{O}(c^{-1})$ if $\ell = L$ (last layer), $\mathcal{O}(c^{-3/2})$ otherwise
2. The total change is vanishingly small: $\sup_t \|w^t - w^0\|_2 = \mathcal{O}(c^{-0.5})$ (hence the Taylor expansion)
3. The preactivations at each layer $a^\ell \sim \mathcal{N}(0, \Sigma_{a^\ell})$ do not change significantly during training
4. Filters are random
[diagram: forward pass $x \to a^1 \to a^2 \to \dots \to z$]
Neural Tangent Kernel
15
NTK theory
1. NTK 𝜂Θ is fixed throughout training
2. It is fully characterized by the architecture, random initialization and input statistics
3. Linear dynamics describe well the evolution of the network [Lee et al., 2019]
4. NTK can be computed in closed form!
CNN kernel
16
Each CNN block can be associated with an operator $\mathrm{PSD}_d \to \mathrm{PSD}_d$:
1. Input: $W = xx^T$
2. Convolution layer: $\mathcal{A}(W)_{i,j} = \sum_{i' \in P_i,\, j' \in P_j} W_{i',j'}$ (patch size = convolution kernel size)
3. Upsampling and downsampling: $W' = UWU^T$, where $U$ is the up(down)sampling matrix
4. Non-linearity: $V(W) = \mathbb{E}_{x\sim\mathcal{N}(0,W)}\{\phi(x)\phi(x)^T\}$ and $V'(W) = \mathbb{E}_{x\sim\mathcal{N}(0,W)}\{\phi'(x)\phi'(x)^T\}$
e.g., for ReLUs: $V(W)_{i,j} = \frac{\sqrt{W_{i,i} W_{j,j}}}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right)$, $V'(W)_{i,j} = 1 - \frac{\varphi}{\pi}$, with $\varphi = \arccos\!\left(\frac{W_{i,j}}{\sqrt{W_{i,i}W_{j,j}}}\right)$
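A minimal NumPy sketch of the ReLU maps V and V'. It follows the slide's normalization (which absorbs the 1/2 of the textbook Gaussian arc-cosine expectation into a He-style initialization convention), so treat the constants as an assumption rather than a definitive implementation.

```python
import numpy as np

def relu_dual_maps(W, eps=1e-12):
    """ReLU maps V(W) and V'(W) entering the CNN kernel recursion.

    V'(W)_ij = 1 - phi/pi and
    V(W)_ij = sqrt(W_ii W_jj)/pi * (sin phi + (pi - phi) cos phi)  (slide's convention).
    """
    diag = np.sqrt(np.clip(np.diag(W), eps, None))
    norm = np.outer(diag, diag)
    cos_phi = np.clip(W / norm, -1.0, 1.0)
    phi = np.arccos(cos_phi)
    V = norm / np.pi * (np.sin(phi) + (np.pi - phi) * cos_phi)
    V_prime = 1.0 - phi / np.pi
    return V, V_prime
```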
CNN kernel
17
A simple example: a 1-hidden-layer CNN with $r \times r$ convolutions. Its NTK is a non-local $r \times r$ patch-based filter:
$\eta\Theta_{i,j} = k_{\mathrm{CNN}}(x_i, x_j) = \frac{\|x_{P_i}\| \, \|x_{P_j}\|}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right)$
where $\varphi$ is the angle between the patches, and $x_{P_i}$ is the patch centred at $x_i$.
[figure: $k_{\mathrm{CNN}}$ vs. $k_{\mathrm{NLM}}$, evaluated for $\|x_{P_i}\| = \|x_{P_j}\| = 1$ and $\sigma = 1$ in the NLM kernel]
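A small sketch evaluating this kernel between two patches and comparing it with the NLM kernel under the normalization stated above (unit-norm patches, σ = 1); purely illustrative.

```python
import numpy as np

def k_cnn(p_i, p_j):
    """NTK of a 1-hidden-layer ReLU CNN between two image patches (slide's formula)."""
    ni, nj = np.linalg.norm(p_i), np.linalg.norm(p_j)
    cos_phi = np.clip(p_i @ p_j / (ni * nj), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return ni * nj / np.pi * (np.sin(phi) + (np.pi - phi) * np.cos(phi))

def k_nlm(p_i, p_j, sigma=1.0):
    return np.exp(-np.sum((p_i - p_j) ** 2) / (2 * sigma ** 2))

# For unit-norm patches both kernels are decreasing functions of the angle between them.
p = np.random.randn(9); p /= np.linalg.norm(p)   # 3x3 patch, flattened
q = np.random.randn(9); q /= np.linalg.norm(q)
print(k_cnn(p, q), k_nlm(p, q))
```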
CNN kernel
18
CNN NTK with a noisy house image as input:
[figure: input image and the eigenvalue spectrum of the CNN NTK]
The kernel matrix exhibits very fast eigenvalue decay = few effective degrees of freedom of the linear smoother
Nystrom denoising
19
We can directly compute $\eta\Theta$ to do the filtering!
However, $\eta\Theta$ is of size $d \times d$: prohibitive complexity for large images!
• Low-rank Nystrom approximation of $\eta\Theta$ using $m \ll d$ columns [Milanfar, 2014]
• Computing correlations with only 1% of the patches is enough!
[figure: CNN trained with GD, execution time 800 s, vs. Nystrom with 1% of pixels, execution time 3 s]
Mystery solved, we can go to the beach now
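A hedged NumPy sketch of Nystrom filtering. The `kernel_fn` interface (returning the requested block of ηΘ) is a hypothetical convenience, and the single application of a row-normalized approximation is a simplification of the scheme used in the paper.

```python
import numpy as np

def nystrom_filter(y, kernel_fn, m_frac=0.01, seed=0):
    """Low-rank Nystrom approximation of the d x d kernel using m << d sampled pixels.

    kernel_fn(rows, cols) is a hypothetical helper returning the kernel block
    with the requested row/column indices, so only d x m entries are computed.
    """
    rng = np.random.default_rng(seed)
    d = len(y)
    m = max(1, int(m_frac * d))
    idx = rng.choice(d, size=m, replace=False)      # ~1% of the pixels/patches
    C = kernel_fn(np.arange(d), idx)                # d x m sampled columns
    Wm = C[idx]                                     # m x m block
    Wm_pinv = np.linalg.pinv(Wm)
    # Nystrom: K ~= C Wm^+ C^T; apply the row-normalized approximation to y
    Ky = C @ (Wm_pinv @ (C.T @ y))
    K1 = C @ (Wm_pinv @ (C.T @ np.ones(d)))
    return Ky / K1
```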
Noise input
21
The DIP inputs iid noise $x \sim \mathcal{N}(0, 1)$, not the corrupted image!
⇒ the resulting filter does not depend on the image in any way…
Even worse, for a vanilla CNN we get
$\eta\Theta_{i,j} = \frac{1}{d}\begin{cases} 1 & \text{if } i = j \\ 0.25 & \text{otherwise} \end{cases}$
This cannot be obtained via low-pass filtering!
The DIP does not use GD, but Adam.
[Heckel, 2020]: for a U-Net CNN the NTK is low-pass
[figure: DIP smoothing kernels, from Cheng, 2019]
Adam optimizer
22
The Adam optimizer belongs to the family of adaptive gradient methods (Adagrad, RMSProp, etc.):
$w_{t+1} = w_t - \eta H_t \frac{\partial \mathcal{L}}{\partial w}$, where $H_t$ is built from running averages of the squared gradient.
Without the running averages this reduces to sign gradient descent: $w_{t+1} = w_t - \eta \, \mathrm{sign}\!\left(\frac{\partial \mathcal{L}}{\partial w}\right)$
1. The metric depends on the step size [Gunasekar et al., 2018]
2. Hidden layers have a larger change than in GD: $\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \mathcal{O}(c^{-1})$ for all $\ell$, and $\sup_t \|w^t - w^0\|_2 = \mathcal{O}(1)$
3. Dynamics are not well described by a Taylor expansion around initialization
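A minimal sketch of one sign-gradient-descent step (the running-average-free caricature of Adam used in this analysis); `net`, `x` and `y` are assumed to be defined as in the earlier DIP sketch, and the step size is illustrative.

```python
import torch

def sign_gd_step(net, x, y, lr=1e-4):
    """One step of sign gradient descent on the self-supervised loss ||net(x) - y||^2."""
    loss = ((net(x) - y) ** 2).sum()
    net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= lr * torch.sign(p.grad)   # w <- w - eta * sign(dL/dw)
    return loss.item()
```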
Adam optimizer
23
Adaptive filtering: the NTK is not fixed throughout training
$z_{t+1} = z_t + \eta\Theta_t(y - z_t)$, with $\Theta_t = \frac{\partial z}{\partial w} H_t \frac{\partial z}{\partial w}^T$
At initialization, the matrix adapts using non-local information about the image residual
$\frac{\partial \mathcal{L}}{\partial a^L} = \delta^L = y - z_t$, propagated through back propagation
⇒ the pre-activations $a^\ell$ no longer remain constant: they change proportionally to the $\delta^\ell$, which adapt through back-propagated nonlocal filters
[diagram: forward pass $x \to a^1 \to a^2 \to z$ and backward errors $\delta^L \to \delta^2 \to \delta^1$]
Experiments
24
We evaluate:
• U-Net architecture (8 hidden layers)
• Autoencoder (no skip connections)
• Single-hidden-layer CNN
Case 1 (DIP): white noise input → CNN → corrupted target
Case 2 (cf. Noise2Self): input image → CNN → corrupted target
25
Peak signal-to-noise ratio (PSNR) results on a standard denoising dataset
Experiments
26
Experiments
27
Autoencoder, noise at the input: Adam and GD training vs. number of channels
[plot annotation: overparameterized regime]
Experiments
28
Autoencoder, noise at the input: weight changes under Adam and GD training vs. number of channels
[plot annotations: scaling curves $O(1)$, $O(c^{-0.5})$, $O(c^{-1})$, $O(c^{-1.5})$]
Hence each weight makes a similar small contribution (in contrast to the convolutional sparse coding model)
29
Experiments
Leading eigenvectors of the last hidden preactivations
Conclusions
30
• CNNs can be seen as exploiting some form of nonlocal filter structure
• Hence CNNs have a very strong bias towards clean images
• Effective degrees of freedom ≪ parameters in the network
• Use Nystrom to avoid training 2M parameters
• Optimizer plays a key role
Future work
• Fast approximations of other CNNs?
• Learn better image models from CNNs?
• Extend to more general imaging inverse problems
• Understanding Adam training dynamics (and hence the DIP)
Thanks for your attention!
Tachella.github.io
• Codes
• Presentations
• … and more
31
Editor's Notes
  1. State-of-the-art image denoisers are CNNs. They seem to require many weights (500k–2M), lots of training data (so are they susceptible to domain shift?) and lots of training time. Is this correct? What exactly do they learn?
  2. State-of-the-art image denoisers are CNNs. They seem to require many weights (500k–2M), lots of training data (so are they susceptible to domain shift?) and lots of training time. Is this correct? What exactly do they learn?
  3. So how do CNNs work / what are they learning? – I like this analogy. There is no lack of descriptions of what a CNN is doing and how it learns what it does: CNNs mimic the brain, convolutional sparse coding, similarity to the human visual system, hierarchically learning deep abstractions, kernel methods… The answer, I think, is that nobody really knows. In which case (to that extent this talk will be no different) I will be looking at the relationship between CNNs and nonlocal filters.
  4. Recent paper by Ulyanov raised a number of questions here. DIP – essentially a U-net with very few skip connections with random noise as the input Trained on a single image that is simultaneously the training and testing image. The loss is simply the self-supervised loss (L2 error between the output and the noisy image) … so why would we want to do this???
  5. That is the big mystery. The CNN has 2M parameters compared to the image's 50K pixels. That means the cost function typically has a zero-error global minima set of ~1.95M dimensions… lots of solutions, all of which just output the original noisy image. So how come early stopping of training consistently provides SOTA performance, and not just in denoising but in other image processing problems too, such as inpainting?
  6. That is the big mystery. The CNN has 2M parameters compared to the image's 50K pixels. That means the cost function typically has a zero-error global minima set of ~1.95M dimensions… lots of solutions, all of which just output the original noisy image. So how come early stopping of training consistently provides SOTA performance, and not just in denoising but in other image processing problems too, such as inpainting?
  7. So let's try to rethink CNN denoising. Note: BM3D also works on a single image and "learns" the structure from the corrupted image itself. DIP and BM3D produce similar results/performance. Idea: maybe there's a link.
  8. The key ideas are: Patch based methods can be thought of as applying a global linear operation using the appropriate matrix W Composed of a normalized kernel affinity matrix measuring patch similarity, e.g. for NLM This is not really a linear transform since the filter operation W is itself data dependent. Standard filtering applies W to the noisy image (average over similar patches) Alternatively we can iterate the estimate as in Twicing (Tukey) which trades bias for variance. Blurry estimate converges towards the noisy target as t-> inf As with the DIP, the procedure is early-stopped to avoid overfitting the noise. So can we relate this to the DIP estimation?
  9. Digging a bit deeper we can see that: individual weights have negligible change, and the L2 norm of the difference is vanishingly small (hence the Taylor expansion). Similarly, the pre-activations in the net don't change significantly and the filters are random. It is currently an open question when and how well the NTK model actually describes real DNNs… We will come back to this.
  10. Here we can use a really nice result from a recent paper by Jacot and co-workers.
  11. As a simple example let’s consider a CNN with a single hidden layer. The rxr convolutions and nonlinearity calculate a non-local patch based affinity matrix with the following kernel While this patch based similarity metric is different to that of say NLM, if we normalise the patches we can see that it has a very similar form.. (though I have chosen parameters judiciously to maximise the similarity) Ultimately this means we can directly compute the NTK and do filtering
  12. The NTK matrix for a simple CNN with noisy Baboon as input has the following eigenspectrum
  13. Although NTK matrix is prohibitively large – we can exploit convolutional structure Also no need to iterate: we can solve directly using Nystrom low rank approx. Equiv to only computing correlations with 1% of patches Denoises without training
  14. What we have just described is closer to the Noise2Self model… a sort of autoencoder, which can also denoise a single image without additional training data. The noisy image is placed as both the input and the target. In Batson's paper they use a carefully constructed self-supervised loss that avoids learning the identity mapping (which is credited with the success of the method). However, what we have seen is that with early stopping such models work even without this.
  15. Our analysis here is very preliminary… so I will only give a brief overview. Adam adapts the gradient using running averages… an analytical approximation for this (removing the running averages) is signed gradient descent (not perfect but highlights what we want to show) With signed GD individual weights still only undergo small changes… but larger than before (Order c^{-3/2} for L_infinity, Order c^{-0.5} for L2) But L2 change is large – hence Taylor expansion not good approximation
  16. Thus the NTK is NOT constant throughout training. Following a mean field theory for CNNs we conjecture that the matrix adapts using nonlocal information calculated about the image residual through BP of errors This results in non constant pre-activations with covariances that can be calculated recursively using similar matrix operators as in the NTK
  17. So let’s examine this ideas in some experiments. We looked at two CNNs – the U-net as used in DIP, and our vanilla CNN Here we consider two set ups – the DIP set up (noise as input) and the Noise2Self set up (input image as input) Note that in each scenario the expressive power of the U-net networks is sufficient to achieve zero training error (i.e. we can fully predict the noisy image in each case)
  18. We also looked at training with Adam and with GD and evaluated these methods on a standard denoising dataset (Gaussian noise sigma = 25, PSNR = 20.18) What we see broadly fits our theory: With image as input both CNNs perform reasonable denoising using either GD or Adam; Adam is better for the U-net; With Noise as an input we now see a significant difference between GD and Adam training. GD provides poor denoising, while Adam seems to have been able to adapt to the target image data We also calculated the Nystrom version of the vanilla net and got similar (even slightly better) performance to GD and Adam with image as input (Nystrom provides additional low rank regularisation that may explain the improvement over the CNN here)
  19. If we now look at what the images looked like we in particular see that GD training with noise as input acts as a crude LPF (particularly for the U-net) In the other cases: noise+Adam or image+GD/Adam we have a good estimate and certainly have not experienced LPF – all images preserve detail structure (Sigma = 25, PSNR = 20.18, CBM3D = 33.03)
  20. So in our next experiment we looked at U-net+noise input and the difference between Adam and GD training versus the number of channels Observations: Adam has saturated as the number of channels grows – suggests we are not just seeing finite width effects
  21. Similarly, when we look at the change in weights we see broadly what we have predicted… First, the L2 change in weights for Adam is order 1, hence not in the NTK regime. In contrast, for GD the weight change decays roughly as c^{-0.5}. Next, looking at the l_infinity norm of the weights, we see that individually all weights have a change that decays with the number of channels, suggesting that each weight provides a similar small contribution to the solution (in contrast with convolutional sparse coding arguments where only a few weights contribute significantly).
  22. Finally it is instructive to look at the eigenvectors of the last hidden preactivations According to NTK theory GD does not modify the distribution of the preactivations during training, hence they remain non-informative (low-pass) with noise at the input (left) However, they carry non-local information when the image is placed at the input. (right) The centre pictures show the eigenvectors of Adam trained network with noise at the input and we see similar information to the GD with an image input
  23. So what have we learnt? We know that CNNs have a very strong bias towards clean images. We have argued that this is due to a natural nonlocal model existing in CNNs We have also seen that the choice of optimiser plays a key role – this is currently a hot topic in mean field modelling of NNs The effective degrees of freedom of the network are very different to that of the number of parameters in the net