1. The document discusses the link between CNN denoisers and non-local filters, showing that CNNs can be viewed as exploiting a form of nonlocal filter structure.
2. It demonstrates that the neural tangent kernel (NTK) of a CNN trained with gradient descent plays the role of a filter matrix, and the eigenvalues of the NTK decay rapidly, similar to patch-based nonlocal filters.
3. Experiments on image denoising show that a U-Net, autoencoder, and single-layer CNN trained with Adam can achieve state-of-the-art results, and the NTK depends on the optimizer, suggesting it adapts using nonlocal information about the image residual.
The neural tangent link between CNN denoisers and non-local filters
1. The neural tangent link between CNN denoisers and non-local filters
Julián Tachella, School of Engineering, University of Edinburgh
J. Tachella, J. Tang and M. Davies, arXiv:2006.02379, 2020.
Joint work with J. Tang and M. Davies
2. The deep learning era
CNNs offer state-of-the-art image denoising, e.g.
[Figure: restored image]
K. Zhang, W. Zuo, Y. Chen, D. Meng and L. Zhang, "Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising," IEEE Trans. Image Proc., 2017.
3. What is a CNN?
An $L$-layer vanilla CNN can be defined via the recursion
$$a_i^1 = W_i^1 y, \qquad a_i^\ell = \sum_{j=1}^{c} W_{i,j}^\ell \, \phi(a_j^{\ell-1}), \qquad z = \sum_{j=1}^{c} W_j^L \, \phi(a_j^{L-1})$$
The weights $w$ are adapted over clean training data $u$ to minimize
$$\arg\min_w \|z(w) - u\|_2^2$$
[Figure: restored image]
Learning a very high-dimensional function $z : \mathbb{R}^d \to \mathbb{R}^d$
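To make the recursion concrete, here is a minimal numpy sketch; the dense per-channel matrices standing in for the convolution operators $W_{i,j}^\ell$, the toy sizes and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def cnn_forward(y, weights, phi=relu):
    """Vanilla CNN recursion: a_i^1 = W_i^1 y,
    a_i^l = sum_j W_{i,j}^l phi(a_j^{l-1}),  z = sum_j W_j^L phi(a_j^{L-1}).
    weights = [W^1, ..., W^L] with shapes (c, d, d), (c, c, d, d), ..., (c, d, d);
    dense matrices stand in for the convolution operators."""
    a = np.einsum('idk,k->id', weights[0], y)           # first layer: c channels
    for Wl in weights[1:-1]:
        a = np.einsum('ijdk,jk->id', Wl, phi(a))        # hidden layers: mix channels
    return np.einsum('jdk,jk->d', weights[-1], phi(a))  # sum channels into output z

# Toy instantiation: d = 16 pixels, c = 8 channels, L = 3 layers
rng = np.random.default_rng(0)
d, c = 16, 8
weights = [rng.standard_normal((c, d, d)) / np.sqrt(d),
           rng.standard_normal((c, c, d, d)) / np.sqrt(c * d),
           rng.standard_normal((c, d, d)) / np.sqrt(c * d)]
z = cnn_forward(rng.standard_normal(d), weights)        # z in R^d
```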
4. How do CNNs work?
Do we really know? If not, how can we design novel, better, faster solutions?
Today: what can be learned using a practical network and practical learning algorithms?
5. The Deep Image Prior
Deep image prior [Ulyanov et al., 2018]: denoising without training data; cf. Noise2Self [Batson, 2019]
[Diagram: white noise → CNN (autoencoder architecture) → corrupted target]
Minimize the "self-supervised" loss $\|z(w) - y\|_2^2$
7. The big mystery
Deep image prior [Ulyanov et al., 2018]
Network: 2M parameters; single target image: 49k pixels
"Self-supervised" loss: $\|z(w) - y\|_2^2$
The set of global minima has roughly 1.95M dimensions...
yet early stopping consistently provides SOTA?! [Liu et al., 2020]
8. Rethinking CNN denoising
Classical algorithms (BM3D, NLM) also rely on a single corrupted image,
and they provide results similar to DIP... Is there any link between them?
[Figure: BM3D, 32.8 dB; DIP, 31.5 dB]
9. Patch-based methods
Patch-based methods as global (kernel) filters [Milanfar, 2012, 2014]
Corrupted image $y \in \mathbb{R}^d$
Filter matrix $W = \mathrm{diag}\!\left(\frac{1}{\mathbf{1}^T K}\right) K$ with $K_{i,j} = k(y_i, y_j)$
e.g., non-local means:
$$k_{\mathrm{NLM}}(y_i, y_j) = e^{-\frac{1}{2\sigma^2}\|y_{P_i} - y_{P_j}\|_2^2}$$
where $y_{P_j}$ denotes the patch centred at $y_j$.
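A minimal sketch of this construction, assuming a 1-D signal and length-$r$ patches for brevity (a 2-D image version only changes the patch extraction):

```python
import numpy as np

def nlm_filter_matrix(y, r=5, sigma=0.1):
    """Global NLM filter W = diag(1/(1^T K)) K, K_ij = exp(-||y_Pi - y_Pj||^2 / (2 sigma^2)).
    Since K is symmetric, column and row sums coincide, so this is row normalisation."""
    d = len(y)
    pad = r // 2
    yp = np.pad(y, pad, mode='reflect')
    P = np.stack([yp[i:i + r] for i in range(d)])        # P[i] = patch centred at y_i
    D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # pairwise squared patch distances
    K = np.exp(-D2 / (2 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)

# z = W y averages each pixel over all pixels with a similar surrounding patch
rng = np.random.default_rng(0)
y = np.sign(np.sin(np.linspace(0, 10, 200))) + 0.2 * rng.standard_normal(200)
z = nlm_filter_matrix(y) @ y
```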
10. Patch-based methods
• Standard filtering: $z = Wy$
• We can also iterate the solution (twicing): $z^{t+1} = z^t + W(y - z^t)$
The eigendecomposition $W = V \Sigma V^T$ gives
$$z^t = \sum_{i=1}^{d} \left(1 - (1 - \lambda_i)^t\right)(v_i^T y)\, v_i$$
$$\mathrm{MSE} = \sum_{i=1}^{d} \underbrace{(1 - \lambda_i)^{2t}}_{\text{bias}^2} + \underbrace{\left(1 - (1 - \lambda_i)^t\right)^2}_{\text{variance}}$$
Early-stop for the best trade-off
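Both the iteration and its spectral form are easy to check numerically. A small sketch, assuming a symmetric toy filter $W$ with eigenvalues in $(0, 1)$ so that the eigendecomposition above applies:

```python
import numpy as np

def twice(W, y, steps):
    """Twicing: z^{t+1} = z^t + W (y - z^t), starting from z^0 = 0.
    The early-stopping time `steps` trades bias against variance."""
    z = np.zeros_like(y)
    for _ in range(steps):
        z = z + W @ (y - z)
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
W = A @ A.T / (np.linalg.norm(A @ A.T, 2) * 1.01)   # symmetric, eigenvalues in (0, 1)
y = rng.standard_normal(50)

# Spectral form: z^t = sum_i (1 - (1 - lam_i)^t) (v_i^T y) v_i
lam, V = np.linalg.eigh(W)
t = 10
z_spec = V @ ((1 - (1 - lam) ** t) * (V.T @ y))
assert np.allclose(twice(W, y, t), z_spec)
```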
12. Training dynamics
CNN defined as $z(x, w) : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^d$
Loss $\mathcal{L}(w) = \|z(w) - y\|_2^2$ and GD training $w^{t+1} = w^t - \eta \frac{\partial \mathcal{L}}{\partial w}$
Taylor expansion in terms of the initial weights, $z(w) \approx \bar{z}(w) = z(w_0) + \frac{\partial z}{\partial w}(w - w_0)$, gives
$$z^{k+1} = z^k + \eta \frac{\partial z}{\partial w} \frac{\partial z}{\partial w}^T (y - z^k)$$
The NTK $\eta\Theta \in \mathrm{PSD}_d$ plays the role of the filter matrix $W$ in twicing:
$$z^{k+1} = z^k + W (y - z^k)$$
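For small problems the empirical NTK can be formed directly from the network Jacobian and used exactly like the filter above. A minimal finite-difference sketch with a toy one-hidden-layer network (sizes, names and the network itself are illustrative assumptions):

```python
import numpy as np

def jacobian(z_fn, w, eps=1e-5):
    """Finite-difference Jacobian dz/dw of the network output z(w) in R^d
    with respect to the flattened weights w in R^p."""
    z0 = z_fn(w)
    J = np.zeros((z0.size, w.size))
    for k in range(w.size):
        dw = np.zeros_like(w)
        dw[k] = eps
        J[:, k] = (z_fn(w + dw) - z0) / eps
    return J

def ntk(z_fn, w):
    """Empirical NTK Theta = (dz/dw)(dz/dw)^T: the d x d PSD matrix that plays
    the role of W in the twicing recursion z^{k+1} = z^k + eta*Theta (y - z^k)."""
    J = jacobian(z_fn, w)
    return J @ J.T

# Toy one-hidden-layer ReLU network acting on a fixed input x (d = 8, c = 64)
rng = np.random.default_rng(0)
d, c = 8, 64
x = rng.standard_normal(d)
def z_fn(w):
    W1 = w[:c * d].reshape(c, d)
    W2 = w[c * d:].reshape(d, c)
    return W2 @ np.maximum(W1 @ x, 0.0)

w0 = rng.standard_normal(2 * c * d) / np.sqrt(c)   # variance ~ 1/c initialization
Theta = ntk(z_fn, w0)                              # d x d, PSD up to FD error
```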
13. A closer look at the neural tangent kernel
$$\eta\Theta_{i,j} = \eta \frac{\partial z_i}{\partial w} \frac{\partial z_j}{\partial w}^T = \sum_{\ell=1}^{L} \sum_{k=1}^{c} \eta \, \frac{\partial z_i}{\partial w_k^\ell} \frac{\partial z_j}{\partial w_k^\ell}$$
Assume:
1. Overparameterization: channels of the hidden layers $c \to \infty$
2. Standard iid initialization of the weights with variance $\propto c^{-1}$ (He, LeCun or Glorot initialization)
3. Correct learning rate $\eta \propto c^{-1}$ to avoid divergent dynamics
Then the NTK concentrates around its mean as $\mathcal{O}(c^{-0.5})$
14. Neural Tangent Kernel
1. Each individual weight changes very slightly, even less so in the hidden layers:
$$\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \begin{cases} \mathcal{O}(c^{-1}) & \text{if } \ell = L \text{ (last layer)} \\ \mathcal{O}(c^{-3/2}) & \text{otherwise} \end{cases}$$
2. The total change is vanishingly small, $\sup_t \|w^t - w^0\|_2 = \mathcal{O}(c^{-0.5})$ (hence the Taylor expansion)
3. The preactivations at each layer, $a^\ell \sim \mathcal{N}(0, \Sigma_{a^\ell})$, do not change significantly during training
4. Filters are random
[Network diagram: $x \to a^1 \to a^2 \to z$]
15. Neural Tangent Kernel
NTK theory:
1. The NTK $\eta\Theta$ is fixed throughout training
2. It is fully characterized by the architecture, the random initialization and the input statistics
3. Linear dynamics describe the evolution of the network well [Lee et al., 2019]
4. The NTK can be computed in closed form!
16. CNN kernel
Each CNN block can be associated with an operator $\mathrm{PSD}_d \to \mathrm{PSD}_d$:
1. Input: $W = x x^T$
2. Convolution layer: $[\mathcal{A}(W)]_{i,j} = \sum_{i' \in P_i, \, j' \in P_j} W_{i',j'}$, with patch size = convolution kernel size
3. Upsampling and downsampling: $W' = U W U^T$, where $U$ is the up(down)sampling matrix
4. Non-linearity: $V(W) = \mathbb{E}_{x \sim \mathcal{N}(0, W)}\{\phi(x)\phi(x)^T\}$ and $V'(W) = \mathbb{E}_{x \sim \mathcal{N}(0, W)}\{\phi'(x)\phi'(x)^T\}$
e.g., for ReLUs,
$$[V(W)]_{i,j} = \frac{\sqrt{W_{i,i} W_{j,j}}}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right), \qquad [V'(W)]_{i,j} = 1 - \frac{\varphi}{\pi}, \qquad \varphi = \arccos\!\left(\frac{W_{i,j}}{\sqrt{W_{i,i} W_{j,j}}}\right)$$
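A minimal sketch of the ReLU operator $V$, assuming (as the $1/\pi$ factor suggests) the normalisation $\phi(x) = \sqrt{2}\max(x, 0)$ so that $[V(W)]_{i,i} = W_{i,i}$, with a Monte-Carlo check of the closed form:

```python
import numpy as np

def V_relu(W):
    """Closed-form V(W) and V'(W) for phi(x) = sqrt(2) * max(x, 0):
    [V(W)]_ij = sqrt(W_ii W_jj)/pi * (sin(a) + (pi - a) cos(a)),
    [V'(W)]_ij = 1 - a/pi, with a = arccos(W_ij / sqrt(W_ii W_jj))."""
    s = np.sqrt(np.outer(np.diag(W), np.diag(W)))
    a = np.arccos(np.clip(W / s, -1.0, 1.0))
    V = s / np.pi * (np.sin(a) + (np.pi - a) * np.cos(a))
    Vp = 1.0 - a / np.pi
    return V, Vp

# Monte-Carlo check of V(W) = E_{x~N(0,W)}[phi(x) phi(x)^T] on a random PSD matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
W = B @ B.T
X = rng.multivariate_normal(np.zeros(4), W, size=200_000)
phiX = np.sqrt(2.0) * np.maximum(X, 0.0)
V_mc = phiX.T @ phiX / len(X)
V_cf, _ = V_relu(W)
assert np.allclose(V_mc, V_cf, atol=0.05 * np.abs(V_cf).max())
```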
17. CNN kernel
A simple example: a CNN with one hidden layer and $r \times r$ convolutions, whose NTK is a non-local $r \times r$ patch-based filter:
$$[\eta\Theta]_{i,j} = k_{\mathrm{CNN}}(x_i, x_j) = \frac{\|x_{P_i}\| \, \|x_{P_j}\|}{\pi}\left(\sin\varphi + (\pi - \varphi)\cos\varphi\right)$$
where $\varphi$ is the angle between the patches and $x_{P_i}$ is the patch centred at $x_i$.
[Figure: comparison with the NLM kernel, evaluated for $\|x_{P_i}\| = \|x_{P_j}\| = 1$ and $\sigma = 1$ in the NLM kernel]
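The kernel itself is a few lines of code. A small sketch on flattened patches (illustrative only):

```python
import numpy as np

def k_cnn(xPi, xPj):
    """Slide-17 kernel between two flattened r x r patches:
    k(x_i, x_j) = ||xPi|| ||xPj|| / pi * (sin(a) + (pi - a) cos(a)),
    with a the angle between the patches."""
    ni, nj = np.linalg.norm(xPi), np.linalg.norm(xPj)
    a = np.arccos(np.clip(xPi @ xPj / (ni * nj), -1.0, 1.0))
    return ni * nj / np.pi * (np.sin(a) + (np.pi - a) * np.cos(a))

p = np.ones(9)               # a flat 3 x 3 patch
print(k_cnn(p, p))           # parallel patches: maximal affinity (= ||p||^2 = 9)
print(k_cnn(p, -p))          # anti-parallel patches: zero affinity
```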
18. CNN kernel
CNN NTK with the noisy house image as input:
[Figure: input image, NTK matrix and its eigenvalue spectrum]
The kernel matrix exhibits very fast eigenvalue decay, i.e., few effective degrees of freedom of the linear smoother
19. Nystrom denoising
We can directly compute $\eta\Theta$ to do the filtering!
But $\eta\Theta$ is of size $d \times d$: prohibitive complexity for large images!
• Low-rank Nystrom approximation of $\eta\Theta$ using $m \ll d$ columns [Milanfar 2014]
• Computing correlations with only 1% of the patches is enough!
[Figure: CNN trained with GD, execution time 800 s, vs. Nystrom with 1% of pixels, execution time 3 s. Mystery solved, we can go to the beach now]
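A minimal sketch of the Nystrom-approximated filtering, assuming a hypothetical `kernel_col(j)` helper that returns one column of $\eta\Theta$ (any row normalisation used for NLM-style filters is omitted here):

```python
import numpy as np

def nystrom_filter(y, kernel_col, m, rng):
    """Apply the d x d kernel filter to y through its rank-m Nystrom
    approximation K ~= C K_mm^+ C^T [Milanfar 2014], built from m << d
    sampled columns; the full matrix is never formed."""
    d = len(y)
    idx = rng.choice(d, size=m, replace=False)           # e.g. ~1% of the pixels
    C = np.stack([kernel_col(j) for j in idx], axis=1)   # d x m sampled columns
    K_mm = C[idx, :]                                     # m x m principal block
    return C @ (np.linalg.pinv(K_mm) @ (C.T @ y))

# Illustration with a stand-in Gaussian kernel column on pixel values
rng = np.random.default_rng(0)
y = rng.standard_normal(500)
kernel_col = lambda j: np.exp(-(y - y[j]) ** 2)
z = nystrom_filter(y, kernel_col, m=5, rng=rng)
```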
20. Noise input
The DIP inputs iid noise $x \sim \mathcal{N}(0, 1)$, not the corrupted image!
The resulting filter does not depend on the image in any way...
Even worse, for a vanilla CNN we get
$$[\eta\Theta]_{i,j} = \frac{1}{d} \begin{cases} 1 & \text{if } i = j \\ 0.25 & \text{otherwise} \end{cases}$$
and for a U-Net CNN the NTK is low-pass [Heckel, 2020]
[Figure: DIP smoothing kernels, Cheng 2019]
But the DIP's results cannot be obtained via low-pass filtering! The DIP does not use GD, but Adam
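To see how content-independent the noise-input filter is, the matrix above can be built directly (a small illustrative snippet, not from the paper):

```python
import numpy as np

d = 64
# [eta*Theta]_ij = (1/d) * (1 if i == j else 0.25): identity plus a constant
Theta = (0.75 * np.eye(d) + 0.25 * np.ones((d, d))) / d

y = np.random.default_rng(0).standard_normal(d)
# Theta @ y = (0.75 * y + 0.25 * y.sum()) / d: each pixel is mixed with the
# global sum by the same amount, with no dependence on the image content
assert np.allclose(Theta @ y, (0.75 * y + 0.25 * y.sum()) / d)
```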
21. Adam optimizer
The Adam optimizer belongs to the family of adaptive gradient methods (Adagrad, RMSProp, etc.):
$$w^{t+1} = w^t - \eta H^t \frac{\partial \mathcal{L}}{\partial w}$$
where $H^t$ holds running averages of the squared gradient. With no running averages it reduces to sign gradient descent, $\mathrm{sign}\!\left(\frac{\partial \mathcal{L}}{\partial w}\right)$:
1. The metric depends on the step size [Gunasekar et al., 2018]
2. The hidden layers undergo a larger change than in GD:
$$\sup_t |w_{\ell,i}^t - w_{\ell,i}^0| = \mathcal{O}(c^{-1}) \ \ \forall \ell, \qquad \sup_t \|w^t - w^0\|_2 = \mathcal{O}(1)$$
3. The dynamics are not well described by a Taylor expansion around the initialization
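A minimal sketch of the approximation named above, contrasting a GD step with a sign-GD step (illustrative only):

```python
import numpy as np

def gd_step(w, grad, eta):
    """Plain gradient descent: w^{t+1} = w^t - eta * dL/dw."""
    return w - eta * grad

def sign_gd_step(w, grad, eta):
    """Adam without running averages: w^{t+1} = w^t - eta * sign(dL/dw).
    Every weight moves by exactly eta, so hidden-layer weights change as
    much as last-layer ones, unlike GD in the NTK regime."""
    return w - eta * np.sign(grad)

g = np.array([1e-3, -2.0, 0.5])
print(gd_step(np.zeros(3), g, 0.1))       # steps scale with gradient magnitude
print(sign_gd_step(np.zeros(3), g, 0.1))  # every coordinate moves by 0.1
```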
22. Adam optimizer
Adaptive filtering: the NTK is not fixed throughout training,
$$z^{t+1} = z^t + \eta\Theta^t (y - z^t), \qquad \Theta^t = \frac{\partial z}{\partial w} H^t \frac{\partial z}{\partial w}^T$$
At initialization, the matrix adapts using non-local information about the image residual,
$$\frac{\partial \mathcal{L}}{\partial a^L} = \delta^L = y - z^t,$$
through back-propagation. The pre-activations $a^\ell$ no longer remain constant: they change proportionally to the errors $\delta^\ell$, which adapt through back-propagated nonlocal filters.
[Network diagram: $x \to a^1 \to a^2 \to z$ with back-propagated errors $\delta^L \to \delta^2 \to \delta^1$]
23. Experiments
We evaluate:
• U-Net architecture (8 hidden layers)
• Autoencoder (no skip connections)
• Single-hidden-layer CNN
Case 1 (DIP): [white noise → CNN → corrupted target]
Case 2 (cf. Noise2Self): [input image → CNN → corrupted target]
27. Experiments
Autoencoder, input noise, Adam and GD training vs. the number of channels
[Plot: weight change vs. number of channels, against reference rates $\mathcal{O}(1)$, $\mathcal{O}(c^{-0.5})$, $\mathcal{O}(c^{-1})$ and $\mathcal{O}(c^{-1.5})$]
Hence each weight makes a similar small contribution (in contrast to the convolutional sparse coding model)
29. Conclusions
CNNs can be seen as exploiting some form of nonlocal filter structure; hence CNNs have a very strong bias towards clean images
Effective degrees of freedom ≪ number of parameters in the network
Use Nystrom to avoid training 2M parameters
The optimizer plays a key role
Future work:
• Fast approximations of other CNNs?
• Learn better image models from CNNs?
• Extend to more general imaging inverse problems
• Understand Adam training dynamics (and hence the DIP)
30. Thanks for your attention!
Tachella.github.io: codes, presentations... and more
Editor's Notes
State-of-the-art image denoisers are CNNs. They seem to require many weights (500k–2M), lots of training data (so are they susceptible to domain shift?), and lots of training time. Is this correct? What exactly do they learn?
So how do CNNs work, and what are they learning? There is no lack of descriptions of what a CNN is doing and how it learns: CNNs mimic the brain, convolutional sparse coding, similarity to the human visual system, hierarchically learning deep abstractions, kernel methods... The answer, I think, is that nobody really knows (and to that extent this talk will be no different). I will be looking at the relationship between CNNs and nonlocal filters.
A recent paper by Ulyanov raised a number of questions here. The DIP is essentially a U-Net with very few skip connections, with random noise as the input, trained on a single image that is simultaneously the training and testing image. The loss is simply the self-supervised loss (the L2 error between the output and the noisy image)... so why would we want to do this?
That is the big mystery. The CNN has 2M parameters, compared to the image's 50k pixels. That means the cost function typically has a zero-error set of global minima of roughly 1.95M dimensions: lots of solutions, all of which just output the original noisy image. So how come early stopping of training consistently provides SOTA performance, and not just in denoising but in other image processing problems too, such as inpainting?
So let's try to rethink CNN denoising. Note that BM3D also works on a single image and "learns" the structure from the corrupted image itself, and DIP and BM3D produce similar results/performance. Idea: maybe there's a link.
The key ideas are: patch-based methods can be thought of as applying a global linear operation with the appropriate matrix W, composed of a normalized kernel affinity matrix measuring patch similarity, e.g. as in NLM. This is not really a linear transform, since the filter operation W is itself data dependent.
Standard filtering applies W to the noisy image (averaging over similar patches). Alternatively, we can iterate the estimate, as in twicing (Tukey), which trades bias for variance: the blurry estimate converges towards the noisy target as t → ∞. As with the DIP, the procedure is early-stopped to avoid overfitting the noise. So can we relate this to the DIP estimation?
Digging a bit deeper, we can see that individual weights have negligible change and that the L2 norm of the difference is vanishingly small (hence the Taylor expansion). Similarly, the pre-activations in the net don't change significantly, and the filters are random. It is currently an open question when and how well the NTK model actually describes real DNNs; we will come back to this.
Here we can use a really nice result from a recent paper by Jacot and co-workers.
As a simple example, let's consider a CNN with a single hidden layer. The r×r convolutions and the nonlinearity calculate a non-local patch-based affinity matrix with the kernel above. While this patch-based similarity metric is different from that of, say, NLM, if we normalise the patches we can see that it has a very similar form (though I have chosen the parameters judiciously to maximise the similarity). Ultimately this means we can directly compute the NTK and do the filtering.
The NTK matrix for a simple CNN with the noisy Baboon image as input has the eigenspectrum shown. Although the NTK matrix is prohibitively large, we can exploit the convolutional structure. There is also no need to iterate: we can solve directly using the Nystrom low-rank approximation, equivalent to only computing correlations with 1% of the patches. This denoises without training.
What we have just described is closer to the Noise2Self model, a sort of autoencoder, which can also denoise a single image without additional training data. The noisy image is placed as both the input and the target. In Batson's paper they use a carefully constructed self-supervised loss that avoids learning the identity mapping (which is credited with the success of the method). However, as we have seen, with early stopping the model works even without this.
Our analysis here is very preliminary, so I will only give a brief overview. Adam adapts the gradient using running averages; an analytical approximation (removing the running averages) is signed gradient descent, which is not perfect but highlights what we want to show. With signed GD, individual weights still only undergo small changes, but larger than before (which was order c^{-3/2} in L-infinity and order c^{-0.5} in L2). The L2 change is now large, hence the Taylor expansion is no longer a good approximation and the NTK is NOT constant throughout training. Following a mean-field theory for CNNs, we conjecture that the matrix adapts using nonlocal information about the image residual, calculated through back-propagation of the errors. This results in non-constant pre-activations with covariances that can be calculated recursively, using matrix operators similar to those in the NTK.
So let's examine these ideas in some experiments. We looked at two CNNs, the U-Net as used in the DIP and our vanilla CNN, in two set-ups: the DIP set-up (noise as input) and the Noise2Self set-up (image as input). Note that in each scenario the expressive power of the U-Net is sufficient to achieve zero training error (i.e. we can fully predict the noisy image in each case). We trained with Adam and with GD and evaluated these methods on a standard denoising dataset (Gaussian noise, sigma = 25, PSNR = 20.18).
What we see broadly fits our theory. With the image as input, both CNNs perform reasonable denoising using either GD or Adam; Adam is better for the U-Net. With noise as input, there is a significant difference between GD and Adam training: GD provides poor denoising, while Adam seems able to adapt to the target image data. We also calculated the Nystrom version of the vanilla net and got similar (even slightly better) performance to GD and Adam with the image as input (Nystrom provides additional low-rank regularisation that may explain the improvement over the CNN here).
Looking at the resulting images, we see in particular that GD training with noise as input acts as a crude low-pass filter (particularly for the U-Net). In the other cases (noise with Adam, or image with GD/Adam) we get a good estimate with no low-pass filtering: all the images preserve detail structure (sigma = 25, PSNR = 20.18, CBM3D = 33.03).
In our next experiment we looked at the U-Net with noise input and the difference between Adam and GD training versus the number of channels. Observations: Adam saturates as the number of channels grows, which suggests we are not just seeing finite-width effects. The change in the weights broadly matches our predictions: the L2 change for Adam is order 1, hence not in the NTK regime, whereas for GD the weight change decays roughly as c^{-0.5}. Looking at the L-infinity norm, every individual weight's change decays with the number of channels, suggesting that each weight provides a similar small contribution to the solution (in contrast with convolutional sparse coding arguments, where only a few weights contribute significantly).
Finally, it is instructive to look at the eigenvectors of the last hidden pre-activations. According to NTK theory, GD does not modify the distribution of the pre-activations during training, hence they remain non-informative (low-pass) with noise at the input (left), but they carry non-local information when the image is placed at the input (right). The centre pictures show the eigenvectors of an Adam-trained network with noise at the input; we see information similar to that of GD with an image input.
So what have we learnt? We know that CNNs have a very strong bias towards clean images, and we have argued that this is due to a natural nonlocal model existing in CNNs. We have also seen that the choice of optimiser plays a key role; this is currently a hot topic in the mean-field modelling of NNs. And the effective degrees of freedom of the network are very different from the number of parameters in the net.