Robustness in Deep Learning: Single Image Denoising
using Untrained Networks
A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Esha Singh
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Master of Science
Ju Sun
May, 2021
© Esha Singh 2021
ALL RIGHTS RESERVED
Acknowledgements
I would first like to thank my advisor, Professor Ju Sun for providing me with an
opportunity to be a part of his research lab and for his continuous support and guidance.
This thesis work would not have been successful without his able advice, feedback and
teachings.
I would also like to thank Taihui Li for helping me with the experiments and for his
constant support, discussions, analysis, and feedback, which helped me complete my
work and improve the results.
I would also like to thank the members of my thesis committee, Professor Hyun Soo
Park and Professor Gilad Lerman.
Finally, my deep and sincere gratitude to my family and friends for their uncondi-
tional and unparalleled love and support.
Dedication
To my mother and father, friends, and colleagues who have mentored and held me up
along the way.
Abstract
Deep Learning has become one of the cornerstones of today’s AI advancement and
research. Deep Learning models are used for achieving state-of-the-art results on a wide
variety of tasks, including image restoration problems, specifically image denoising.
Despite recent advances in applications of deep neural networks and the presence of a
substantial amount of existing research work in the domain of image denoising, this task
is still an open challenge. In this thesis work, we aim to summarize the study of image
denoising research and its trend over the years, the fallacies, and the brilliance. We
first visit the fundamental concepts of image restoration problems, their definition, and
some common misconceptions. After that, we trace back to where the study
of image denoising began, categorize the work done so far into three main
families, with the main focus on the neural-network family of methods, and discuss some
popular ideas. We also trace the related concepts of over-parameterization,
regularisation, and low-rank minimization, and discuss the recent untrained-networks
approach to single image denoising, which is fundamental to understanding why current
state-of-the-art methods still cannot provide a generalized approach for stable
image recovery under multiple perturbations.
Contents
Acknowledgements i
Dedication ii
Abstract iii
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Image Restoration Problem Formulation . . . . . . . . . . . . . . 7
2.3 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 ResNets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Noise models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.1 MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.2 PSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.3 SSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Image Denoising Algorithms: Review 15
3.1 Spatial domain methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Spatial domain filtering . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Variational denoising methods . . . . . . . . . . . . . . . . . . . 16
3.2 Transform domain methods . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Data adaptive methods . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Non Data adaptive methods . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 Block-matching and 3D filtering: BM3D . . . . . . . . . . . . . . 19
3.3 Deep Neural Network methods . . . . . . . . . . . . . . . . . . . . . . . 20
4 Deep Image Prior 22
4.1 Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Limitations of Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . 28
5 Rethinking Single Image Denoising 30
5.1 Over-parameterisation in deep learning . . . . . . . . . . . . . . . . . . . 30
5.1.1 Overparameterisation v/s over-fitting? . . . . . . . . . . . . . . . 31
5.1.2 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Low-rank matrix recovery problem . . . . . . . . . . . . . . . . . . . . . 32
5.3 Rethinking Single Image denoising: Main Ideas . . . . . . . . . . . . . . 33
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates . 34
5.3.2 Implicit Rank-Minimizing Autoencoder . . . . . . . . . . . . . . 36
5.4 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Preliminary Experiments 39
6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3 Hyper-parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.4 Results and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Conclusion and Discussion 52
References 54
Appendix A. Glossary and Acronyms 68
List of Tables
A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Figures
1.1 Performance of existing medical image denoising methods in removing
image noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c) Result
obtained by DnCNN [2]. Source: (https://www.kaggle.com/mateuszbuda/
lgg-mri-segmentation). . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with
standard deviation of 25. Left to right: (a) noisy image, (b) denoised
image using an unsupervised cleaning. Source: Shamshad et al. [4]. . . . 4
2.1 ANN inspired by biological neural networks. The inputs are denoted
by x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon
terminal ends. Source: Wikipedia[5]. . . . . . . . . . . . . . . . . . . . . 6
2.2 Inverse problem. Source: caltech GE193[6] . . . . . . . . . . . . . . . . . 7
2.3 Residual learning: a building block. (Source: He et al. [7]) . . . . . . . . 8
2.4 An example of an autoencoder. The input image is encoded to a com-
pressed representation and then decoded. (Source: Bank et al. [8]) . . . 11
3.1 Scheme of the BM3D algorithm. (credits: Marc Lebrun [9]) . . . . . . . . . . 19
4.1 Image space visualization for DIP. Assume the problem of reconstructing
an image xgt from a degraded measurement x0, exemplified here by denoising;
the ground truth xgt has non-zero cost E(xgt, x0) > 0. If run for long
enough, fitting with DIP will acquire a solution with near-zero cost quite
distant from xgt. However, the optimization path often passes close to xgt,
and early stopping (here at time t3) will recover a good solution.
Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Figure depicting image restoration process using DIP. Starting from a
random weight θ0, one must iteratively update them in order to minimize
the data term eq. (4.3). At every iteration t the weights θ are mapped
to an image x = fθ(z), where z is a fixed tensor and the mapping f is a
neural network with parameters θ. The image x is used to calculate the
task-dependent loss E(x, x0). The loss gradient w.r.t. the weights θ is
then calculated and used to update the parameters. Source: Ulyanov et
al. [10] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Architecture used in [10] and also the base architecture for You et al. [11].
The hourglass (also known as encoder-decoder) architecture. It sometimes
has skip connections represented in yellow. nu[i], nd[i], ns[i] correspond
to the number of filters at depth i for the upsampling, downsampling,
and skip-connections respectively. The values ku[i], kd[i], ks[i] correspond
to the respective kernel sizes. Source: Ulyanov et al. [10] . . . . . . . . . 36
6.1 From top left to bottom right: (a) The images in top row show ground
truth image Lena and (b) its noisy counterpart using 60% corruption
level for salt and pepper noise. The bottom row images show (c) Real
image same as (a), (d) noisy image same as (b), and (e) the reconstructed
image using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 From top to bottom: (a) The image in top shows PSNR plot for cor-
responding Figure 6.1, (b) loss plot (L1 loss) for reconstruction process
using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 80% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using our approach. The best PSNR achieved is 21.8671.
As we can see for higher corruptions we get poorer performance. . . . . 44
6.4 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using [10] DIP-l1 approach. The best PSNR achieved is
28.548 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 From top left to bottom right: (a) The images in top row show ground
truth image F16-GT and (b) its noisy counterpart using 50% corruption
level for salt and pepper noise. The bottom row image left shows (c)
original image same as (a), (d) shows noisy image same as (b) and, (e) is
the reconstructed image using our approach. The best PSNR achieved is
29.2449 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.6 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using You et al. [11] approach (width = 128). The best
PSNR achieved is 28.9 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.7 From top left to bottom: (a) The images in top row show ground truth im-
age F16-GT and (b) its noisy counterpart using 50% corruption level for
salt and pepper noise. The bottom row image shows (c) the reconstructed
image using You et al. [11] approach (width = 128) and ADAM optimiser
(unless mentioned otherwise, the optimizer is SGD). The reconstructed image becomes
noisier as the training continues because the network starts learning noise
in absence of early termination. . . . . . . . . . . . . . . . . . . . . . . . 48
6.8 (a) The image shows PSNR plot for corresponding Figure in 6.7. It clearly
demonstrates the need for early termination. The best PSNR achieved
is 28.14 dB before the dip. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.9 (a) The image shows PSNR plots for different corruption levels for each
of the three methods discussed in last chapter. The line plots show best
PSNR levels across models for different corruption rates. The plot sup-
ports our claim in observations that with increasing corruption rate the
best PSNR level reached by all three methods decreases. Also, our
method is the most consistent and stable in performance amongst all
three methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.10 The plot depicts the effect of number of linear layers for our method when
the number goes from 3 to 9. For three independent trials, the average
performance is depicted via the dashed line and we see that the number
of epochs to reach the highest PSNR value decreases when we increase
the number of layers from 3 to 6. From 6 to 9 layers we see a slight
increase. The corruption level for all three trials is 50%. . . . . . . . . . 51
Chapter 1
Introduction
Artificial Intelligence (AI) is intelligence demonstrated by machines, in contrast to the
natural intelligence exhibited by humans, which involves consciousness and emotionality.
For two decades, AI has been at the helm of revolutionary changes in industrial and
academic research and development. The past decade, and notably the past few years,
has been transformative for AI, not so much in terms of what we can do with this
technology (the theory) as what we are doing with it (the practice) [12].
The main ideology behind AI has been the faithful emulation of human intelligence
and an attempt to give machines coherent reasoning. Can machines gain common sense?
There are two fundamental paradigms for this problem. One is to provide a compre-
hensive set of facts and rules encoding human knowledge (an undertaking of the Cyc
Project since 1984 [13]). The other is to facilitate the self-learning process of ma-
chines, similar to how humans develop common sense. The latter approach has shown
perception of the world around them. With the abundance of work that exists in this
sphere, it is not difficult to experience the power as well as various limitations of this
technology. The lack of generality, data bias, fairness, and robustness to unforeseen
situations are some of the well-known challenges in this field. The aim of this thesis
work is to focus on one such challenge: robust image recovery. It is a relevant
issue that remains an open problem and is omnipresent in real-life situations. Self-driving
vehicles, digital photography, medical image analysis, remote sensing, surveillance, and
digital entertainment are a few of the applications where due to unprecedented suscep-
tibilities, existing solutions might not perform as expected. Robustness against natural
corruptions or robustness in medical problems are some of the non-trivial open chal-
lenges in the sphere of robustness in deep learning, and to tackle such big issues it is
advantageous to break them into smaller sub-tasks. Thus, a small step towards handling
those situations is to solve the classical yet active problem of robust image recovery un-
der synthetic noise models. If one can solve this problem reliably, it can give us insight
as to how to work our way towards the more significant hurdles.
One of the rudimentary challenges in the field of image processing and computer
vision is image denoising, where the underlying goal is to approximate the actual image
by suppressing noise from a noise-contaminated version of the image. Image noise may
be caused by several intrinsic (e.g., sensor) and extrinsic (e.g., environment) conditions
which are often not possible to avoid in practical situations. Image denoising is a funda-
mental yet active problem and still remains unsolved because noise removal introduces
artifacts and unwanted effects such as blurring of the images.
The focus of this thesis work is to summarize the fundamental concepts behind image
denoising, survey the existing work in the field, provide a qualitative analysis of
state-of-the-art methods for this task, and finally present a probable approach with
supporting arguments and the preliminary experiments undertaken during the course of
this research work.
1.1 Application
Digital images play an essential role in daily-life applications such as satellite
television, medical imaging, and computed tomography, as well as in areas of
research and technology such as geographical information systems and astronomy. So it
is not difficult to gauge the importance of recovering precise images. It is the first and
vital step before images can be analyzed or used further. Thus, image denoising plays a
vital role in a wide range of applications such as image restoration, image registration,
visual tracking, image segmentation, and image classification, where obtaining the orig-
inal image content is crucial for strong performance. It is important to develop effective
denoising techniques in order to compensate for the data corruption introduced
when data is collected by imperfect instruments, which are generally contaminated by
noise, issues with the data acquisition process [14], and interceding natural phenomena
[15].
An important practical application of image denoising is in the medical sciences. Medical
images obtained from MRI are among the most common tools for diagnosis in medicine and
are often affected by random noise arising in the image acquisition process. Hence,
noise removal is essential in medical imaging applications in order to enhance and recover
fine-grained details that may be hidden in the data.
Medical imaging modalities, including X-rays, Magnetic Resonance Imaging (MRI), Computed
Tomography (CT), and ultrasound, are susceptible to noise for the reasons discussed
above. Hence, it is important to recover the original, high-quality, noiseless images.
Image denoising in the field of medicine is referred to as medical image denoising (MID):
the process of improving the perceptual quality of degraded noisy images captured with
specialized medical image acquisition devices [16]. Figure 1.1 is an example of how
existing MID methods exhibit deficiencies in large-scale noise removal from medical
images and fail in numerous cases [16].
Another use case for medical imaging applications is with respect to X-ray images.
X-ray images provide crucial support for diagnosis and decision-making in several diverse
clinical applications. However, X-ray images may be corrupted by statistical noise, thus
gravely deteriorating the quality and raising the difficulty of diagnosis [17][4]. Therefore,
X-ray denoising is mandatory for improving the quality of raw X-ray images and their
relevant clinical information content and analysis.
Figure 1.1: Performance of existing medical image denoising methods in removing image
noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c) Result obtained by DnCNN
[2]. Source: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation.
Figure 1.2: Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with
standard deviation of 25. Left to right: (a) noisy image, (b) denoised image using an
unsupervised cleaning approach. Source: Shamshad et al. [4].
1.2 Thesis Overview
The rest of the thesis is organized as follows:
• Chapter 2 briefly presents the basic concepts and terminologies used in image
restoration studies which are used throughout the thesis.
• Chapter 3 presents a comprehensive survey of denoising algorithms developed till
2017.
• Chapter 4 describes the Deep Image Prior concept, its limitations, and related
work developed over the DIP ideology.
• Chapter 5 presents an alternative perspective towards image denoising with the
help of two important ideas which are discussed in detail. Finally, the proposed
methodology is introduced.
• Chapter 6 hashes out the experimental setup and analysis for the proposed single
image denoising methodology.
• Chapter 7 presents the conclusion and discusses some future work directions.
Chapter 2
Background
In this chapter, we briefly summarize the fundamental concepts and definitions pivotal
to understanding the rest of the thesis, which we will revisit frequently in later
chapters.
2.1 Deep Learning
Deep Learning is a sub-domain of machine learning concerned with algorithms that aim
to imitate the structure and functionality of the human brain, known as artificial
neural networks (ANNs), combined with representation learning [18]. More specifically,
a neural network is inspired by the biological neuron: in machine learning, it is an
information processing technique that uses the same concept as biological neural
networks without being identical to them [19]; the analogy is shown in Figure 2.1.
There are multiple deep learning architectures, such as deep neural networks (DNNs),
recurrent neural networks (RNNs), and convolutional neural networks (CNNs), that have
been applied to various fields, including computer vision, machine vision, speech
recognition, natural language processing, audio recognition, social network filtering,
machine translation, and bioinformatics, where they have produced remarkable results
comparable to or surpassing human expert performance [18]. The "learning" in "deep
learning" can be supervised, unsupervised, or semi-supervised, whereas the term "deep"
refers to the number of layers through which the input is transformed.
Figure 2.1: ANN inspired by biological neural networks. The inputs are denoted by
x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon terminal ends. Source:
Wikipedia[5].
2.2 Inverse Problems
An inverse problem is the process of calculating, from a set of observations, the causal
factors that produced them: for example, calculating the density of the Earth from
measurements of its gravity field [20]. If the information given by a measurement is
incomplete, incorrect, or improper, the problem is ill-posed [21]. The study of inverse
problems tries to quantify when and to what degree a problem is ill-posed, and to
extract maximum information under practical circumstances. Inverse problems occur in
many applications, such as image denoising, image deblurring, inpainting, and
super-resolution [22]. Figure 2.2 depicts a forward problem with respect to an inverse
problem.
2.2.1 Image Denoising
Image denoising refers to the removal of noise from a noisy image, so as to restore the
true image. The aim is to recover meaningful information from noisy images in the
process of noise removal and obtain high-quality images, which is an important open
research problem. The primary reason is that, from a mathematical perspective, image
denoising is an inverse problem and its solution is not unique.
Note that image restoration and image denoising are different terminologies: image
Figure 2.2: Inverse problem. Source: caltech GE193[6]
denoising is a type of image restoration problem. There are several types of image
restoration problems, for example super-resolution and inpainting, and image denoising
is one of them. Therefore, algorithms that solve image restoration problems are also
applicable to image denoising. The work in this thesis is centered around image
denoising.
2.2.2 Image Restoration Problem Formulation
The problem of image restoration can be traditionally formulated as [23]

y = Hx + c    (2.1)

where x ∈ R^n represents the unknown original image, y ∈ R^m represents the
observations, H is an m × n degradation matrix, and c ∈ R^m is a vector of i.i.d.
(independent and identically distributed) Gaussian random variables with zero mean and
standard deviation σ_c. Equation 2.1 can thus represent different image restoration
problems: image denoising when H is the n × n identity matrix I_n, image inpainting
when H is a selection of m rows of I_n, and image deblurring when H is a blurring
operator [23].
2.3 Deep Neural Networks
A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers
between the input and output layers [24][25][18]. DNNs, which employ deep architec-
tures, can represent functions of higher complexity as the number of units per layer
and the number of layers increase [26]. Given enough labeled training data and
suitable models, deep learning approaches can learn the desired mapping functions.
In this work, which pivots around understanding the theoretical foundations of one
such deep learning network, we will make use of CNNs and ResNets; the latter is
detailed in the section below.
2.3.1 ResNets
ResNet, short for Residual Network, is a specific type of neural network introduced
by He et al. in 2015 [7]. They introduced a residual learning framework to ease the
training of networks substantially deeper than those used previously, explicitly
reformulating the layers as learning residual functions with respect to the layer
inputs instead of learning unreferenced functions [7].
Figure 2.3: Residual learning: a building block. (Source: He et al. [7])
The popularity of ResNets stems from the fact that they solved a big open challenge.
When deeper networks start converging, a degradation problem is exposed: as network
depth increases, accuracy saturates and then drops rapidly [7]. Surprisingly, such
degradation is not caused by overfitting: adding more layers to an appropriately deep
model leads to higher training error, as reported in
[27][28]. In 2015, He et al. [7] empirically showed that there is a maximum threshold
for depth with the traditional CNN model.
Hence, [7] solved the degradation problem by introducing a deep residual learning
framework whose basic unit is the residual block depicted in Figure 2.3. Instead of
assuming that each set of a few stacked layers directly fits a desired underlying
mapping, they explicitly let these layers fit a residual mapping. Practically, this
idea is realised by feed-forward neural networks with connections that skip one or
more layers [7][29][30], called "shortcut connections". These shortcut connections
simply perform identity mapping, and their outputs are added to the outputs of the
stacked layers (Figure 2.3). One of the biggest advantages of this approach is that
these identity shortcut connections add neither extra parameters nor computational
complexity. Thus, ResNets are heavily used in a variety of modern tasks.
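As a toy illustration of the residual block in Figure 2.3 (not code from the thesis), the sketch below adds the identity shortcut to the output of two stacked layers in NumPy; the particular form F(x) = W2 relu(W1 x) and the dimensions are illustrative assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, W1, W2):
    """Two-layer residual block: output = relu(F(x) + x), where
    F(x) = W2 @ relu(W1 @ x) and the identity shortcut adds no
    extra parameters (Figure 2.3)."""
    return relu(W2 @ relu(W1 @ x) + x)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1

out = residual_block(x, W1, W2)
assert out.shape == x.shape       # identity shortcut requires matching dimensions

# With zero weights, F(x) = 0 and the block reduces to relu(x):
zero = np.zeros((d, d))
assert np.allclose(residual_block(x, zero, zero), relu(x))
```

The last check illustrates why residual learning eases optimization: driving the residual branch toward zero recovers (near-)identity behaviour, which plain stacked layers struggle to learn.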
2.3.2 GANs
Ian J. Goodfellow et al., 2014 [31] proposed a new generative model estimation
procedure, the adversarial nets framework, where the generative model is pitted against
an adversary: a discriminative model that learns to ascertain whether a sample came
from the model distribution or the data distribution. This is analogous to a team of
counterfeiters trying to produce fake items and police trying to detect them. The
competition in this game drives both teams to improve their methods until the
counterfeits are indistinguishable from the genuine articles. Thus, two neural
networks compete with each other (in the form of a zero-sum game, where one network's
gain is another network's loss) [32], and this model architecture is called a
Generative Adversarial Network (GAN). More formally, the GAN architecture involves two
sub-models: a generator model, used to generate new plausible examples from the
problem domain, and a discriminator model, used to classify examples as real (from the
domain) or fake (produced by the generator) [33].
Although originally proposed as a form of generative model for unsupervised learn-
ing, GANs have also proven effective for semi-supervised learning [34], fully supervised
learning [35], and reinforcement learning [36]. A more standardized approach for the GAN
framework called Deep Convolutional Generative Adversarial Networks, or DCGAN,
that led to more stable models was later formalized by Alec Radford, et al. [37] in 2015.
2.4 Autoencoder
An autoencoder is a special type of neural network (NN), mainly designed to encode
the input into a compressed, meaningful representation and then decode it back such
that the reconstructed input is as similar as possible to the original one [8].
Autoencoders are an unsupervised learning technique in which neural networks are
leveraged for representation learning. Specifically, one designs a network
architecture that imposes a bottleneck, forcing a compressed knowledge representation
of the original input. If the input features were each independent of one another,
this compression and subsequent reconstruction would be a very difficult task;
however, if some structure exists in the data (i.e., correlations between input
features), it can be learned and leveraged when forcing the input through the
bottleneck [38]. Autoencoders were first introduced in the 1980s by Hinton and the
PDP group [39] as a NN trained to reconstruct its input. Mathematically, their main
task of learning an "informative" representation of the data can be formally defined
[8][40] as learning functions A : R^n → R^p and B : R^p → R^n that satisfy:
arg min_{A,B} E[∆(x, B ◦ A(x))]    (2.2)
where E is the expectation over the distribution of x, and ∆ is the reconstruction loss
function, that measures the distance between the output of the decoder and the input.
The loss function is usually set to be the l2-norm [40].
Usually, A and B are neural networks [41]. In the special case where A and B are
linear operations, the model is called a linear autoencoder [42][40]; a linear
autoencoder attains the same latent representation as Principal Component Analysis
(PCA) [43]. Therefore, an autoencoder is a generalization of PCA: instead of finding
a low-dimensional hyperplane in which the data lies, it is able to learn a non-linear
manifold [44]. Thus, while conceptually simple, autoencoders are quite popular and
play an important role in machine
learning. Autoencoders can be trained gradually layer by layer or end-to-end. In the
layer-by-layer case, the trained layers are "stacked" together, leading to a deeper
encoder. In [45], this is done with convolutional autoencoders, and in [46] with
denoising autoencoders. We will revisit autoencoders in chapters 5 and 6.
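The objective in Equation 2.2 and the linear-autoencoder/PCA connection can be sketched numerically. The following NumPy snippet (illustrative only, not thesis code) evaluates the empirical ℓ2 reconstruction loss for a linear encoder A and decoder B, and checks that the top-p principal directions reconstruct data lying in a rank-p subspace with essentially zero loss; the data dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data x in R^n drawn exactly from a rank-p linear subspace.
n, p, N = 20, 3, 500
basis = rng.standard_normal((n, p))
X = (basis @ rng.standard_normal((p, N))).T       # N samples, shape (N, n)

def recon_loss(A, B, X):
    """Empirical version of Eq. (2.2) with the l2 loss: mean squared
    distance between x and B(A(x)), for linear A (p x n) and B (n x p)."""
    return np.mean(np.sum((X - X @ A.T @ B.T) ** 2, axis=1))

# The optimal linear encoder/decoder pair is given by the top-p
# principal directions of the data (PCA).
U, S, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
A_opt = Vt[:p]       # encoder: project onto the top-p components
B_opt = Vt[:p].T     # decoder: map the p-dimensional code back to R^n

assert recon_loss(A_opt, B_opt, X) < 1e-9   # rank-p data reconstructs exactly
```

A non-linear autoencoder replaces A and B with neural networks, which is what allows it to learn a curved manifold rather than the flat subspace PCA recovers.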
Figure 2.4: An example of an autoencoder. The input image is encoded to a compressed
representation and then decoded. (Source: Bank et al. [8])
2.5 Noise
Image noise is a random variation of brightness or color information in images, and is
generally an aspect of electronic noise [47]. It introduces unwanted information into
digital images and obscures the desired information. Noise produces undesirable effects
such as artifacts, unrealistic edges, unseen lines, corners, and blurred objects. There
are multiple sources of noise in images, arising from aspects such as image acquisition,
transmission, and compression [48].
2.5.1 Noise models
There are different types of noise models; we mention four popular ones here: Gaussian,
salt-and-pepper, shot, and speckle noise. Different processing algorithms exist for
different noise models. For any input image, we model the noisy image under additive
noise as

g(x) = I(x) + v(x)    (2.3)

where I(x) is the original image without any noise, v(x) is the additive noise, g(x)
is the input image with noise, and x is the set of pixels in the input image.
1. Gaussian Noise: Gaussian noise generally happens in the analog signal in the
electronics of the camera. It can be modeled as additive noise and acts on the
input image I to produce a degraded image y :
y = I + ση,    η ∼ N(0, 1)    (2.4)
where σ is standard deviation [49][50][51]. Example of denoising algorithm for this
type of noise: Gaussian filtering.
2. Salt and pepper Noise: this impulse noise corresponds to random pixels which are
either saturated or turned off. It can happen in equipment with electronic spikes,
and we can model this as:
y = { I with probability p ; b with probability 1 − p }    (2.5)
where b ∼ Ber(0.5) is a Bernoulli variable of parameter 0.5. Algorithms used for
image recovery from this type of noise - median filtering, mean filtering [52][47].
3. Shot noise: or the photon shot noise is the dominant noise in the brighter parts
of an image from an image sensor and is typically caused by statistical quantum
fluctuations, i.e., disparity in the number of photons observed at a given exposure
level [47]. The root-mean-square value of shot noise is proportional to the square
root of the image intensity, and the noises at different pixels are not related to one
another. This noise model follows a Poisson distribution, which, except at very low
intensity levels, approximates a Gaussian distribution.
4. Speckle noise: is a granular noise that exists inherently in an image and corrupts
its quality. This noise can be generated by multiplying random pixel values with
different pixels of an image [48]. A fundamental challenge in optical and digital
holography is the presence of speckle noise in the image reconstruction process.
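The four noise models above can be simulated in a few lines of NumPy. This is an illustrative sketch only; the scale parameters and the toy image are arbitrary choices, not values used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=0.1):
    # y = I + sigma * eta, eta ~ N(0, 1)  (equation 2.4)
    return img + sigma * rng.standard_normal(img.shape)

def add_salt_pepper_noise(img, p=0.9):
    # Keep each pixel with probability p; otherwise replace it with a
    # Bernoulli(0.5) value b in {0, 1}  (equation 2.5).
    keep = rng.random(img.shape) < p
    b = rng.integers(0, 2, img.shape).astype(img.dtype)
    return np.where(keep, img, b)

def add_shot_noise(img, peak=255.0):
    # Photon shot noise: each pixel is a Poisson draw whose mean is the
    # scaled clean intensity, so the noise std grows like sqrt(intensity).
    return rng.poisson(img * peak) / peak

def add_speckle_noise(img, sigma=0.1):
    # Multiplicative speckle: pixels are scaled by random factors.
    return img * (1.0 + sigma * rng.standard_normal(img.shape))

clean = rng.random((64, 64))   # toy "image" with values in [0, 1)
noisy = add_gaussian_noise(clean, sigma=0.1)
```

Note how the additive models (Gaussian) fit equation 2.3 directly, while shot and speckle noise are signal-dependent and cannot be written as I(x) + v(x) with v independent of I.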
2.6 Evaluation metrics
The aim of a denoising algorithm is to recover the original image as much as possible from
its noise-corrupted version. To evaluate denoising algorithms, different image quality
assessment measurements have been adopted to compare the denoised estimation and
ground truth high-quality images. Below, three representative quantitative
measurements are discussed, amongst which PSNR is the most commonly used metric.
2.6.1 MSE
Mean Squared Error (of a process for estimating an unobserved quantity) of an estimator
measures the average squared difference between the actual and estimated values. MSE
is equivalent to the expected value of the squared error loss and signifies the quality of
an estimator. It is always non-negative, and values closer to zero are better.
For a given noise-free m × n monochrome image I and its noisy approximation say
K, mathematically MSE can be defined as [53]:

MSE = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]²    (2.6)
2.6.2 PSNR
Peak signal-to-noise ratio (PSNR) is a term that signifies the ratio between the max-
imum power of a signal and the power of contaminating noise that affects the fidelity
of its representation [53]. PSNR is defined via the mean squared error (MSE). Given
the ground truth image I and denoised estimation K, based on MSE, the definition of
PSNR is:
PSNR = 10 log10 (MAX_I² / MSE)    (2.7)

In the above equation, MAX_I is the maximum possible pixel value of the image.
This value is 255 when the pixels are represented using 8 bits per sample.
While both MSE and PSNR are well accepted and are heavily used in several applications,
they do not correlate well with the visual perception of the human visual system,
which is highly non-linear and complex [54][55][56]. Thus, they are not
a good fit to measure the perceptual similarity between two images. Yet, PSNR is still
the most commonly used index to compare two images.
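Equations 2.6 and 2.7 translate directly into NumPy. A minimal sketch, assuming floating-point images with a known peak value MAX_I:

```python
import numpy as np

def mse(ref, est):
    # Mean squared error over all pixels (equation 2.6).
    return np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)

def psnr(ref, est, max_val=255.0):
    # Peak signal-to-noise ratio in dB (equation 2.7); max_val is 255
    # when pixels are represented with 8 bits per sample.
    err = mse(ref, est)
    if err == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / err)

a = np.full((4, 4), 100.0)
b = a + 10.0            # constant error of 10 per pixel -> MSE = 100
print(mse(a, b))        # 100.0
print(psnr(a, b))       # 10 * log10(255^2 / 100), about 28.13 dB
```

Because PSNR is a log of a ratio, halving the MSE raises the PSNR by roughly 3 dB regardless of the image content.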
2.6.3 SSIM
Besides the MSE and PSNR, perceptual quality measurements have also been proposed
to evaluate denoising algorithms. One of the representative measurements is the struc-
tural similarity (SSIM) index [57]. The SSIM is a procedure for estimating the perceived
quality of digital television and cinematic pictures, as well as other kinds of digital im-
ages and videos. This metric is used for measuring the similarity between two images
[58].
The SSIM index can be calculated on various windows of an image. The measure
between two windows x and y of common size N × N is [58]:
SSIM(x, y) = [(2µxµy + c1)(2σxy + c2)] / [(µx² + µy² + c1)(σx² + σy² + c2)]    (2.8)

where µx and µy are the averages of x and y, σx² and σy² are the variances of x and y,
σxy is the co-variance of x and y, and c1 = (k1L)², c2 = (k2L)² are two variables to
stabilize the division with a weak denominator. L is the dynamic range of the pixel
values, with k1 = 0.01 and k2 = 0.03 by default [57][58].
The above SSIM formula is based on three comparison measurements between the
samples x and y: luminance (l), contrast (c), and structure (s).
l(x, y) = (2µxµy + c1) / (µx² + µy² + c1)    (2.9)

c(x, y) = (2σxσy + c2) / (σx² + σy² + c2)    (2.10)

s(x, y) = (σxy + c3) / (σxσy + c3)    (2.11)
where c3 = c2/2. Thus, using the above three definitions, equation 2.8 can be rewritten
as [58]:

SSIM(x, y) = [l(x, y)^α · c(x, y)^β · s(x, y)^γ]    (2.12)

with weights α = β = γ = 1 to obtain the same form as equation 2.8.
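A single-window version of equation 2.8 can be sketched in NumPy. Note that practical implementations (e.g., scikit-image's) average the index over many local windows; this global variant treats the whole image as one window and is only illustrative:

```python
import numpy as np

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    # Single-window SSIM (equation 2.8): the whole image is treated as
    # one window; L is the dynamic range of the pixel values.
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

img = np.arange(64, dtype=np.float64).reshape(8, 8)
print(ssim_global(img, img))          # 1.0 for identical images
print(ssim_global(img, img + 50.0))   # below 1.0 once the mean shifts
```

Unlike MSE, the index is bounded above by 1 and is attained exactly when the two windows are identical.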
Chapter 3
Image Denoising Algorithms:
Review
In this chapter we attempt to capture the research work and methods developed till
now for the open challenge of image denoising.
There exist several ways to classify existing image denoising algorithms. The three
popular approaches to classify them are:
• Inspired by image processing field concepts [59] - spatial domain, transform
domain and neural network (NN) based methods
• based on popular families in image restoration methods [10] - learning-based meth-
ods and learning-free methods
• based on how an image prior is exploited to generate a high-quality estimation
with respect to an input image [57] - implicit and explicit methods
Selecting the most intuitive categorization of the three, we will classify existing
denoising algorithms into spatial domain, transform domain and NN based methods.
Furthermore, we discuss the prior work in chronological order; as the classical spatial
and transform domain algorithms have been thoroughly reviewed in previous papers
[15], [60], we focus more on recently proposed NN based algorithms.
3.1 Spatial domain methods
Spatial domain technique is a traditional denoising method. It is a technique that is
directly applied to images in the form of spatial filters for noise removal [61]. Spatial
domain methods can be further sub-categorized into - spatial domain filtering (SDF)
and variational denoising methods [59].
3.1.1 Spatial domain filtering
Spatial domain filtering methods can also be grouped as implicit methods as per point
(2) above, and they can be divided into further two classes - linear and non-linear
filtering. Linear filters tend to blur sharp edges, destroy lines and other fine image
details, and perform poorly in the presence of signal-dependent noise [59]. For example,
a mean filter (linear filter) is optimal for Gaussian noise in the sense of mean square
error, but it tends to over-smooth images with high noise. The Wiener filter was introduced
to combat this disadvantage, but it also can easily blur sharp edges.
Whereas by using non-linear filters, such as median filtering [62][63] and weighted
median filtering [64], noise can be suppressed without explicitly identifying it. For example,
bilateral filtering [65] is widely used for image denoising, as it is a non-linear, edge-
preserving, and noise-reducing smoothing filter.
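As a concrete illustration of impulse-noise removal, a 3×3 median filter (here via SciPy's `scipy.ndimage.median_filter`) discards isolated salt impulses while leaving flat regions untouched. This is a toy sketch, not an experiment from this thesis:

```python
import numpy as np
from scipy.ndimage import median_filter

# Flat gray image corrupted by isolated "salt" impulses on a sparse grid,
# so no 3x3 window ever contains more than a couple of impulse samples.
clean = np.full((16, 16), 0.5)
noisy = clean.copy()
noisy[::5, ::5] = 1.0

# The median filter replaces each pixel by the median of its 3x3
# neighborhood; isolated impulses never dominate a window and vanish.
denoised = median_filter(noisy, size=3)
print(np.abs(denoised - clean).max())   # 0.0 -- all impulses removed
```

A linear mean filter on the same input would instead smear each impulse over its neighborhood, which is exactly the over-smoothing behavior discussed above.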
SDF methods, in general, adopt priors of high-quality images implicitly, where the
priors are ingrained into specific restoration operations. Such an implicit modeling
strategy was used in most of the early years’ image denoising algorithms, some of which
are discussed above [65][66][67][57].
Based on assumptions about superior-quality images, heuristic operations have been
designed to generate estimations directly from the degraded images; for example,
filtering-based methods are built on the smoothness assumption.
3.1.2 Variational denoising methods
Besides implicitly embedding priors into restoration operations, variational denoising
methods explicitly characterize image priors and subsequently use the Bayesian method
to produce high quality reconstruction results. Having the degradation model p(y|x)
and specific prior model p(x), different estimators can be used to estimate latent image
x. One popular approach is the maximum a posteriori (MAP) estimator, where

x̂ = arg max_x p(x|y) = arg max_x p(y|x)p(x)    (3.1)
(using Bayes’ theorem), with which we seek, given the corrupted observation and the
prior, the most probable estimation of x. In the case of AWGN (additive white Gaussian
noise), the above equation 3.1 can be reformulated as an objective function that is the
sum of a data fidelity term (least squares) and a regularizer (discussed in more detail in
the chapters ahead).
Thus, for the variational denoising methods, the key is to find a suitable image
prior; for example, some successful prior models include gradient priors, non-local self-
similarity (NSS) priors, sparse priors, and low-rank priors [59].
Total variation (TV) regularization
TV regularization is defined on the statistical fact that natural images are locally
smooth, and the pixel intensity gradually varies in most regions [59]. TV regularization
uses a Laplacian distribution to model image gradients, resulting in an l1 norm penalty
on the gradients of the estimated image. Mathematically, it can be defined as [68]:
RTV (x) = ||∇x||1, where ∇x is the gradient of x.
Total variation is one of the most extensively used image priors: it promotes sparsity
in image gradients, allows the optimal solution to be computed effectively, and can also
retain sharp edges. Although it has been shown to be beneficial in a number of applications
[69][70][71] and is one of the most notable methods for image denoising, it has a few
limitations. The three main disadvantages are that 1) textures tend to be over-smoothed,
2) flat areas are approximated by a piece-wise constant surface, resulting in a stair-casing
effect, and 3) the resultant image suffers from losses of contrast [72][68][73][59].
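The anisotropic TV penalty RTV(x) = ||∇x||1 above is simply the sum of absolute forward differences, which a short NumPy sketch makes concrete (the ramp image and the noise level are illustrative choices):

```python
import numpy as np

def tv_norm(x):
    # Anisotropic total variation: l1 norm of horizontal and vertical
    # forward differences, i.e. ||grad x||_1.
    dh = np.abs(np.diff(x, axis=1)).sum()
    dv = np.abs(np.diff(x, axis=0)).sum()
    return dh + dv

rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))   # gentle ramp image
noisy = smooth + 0.1 * rng.standard_normal(smooth.shape)
print(tv_norm(smooth) < tv_norm(noisy))   # True -- noise inflates TV
```

This is exactly why TV works as a regularizer: noise adds many small oscillations, each of which contributes to the l1 gradient sum, so penalizing TV pushes the estimate toward locally smooth images.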
Extensive studies have been conducted to improve on the TV regularizer’s performance in
image smoothing by adopting partial differential equations, while some have also proposed
wavelet filters as analysis sparse filters [74][75]. Beck et al. [76] proposed a fast gradient-
based method for constrained TV, which is also a generic framework covering other
types of non-smooth regularizers. [77] proposed a statistical model to capture the
heavy-tailed distribution of coefficients via a robust penalty function - the lp norm - and
[78] introduced a normalised sparsity measure.
3.2 Transform domain methods
In contrast to spatial domain filtering methods, transform domain (TD) filtering
methods first transform the given noisy image to another domain and then apply a
denoising procedure on the transformed image based on the different image and noise
characteristics (larger coefficients denote the high-frequency part, i.e., the details or
edges of the image, whereas smaller coefficients denote the noise). These methods
operate on an important observation that the characteristics of image information and
noise are different in the transform domain. Furthermore, the transform domain filtering
methods can be subdivided based on the chosen basis transform functions - it may be
data-adaptive or non-data adaptive [79].
3.2.1 Data adaptive methods
Examples of data-adaptive methods are Independent Component Analysis (ICA)
[80][86] and PCA [81][82] functions. Both of them are adopted as the transform tools on the
given noisy images. One of the main disadvantages of data-adaptive methods is that
they have high computational costs because they use sliding windows and need a sample
of noise-free data or at least two image frames from the same scene [59]. However, it is
entirely possible that in some applications, it might be challenging to obtain noise-free
training data.
3.2.2 Non Data adaptive methods
The non-data adaptive TD filtering methods can additionally be subdivided into two
domains - the spatial-frequency domain and the wavelet domain. We will not discuss the
spatial-frequency domain in this work, but we will briefly mention the wavelet transform,
as it is one of the most researched transform techniques. The wavelet transform [83]
breaks down the input data into a scale-space representation. It has been shown that
wavelets can successfully remove noise while preserving the image characteristics,
regardless of its frequency content [84][85][86][87][59].
3.2.3 Block-matching and 3D filtering: BM3D
BM3D [1] is a non-local, adaptive, non-parametric filtering strategy for image denoising,
based on an enhanced sparse representation in the transform domain. The enhancement
of the sparsity is attained by grouping similar 2D fragments of the image into 3D
data arrays, called ”groups”. To deal with these 3D groups, a special procedure
called collaborative filtering is used. This procedure includes three consecutive steps:
the 3D transformation of a group, shrinkage of transform spectrum, and inverse 3D
transformation. Thus, the 3D estimate of the group is obtained, which consists of an
array of jointly filtered 2D fragments. Due to the similarities between the grouped
blocks, the transform can achieve a highly sparse representation of the true signal so
that the noise can be well separated by shrinkage. Thus, collaborative filtering exposes
even the finest details shared by grouped fragments, and at the same time, it preserves
the important unique features of each individual fragment [1][88].
Figure 3.1: Scheme of the BM3D algorithm. (credits: Marc Lebrun [9])
This is one of the most popular, powerful, and effective denoising methods and
had been state-of-the-art until recently. After the original work of [1], many improved
versions of BM3D were also developed [89][90]. [90] proposed the block-matching and
4D filtering (BM4D) method. Dabov et al., 2007 [91] proposed an improvement of the
BM3D method for color image denoising that exploits filtering in a highly sparse local
3D transform domain in each channel of a luminance-chrominance color space. Further,
many follow-up works combined the sparse prior and NSS prior [92]. [93] collected non-
local similar patches to solve the group-sparsity problem to achieve better denoising
results. [94][57] proposed a non-local centralized sparse representation model in which
the mean value of the representation coefficients is pre-estimated based on the patch
groups.
3.3 Deep Neural Network methods
The original deep learning technologies were first used in image processing in the 1980s
[95] and were first used in image denoising by Zhou et al. [96][97][98]. After that, a
feed forward network was used to reduce the high computational costs and to make a
trade-off between denoising efficiency and performance [99]. The feed-forward network
can smooth the given corrupted image by Kuwahara filters [100], which were similar to
convolutions. Although these techniques were effective, these networks did not allow
the addition of new plug-and-play units, which restricted their generalization abilities
and usage in practical applications.
To overcome these limitations, convolutional neural networks (CNNs) [101] were
proposed. They had a slow start after their introduction due to a number of then-existing
issues, such as the vanishing gradients problem, the high computation cost of activation
functions such as sigmoid and tanh, and the lack of appropriate hardware for efficient
computation. But after the inception of AlexNet in 2012, things changed, and deep
network architectures were widely applied in fields such as video, natural language
processing, and speech processing. In recent years, several CNN-based denoising methods
have been proposed [102][103][104][2][105][59]. Compared to that of [106], the performance
of these methods has been greatly improved. Furthermore, neural network based (or CNN-
based) denoising methods can be divided into two categories: multilayer perceptron
(MLP) models and deep learning methods [59]. We discuss both categories briefly.
A multilayer perceptron (MLP) is a class of feed-forward artificial neural networks
(ANN) [107]. Some popular MLP-based image denoising models include auto-encoders
proposed by Vincent et al. [103] and [104]. Chen et al. [102] introduced a feed-forward
deep network called the trainable non-linear reaction diffusion (TNRD) model, which
achieved a better denoising effect. In general, MLP-based methods are beneficial as they
work efficiently owing to fewer inference procedure steps. Moreover, because optimiza-
tion algorithms [108] have the ability to derive the discriminative architecture, these
methods have better interpretability. On the other hand, interpretability can
increase the cost of performance [2]. [109] presented a patch-based denoising algorithm
that is learned on a large dataset with an MLP. Results on additive white Gaussian
(AWG) noise were competitive, but the method had its limitations, as generalization
performance for other noise models was not competitive.
Deep networks were first applied to image denoising tasks in 2015 [110][111]. The
authors of [110] pretrained a stacked denoising auto-encoder and applied dropout to
prevent co-adaptation between units. The combination of dropout and the stacked
auto-encoder enhanced performance and reduced time in the fine-tuning phase. For addressing multiple
low-level tasks via a model, a denoising CNN (DnCNN) [2] consisting of convolutions,
batch normalization (BN) [112], rectified linear unit (ReLU) [113] and residual learning
(RL) [7] was proposed to deal with image denoising and, other image restoration tasks.
CNN-based denoising methods were thus a success, which is attributed to their large
modeling capacity and to tremendous advances in network training and design. However,
discriminative denoising methods at that time (2018) were limited in flexibility: the
learned model was usually tailored to a specific noise level, and methods like [2] did not
generalize well to noise models other than the AWGN on which they were trained.
To mitigate the above-said limitations, [114] introduced a fast and flexible denoising
CNN (FFDNet), which takes the noise level together with the noisy image patch as input
to the denoising network, improving denoising speed and handling blind denoising.
A generative adversarial network (GAN) CNN blind denoiser (GCBD) [115] resolved the
problem of handling unpaired noisy images by first generating the ground truth, then
using the obtained ground truth as input into the GAN to train the denoiser. For more
complex corrupted images, a deep plug-and-play super-resolution (DPSR) method [114]
was developed to estimate blur kernel and noise and recover a high-resolution image.
Thus, in this section, we discussed existing neural network based methods for image
denoising, including both CNN and MLP based methods, both of which are learning-based
approaches. In the next chapter, we will focus on learning-free approaches like ”Deep
Image Prior”, whose intuition directly contradicts that of learning-based methods and
which are the foundation of most current state-of-the-art image denoising methods.
Chapter 4
Deep Image Prior
In this chapter, we introduce the concept of Deep Image Prior (DIP) and discuss in
detail the model architecture and usage. After that, we introduce a few other works
that are built upon this concept.
4.1 Image Priors
Before discussing the seminal work of [10], it is crucial to understand the concept of
image priors. Image priors are prior information on any set of images [110] that one
can use in image processing (or computer vision) problems to enhance results, ease the
choice of processing parameters, resolve indeterminacies, etc. These priors, or their
approximations, can be converted into mathematical formulations and used as part
of some central mechanism or procedure (or algorithm).
4.2 Deep Image Priors
Ulyanov et al. [10] show that the structure of a generator network is sufficient to
capture a great deal of low-level image statistics prior to any learning. What makes
this idea outstanding is that it directly contradicts the traditional understanding
that the excellent performance of deep convolutional networks on denoising tasks is
imputed to their ability to learn realistic image priors from a large number of example
images. With their novel concept, there is no need to train a network on a dataset, or
even to perform any training at all.
To show this, we apply untrained Convolutional Neural Networks (ConvNets), and
instead of training a ConvNet on a large dataset of sample images, we fit a generator
network to a single corrupted image. This way, the network weights serve as a parame-
terization of the restored image, and the weights are randomly initialized and fitted to a
specific degraded image under a task-dependent observation model. In such a way, the
only information used to perform image reconstruction is contained in the single noisy
input image and the handcrafted structure of the network used for reconstruction [10].
The paper [10] contributes the following four important empirical results:
• Low-level statistics can be captured by an untrained network.
• The model parameterization presents a high impedance to image noise, and hence,
it can be naturally used to filter out noise from a given image. Also, the DIP
method can work under blindness assumption.
• Choice of deep generator ConvNet architecture does have an impact on results as
different architectures impose rather different priors.
• DIP is similar to BM3D, one of the most popular transform-domain techniques, in
the respect that they both exploit self-structure and similarity.
4.2.1 Method
First, it is important to see the mathematical formulation of the idea of deep image
priors, upon which some important concepts ahead pivot.
A function with a one-dimensional input and a multidimensional output can be
thought of as drawing a curve in space, and such a function is called a parametric
function (its input is called a parameter) [116]. Deep generator network is an example
of a parametric function that maps a code vector z to an image x [31].
x = fθ(z) (4.1)
If we interpret the neural network as a parameterization, given in equation 4.1, of the
image x ∈ R^(3×H×W) (channels, height, width), then in this perspective the code z
is a fixed randomized tensor z ∈ R^(C0×H0×W0). The neural
network then can be viewed as mapping the parameters θ (weights from different layers,
bias of the filters in the networks) to the input image x.
To model conditional image distributions p(x|x0), where x is a natural image and x0
its corrupted version, for image restoration problems (denoising) we can view such tasks
as an energy minimization problem [10][117]:

x∗ = arg min_x E(x; x0) + R(x)    (4.2)
where E(x; x0) is a data term (that is task-dependent), and R(x) is a regularizer
which is not tied to a specific application and captures the generic regularity of natural
images. Instead of using an explicit regularizer term in Equation 4.2, the work
[10] shows that using the implicit prior captured by the neural network parameterization
performs better for all image restoration tasks. Thus, the formulation can be rewritten
as [10]:
θ∗ = arg min_θ E(fθ(z); x0),    x∗ = fθ∗(z)    (4.3)
The local minimizer θ∗ can be obtained using an optimizer such as gradient descent,
starting from a random initialization of the parameters θ (Figure 4.1). As the
only information available to the restoration task is the noisy image x0, given the above
equation 4.3, the denoised result is obtained as x∗ = fθ∗(z).
Another important fact demonstrated by [10] was that the choice of network
architecture has a major impact on how the solution space is searched by methods such
as gradient descent. This was an important observation because, even though almost
any image can be fitted by the model, they empirically show that the choice of
architecture has a different impact on performance for different image restoration tasks,
and that the network resists “bad” solutions and descends much more quickly towards
naturally-looking images. The result is that minimizing Equation 4.3 either results in
a good-looking local optimum or, at least, the optimization trajectory passes near
one, as shown in Figure 4.1 [10].
Therefore, to understand this idea better, we can view it mathematically: given
a basic reconstruction task where the target image is x0 and we want to find the values of
parameters θ∗ that reproduce the original image, the E(x; x0) term in equation 4.3 can
Figure 4.1: Image space visualization for DIP. Assume the problem of reconstructing an
image xgt from a degraded measurement x0. In the example of denoising, the
ground truth xgt has non-zero cost E(xgt, x0) > 0. Here, if run for long enough, fitting
with DIP will acquire a solution with near-zero cost quite distant from xgt. However,
often the optimization path will pass close to xgt, and early stopping (here at time
t3) will recover a good solution. Source: Ulyanov et al. [10]
be modelled as the L2 distance that compares the generated image x with x0:

E(x; x0) = ||x − x0||²    (4.4)

⇒ min_θ ||fθ(z) − x0||²    (4.5)
fθ(z) is (typically) a deep CNN with a U-shaped architecture [118]. One reason such
architectures are preferred is that one can draw samples from a DIP by taking random
values of the parameters θ and looking at the generated images fθ(z). Equivalently, this
means we can visualize the starting points of the optimization process (Eq. 4.3) before
even fitting the parameters to the noisy input image. Also, [10] empirically shows
that the samples exhibit spatial structures and self-similarities, and that the scale of these
structures depends on the network depth. Therefore, adding skip connections results
in images that contain structures of different characteristic scales, as is desirable for
modeling natural images. It is then natural that such architectures are the most popular
choice for generative ConvNets.
This U-shaped architecture is an encoder-decoder (”hourglass”) network with
skip connections. LeakyReLU is used as the activation function, and the ADAM
optimizer is used for fitting.
Figure 4.2: Figure depicting the image restoration process using DIP. Starting from
random weights θ0, one iteratively updates them in order to minimize the data term, eq.
(4.3). At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a
fixed tensor and the mapping f is a neural network with parameters θ. The image x is
used to calculate the task-dependent loss E(x, x0). The loss gradient w.r.t. the weights
θ is then calculated and used to update the parameters. Source: Ulyanov et al. [10]
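The loop described in the caption above can be sketched in PyTorch. This is a toy stand-in, not the authors' implementation: a small ConvNet replaces the hourglass fθ, the "noisy observation" is random data, and a fixed iteration budget stands in for early stopping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the hourglass generator f_theta: a tiny ConvNet
# (the real DIP uses a deep encoder-decoder with skip connections).
f = nn.Sequential(
    nn.Conv2d(8, 16, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

x0 = torch.rand(1, 1, 32, 32)   # "noisy observation" (toy data here)
z = torch.randn(1, 8, 32, 32)   # fixed random code tensor z

opt = torch.optim.Adam(f.parameters(), lr=1e-2)
losses = []
for t in range(200):            # fixed budget stands in for early stopping
    opt.zero_grad()
    loss = ((f(z) - x0) ** 2).mean()   # E(f_theta(z); x0) from eq. (4.3)
    loss.backward()
    opt.step()
    losses.append(loss.item())

x_star = f(z).detach()          # restored image x* = f_theta*(z)
```

Only the weights θ are updated; z never changes, so all information about the output image is absorbed into the network parameters, exactly as in equation 4.3.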
Figure 5.1 in the next chapter depicts the hourglass architecture. All works that improve
over DIP have architectural details similar to those described here.
4.2.2 Important Results
The DIP paper shows experimental results for various image restoration tasks, including
single image denoising. They also claim that their model can work under the assumption
of blind denoising - where we do not know the noise model and can successfully recover
images from complex corruptions. The paper has also shown results for the Gaussian
noise model, but not for impulse noise, shot noise, or other noise models. One important
observation is that the DIP method’s performance for the Gaussian noise model is
similar to that of non-local learning-free approaches like CBM3D [91], but it outperforms
them for non-Gaussian noise models.
A hand-crafted prior is one in which we embed hard constraints and teach, from
synthesized data, what types of images are faces, natural images, etc. As no part of the
neural network fθ is learned from a dataset prior to this, such a deep image prior
is effectively hand-crafted, and empirically it is shown to outperform many standard
non-learning priors such as TV [119] and BM3D [90], as well as a few learning-based
approaches.
4.3 Related Work
The DIP method is similar to works that exploit the self-similarity properties of natural
images and do not undergo training on a hold-out set. In that regard, this approach
is similar to the BM3D approach (Section 3.2.3) and the Non-Local Means algorithm
[120]. These methods avoid any training and use hand-crafted priors.
The DIP work [10] has demonstrated a remarkable phenomenon that CNNs can
be used for solving image restoration problems without any offline training and exter-
nal data. Since then, many algorithms have been developed that improve upon this
extraordinary idea. [121] describes an improvement using ”backprojection” (BP).
As we know, image restoration tasks can be formulated as the minimization of a cost
function composed of a fidelity term and a prior term. We saw this type of formulation
in Section 4.2.1, and it can be further generalized as follows:
min_x l(x, y) + βs(x)    (4.6)
where l is the fidelity term, s is the prior term, and β is a positive parameter that
controls the level of regularization [121].
The backprojection fidelity term was first introduced in 2018 by [23] as an alternative to
the widely used least squares (LS) fidelity term [121], l(x, y) = (1/2)||y − Ax||²₂, and
empirically it has been shown that this fidelity term, for the different priors we have
discussed so far (e.g., TV, BM3D and pre-trained CNNs), yields better recoveries than
LS for badly conditioned A and requires fewer iterations of optimization algorithms. In
[121], they demonstrate the use of the BP fidelity term to improve the performance
of standard DIP (which uses the LS fidelity term as the loss function). However, the
paper only examines performance for image deblurring tasks; it still remains to evaluate
this method’s performance on the remaining image restoration tasks such as image
denoising and super-resolution. In another line of work, Cheng et al. [122] show that
by conducting posterior inference using stochastic gradient Langevin dynamics, one
can avoid the need for early stopping, which is a major limitation of the current DIP
approach, and improve results for image restoration tasks. They prove that the DIP is
asymptotically similar to a stationary Gaussian process prior as the number of channels
in each layer of the network goes to infinity (in the limit), and they derive the corresponding
kernel [122]. A Gaussian process is an infinite collection of random variables for which
any finite subset is jointly Gaussian distributed [122][123]. In another training-free ap-
proach, [124] presents a self-supervised learning method for single-image denoising. In
the introduced method, the network is trained with dropout on the pairs of Bernoulli-
sampled instances of the input image. The result is then estimated by averaging the
predictions generated from multiple instances of the trained model with dropout. The
authors empirically show that the proposed method not only significantly outperforms
existing single-image non-learning methods but is also competitive with denoising
networks trained on external datasets. However, it still requires dropout and might be
unstable without early stopping.
4.4 Limitations of Deep Image Priors
DIP is one of the most popular methods for reconstruction tasks and was state-of-the-art
until recently. Though the method has many merits, it has many limitations as
well. One of the most significant unsolved issues for the DIP method is the need
for early stopping: unless we employ early stopping, the PSNR (or SSIM) values will
drop after some number of iterations. [10] also employs early stopping, and nearly all
of the research that came after DIP uses some form of early stopping.
Another issue is that there is no clear explanation of why the image
prior emerges, or why these priors fit the structure of natural images so well;
there are no definitive answers to these questions that explain their effectiveness. With
respect to DIP’s practical applications, there are two main problems: it is extremely slow
and is unable to match or exceed the results of problem-specific methods [10]. For
example, we saw in subsection 4.2.2 that the performance for the Gaussian noise
model in image denoising did not significantly outperform non-local state-of-the-art
methods like CBM3D [91] or NLM [120]. In the work by Ulyanov et al. [10], the image
corruption rate used for the image denoising task is less than 0.5, so it remains
to be seen whether this method is efficient for higher corruption rates with respect to the
execution-time and performance trade-off.
Chapter 5
Rethinking Single Image
Denoising
So far, we have discussed background concepts related to the image denoising task and a
few of the earlier methods that attempt to tackle it. We then discussed some deep
learning approaches to this open task, specifically deep image priors. In this chapter we
will introduce a few potential approaches to address the limitations of DIP mentioned in
the last chapter and describe the experiments we conducted as a part of this thesis work.¹
5.1 Over-parameterisation in deep learning
Although DIP is an extremely popular method, it has its limitations. As we gain
better insight and understanding regarding the generalizing abilities of DNNs, a new
concept of learning over-parameterized models has emerged; it has become a crucial
topic in machine learning since 2017 [125][11][126][127].
Over-parameterization occurs when the number of learnable parameters is much
larger than the number of the training samples (or equivalently when we fit a richer
model than necessary). Deep artificial neural networks operate in this regime where
they have far more trainable model parameters than the number of training exam-
ples. Nevertheless, some of these models exhibit remarkably small generalization error,
i.e., the difference between “training error” and “test error”. Traditional learning
theory suggests that when the number of parameters is large, some form of regularization
is needed to ensure a small generalization error [128]. But recent research has shown
results that contradict traditional learning theory, finding that over-parameterization
empirically improves both optimization and generalization.
¹ Work done under project investigator Taihui Li and as a part of the Sun research group.
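As a concrete illustration (a hypothetical sketch of our own, not a network from this thesis), the parameter count of even a modest fully connected network can dwarf a typical training-set size:

```python
def count_params(layer_widths):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))

# A modest MLP on 32x32 RGB inputs (3072 features), two hidden layers of 512 units:
n_params = count_params([3072, 512, 512, 10])
n_train = 50_000  # a CIFAR-10-sized training set, for comparison
print(n_params)   # 1841162 -- far more parameters than training samples
```

Such a network operates squarely in the over-parameterized regime described above, yet networks like it routinely exhibit small generalization error.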
5.1.1 Over-parameterisation vs. over-fitting
It is important to remark that over-parameterization and over-fitting are two different
phenomena, and over-parameterization does not necessarily lead to over-fitting. When
conventional learning theory states that over-parameterization leads to over-fitting, the
parameters concerned belong to the hypothesis space from which the classifiers are
constructed, whereas in deep neural networks such parameters include those of the
classifier-construction part (fully connected layers). Thus, learning theory is mostly
concerned with the training of a classifier (learner) on a classification task from a given
feature space, but it says little about the construction of the feature space itself. So,
though we can use the conventional theory to reason about generalization, we must be
cautious when this theory is applied to representation learning. This point is demon-
strated and debated convincingly in [129].
5.1.2 Regularisation
Regularisation is a collective group of strategies explicitly designed to reduce test
error so that an algorithm performs well not only on training data but also on unseen
inputs. Many forms of regularization are available to the deep learning practitioner;
in fact, developing more effective regularization strategies has been one of the major
research efforts in the field. There are two major types of regularisation relevant to
our thesis discussion - implicit and explicit regularisation.
• Regularization introduced either as an explicit penalty term or by modifying opti-
mization through, e.g., dropout, weight decay, or one-pass stochastic methods can
be referred to as explicit regularisation. A lot of work has been done on under-
standing the effects of explicit regularisation on training data and deep learning
models’ performance.
• Implicit regularization means that some form of regularization is being intro-
duced implicitly by the model or its training procedure. For example, Neyshabur
et al., 2014 [130] reason that the low generalization error seen with overparame-
terized models is caused by an implicit regularization introduced by the optimiza-
tion of the network. The optimization objectives for learning high-capacity (over-
parameterized) models have many global minima that fit the training data
perfectly. Implicit regularization was adopted in DIP as well.
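A minimal, self-contained illustration of implicit regularization (a sketch of our own, not an experiment from this thesis): on an underdetermined least-squares problem, plain gradient descent initialized at zero converges to the minimum-norm solution among the infinitely many solutions that fit the data perfectly, without any explicit penalty term:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 50                       # far fewer equations than unknowns
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = np.zeros(n)                     # zero initialization is essential here
lr = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(5000):
    x -= lr * A.T @ (A @ x - b)     # gradient of 0.5 * ||Ax - b||^2

# Gradient descent never leaves the row space of A (each update is A^T times
# something), so it lands on the minimum-Euclidean-norm interpolant,
# i.e. the pseudoinverse solution:
x_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(x, x_min_norm))
```

The optimizer itself, not any penalty term, selects one particular global minimum; this is the sense in which optimization regularizes implicitly.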
5.2 Low-rank matrix recovery problem
In this section, we will see the intuition behind the low-rank matrix recovery problem
and prior work that attempts to solve this issue. It is important to study this problem
because the network f(θ) in DIP [10] (chapter 4) has a U-shaped architecture and can
be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization
X = UU^T. Therefore, DIP also inherits the drawbacks of the exact-parameterization
approach for low-rank matrix recovery: namely, it requires either a meticulous choice
of network width or early stopping of the training process [11].
Low-rank matrices play an essential role in modeling and computational methods for
machine learning. They lay the foundation for both classical techniques such as principal
component analysis [131][132][133] as well as modern approaches to multi-task learning
[134][135] and natural language processing. Specifically, they have broad applications
in face recognition [136] (where saturation in brightness, self-shadowing, or specularity
can be modeled as outliers), video surveillance [136][11] (where the foreground objects
are usually modeled as outliers), and beyond.
However, the matrices we are interested in can be extremely large. Although memory
costs (and costs for acquiring data) are getting cheaper, this only encourages bigger
matrix sizes. This causes a number of issues, primarily that fully observing the matrix
of interest can prove to be an impossible task; the observations can also be corrupted
with large errors. In such a case, we are left with a highly incomplete set of observations,
and unfortunately, many of the most popular approaches to processing the data in
low-rank matrix applications assume that a fully sampled data set is available and are
generally not robust to missing or incomplete data. Thus, we have an inverse problem
of retrieving the full matrix from these incomplete observations. While such a recovery
is not possible in general, when the matrix is of low rank it is possible to exploit this
structure and execute the recovery in an astonishingly efficient manner [133]. Therefore,
low-rank matrix recovery is an essential step towards solving many of the actual
applications discussed above.
There are several methods to solve low-rank matrix recovery problems, of which the
most commonly used in practice are low-rank approximation, nuclear norm minimization,
iterative hard thresholding, and alternating projections.
A long-established method for low-rank matrix recovery is nuclear norm mini-
mization. Such a method is provably accurate under certain incoherence conditions
[136][137]. However, minimizing the nuclear norm involves expensive computations of
the singular value decomposition (SVD) of large matrices [133] (when n is large), which
forbids its application to problem sizes of practical interest. But these issues have been
mitigated by the recent development of matrix factorization methods [138][139]. These
methods rely on parameterizing the signal X ∈ R^(n×n) via the factorization X = UU^T.
This gives rise to a non-convex optimization problem with respect to U ∈ R^(n×r), where
r is the rank of X∗ [140].
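As a toy sketch of the factorization approach (illustrative only; the dimensions, step size, and iteration count are arbitrary choices of ours, not values from the cited works), gradient descent on the over-parameterized factorization X = UU^T can recover a low-rank PSD matrix without ever computing an SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3
U_true = rng.standard_normal((n, r))
X_star = U_true @ U_true.T              # the rank-r PSD matrix to recover

# Over-parameterize with r0 = n (no prior knowledge of r) and a small init.
U = 1e-3 * rng.standard_normal((n, n))
lr = 1e-3
for _ in range(10000):
    grad = 4 * (U @ U.T - X_star) @ U   # gradient of ||U U^T - X*||_F^2
    U -= lr * grad

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
print(rel_err)                          # small: the factorization fits X*
```

The small initialization is what keeps the over-parameterized factor from overfitting the full-rank search space; this is the same algorithmic bias exploited in the next section.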
5.3 Rethinking Single Image Denoising: Main Ideas
So far, we have discussed the fundamental concepts of over-parameterization, regular-
ization, and low-rank matrix recovery, all of which are vital for understanding the next
two ideas, presented in recent works by You et al. [11] and Jing et al. [141]. Finally, we
will present an ensemble approach that combines concepts from these two papers and
argue why the combination of these two ideas might provide insight for solving current
DIP limitations and be a step towards robust image recovery.
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates
In [11], the authors discuss how the challenges associated with exact-parameterization
methods can be simply and effectively dealt with via over-parameterization and dis-
crepant learning rates. Their arguments are supported by the recent results in [126][142]
for low-rank matrix recovery.
The key to the success of this method is the notion of implicit bias of discrepant
learning rates. The idea is that the algorithmic low-rank and sparse regularizations
need to be balanced in order to discern the underlying rank and sparsity. In the absence
of a means for tuning a regularization parameter [11], the authors show that the desired
balance can be acquired by using different learning rates for different optimization
parameters. Below are the four subsections that summarize the main results and
algorithms discussed in this paper.
Double Over-Parameterization Formulation
In [11], the aim is to learn an unknown signal X∗ ∈ R^(n×n) from its grossly corrupted
linear measurements:
y = A(X∗) + s∗     (5.1)
where the operator A(·) : R^(n×n) → R^m, and s∗ ∈ R^m is a sparse corruption vector (this
formulation is similar to the one discussed in the Section on noise models). Equivalently,
it is the problem of recovering a rank-r (r ≪ n) positive semi-definite matrix X∗ from
its grossly corrupted linear measurements, as given in equation 5.1.
The work introduces a double over-parameterization approach for robust matrix
recovery, with the double parameterization X = UU^T and s = g ◦ g − h ◦ h:
min_{U ∈ R^(n×r0), {g,h} ⊆ R^m}  f(U, g, h) := (1/4) ||A(UU^T) + (g ◦ g − h ◦ h) − y||_2^2     (5.2)
where the dimensional parameter r0 ≥ r. Practically, the choice of r0 depends on how
much prior information we have about X∗: it can either be taken as an estimated upper
bound for r, or set to r0 = n with no prior knowledge.
Thus, the authors introduce a method that is based on over-parameterizing both
the low-rank matrix X∗ and the outliers s∗, thereby leveraging the implicit algorithmic
bias to find the correct solution (X∗, s∗).
Algorithmic Regularizations via Gradient Descent
In general, over-parameterization leads to under-determined problems which can have an
infinite number of solutions (analogous to linear algebra, where the number of parameters
exceeds the number of equations). Thus, not all solutions of the doubly over-parameterized
equation 5.2 will correspond to the desired (X∗, s∗). The paper empirically and theoreti-
cally shows that gradient descent iteration on equation 5.2 with properly selected
learning rates enforces an implicit bias on the solution path and thereby automatically
identifies the desired, regularized solution (X∗, s∗). Proof of the above ideas is beyond
the scope of this work.
Implicit Bias with Discrepant Learning Rates
Optimizing a linear multi-layer neural network via gradient descent leads to a low-rank
solution, and this phenomenon is known as implicit regularization. It has been exten-
sively studied in the context of matrix factorization [126][143][144], linear regression
[145][146], logistic regression [125], and linear convolutional neural networks [127]. It
is well known that optimization algorithms like gradient descent introduce implicit bi-
ases (without early stopping) [125] that play a crucial role in the generalization ability
of learned models, but how to control the implicit regularization of gradient descent is
still an open challenge. The authors also theoretically derive the value of the penalty (λ)
as the algorithm approaches convergence for the unconstrained Lagrangian formulation
of the rank-r matrix X∗. They further prove that the implicit regularization can be con-
trolled, without explicitly adding any regularization term to equation 5.2, by adapting
the ratio of the learning rates. This observation directly contradicts conventional
optimization theory [147], in which learning rates only affect the algorithm’s convergence
rate but not the quality of the solution.
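To make the discrepant-learning-rate idea concrete, here is a small synthetic sketch of our own (with A taken as the identity, hand-picked step sizes, and a symmetric sparse corruption; all of these are simplifying assumptions for illustration, not the exact setting or algorithm of [11]). Gradient descent on the doubly over-parameterized objective (5.2), with a larger step size on the sparse factors (g, h) than on U, recovers the low-rank matrix from small initializations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
U0 = rng.standard_normal((n, r))
X_star = U0 @ U0.T                               # rank-r PSD signal X*

# Symmetric sparse corruption s* (kept symmetric so the residual stays symmetric)
mask = rng.random((n, n)) < 0.08
S = np.where(mask, 5.0 * np.sign(rng.standard_normal((n, n))), 0.0)
S_star = np.triu(S) + np.triu(S, 1).T
Y = X_star + S_star                              # measurements with A = identity

U = 1e-3 * rng.standard_normal((n, n))           # r0 = n: no prior knowledge of r
g = 1e-3 * np.ones((n, n))
h = 1e-3 * np.ones((n, n))
lr_U, lr_s = 1e-3, 5e-3                          # discrepant learning rates

for _ in range(4000):
    R = U @ U.T + (g * g - h * h) - Y            # residual inside (5.2)
    U -= lr_U * (R + R.T) @ U / 2                # grad of (1/4)||R||_F^2 w.r.t. U
    g -= lr_s * R * g                            # ... w.r.t. g
    h += lr_s * R * h                            # ... w.r.t. h (grad is -R o h)

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
```

The ratio lr_s / lr_U here plays the role of the implicit regularization weight discussed above; it was picked by hand for this toy problem, whereas [11] characterizes its effect theoretically.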
Extension to Natural Images Denoising
Finally, combining the ideas from the above subsections to solve the image restoration
problem, the approach in [11] is inspired by [10]’s DIP method: they use the formulation
in equation 5.1 with X = φ(θ), where φ is a deep convolutional network and θ ∈ R^c
represents the network parameters.
Figure 5.1: Architecture used in [10], and also the base architecture for You et al. [11]:
an hourglass (also known as encoder-decoder) architecture, sometimes with skip
connections (shown in yellow). nu[i], nd[i], ns[i] correspond to the number of filters
at depth i for the upsampling, downsampling, and skip connections respectively. The
values ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al.
[10]
With respect to implementation details, the network φ(θ) is similar to the original
DIP work [10]. It has the same U-shaped architecture with skip connections, where each
layer contains a convolutional layer, a LeakyReLU layer, and a batch normalization layer.
The noise model for the images in You et al. [11] is salt-and-pepper noise.
Thus, the ideas presented in [11] are promising. Due to the algorithmic bias of
discrepant learning rates, the need to tune the network width or terminate early is
eliminated, because the method alleviates the problem of over-fitting for robust image
recovery. This advantage also enables the method to recover different image types with
varying corruption levels without the need to tune the network learning parameters;
that is, for different noise models you would not need to change the network width or
other learning parameters.
5.3.2 Implicit Rank-Minimizing Autoencoder
Autoencoders (AEs) are a popular category of methods for learning representations with-
out requiring labeled data; we discussed this in more detail in Section 2.4. An essential
component of autoencoder methods is the mechanism by which the information capacity
of the latent representation is minimized or limited. In [141], the rank of the covari-
ance matrix of the codes is implicitly minimized by relying on the fact that gradient
descent learning in multi-layer linear networks leads to minimum-rank solutions.
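The fact relied upon in [141] can be illustrated with a toy deep linear network (again a hedged sketch with sizes, step size, and stopping point of our own choosing; [141] inserts the linear layers inside an autoencoder, whereas here we use a bare two-layer linear factorization). Gradient descent from a small initialization fits the dominant low-rank structure of a noisy target first, so the learned product has low effective rank long before it fits the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y_clean = 10.0 * Q1 @ Q2.T                       # rank-3 signal, singular values = 10
Y = Y_clean + 0.2 * rng.standard_normal((n, n))  # full-rank noisy target

W1 = 1e-2 * rng.standard_normal((n, n))          # small init drives the low-rank bias
W2 = 1e-2 * rng.standard_normal((n, n))
lr = 1e-3
for _ in range(900):
    G = W2 @ W1 - Y                              # grad of 0.5 * ||W2 W1 - Y||_F^2
    W1, W2 = W1 - lr * W2.T @ G, W2 - lr * G @ W1.T

P = W2 @ W1
svals = np.linalg.svd(P, compute_uv=False)
eff_rank = int((svals > 0.05 * svals[0]).sum())  # effective rank of the product
```

Adding more linear factors sharpens this bias, which is the intuition behind inserting extra linear layers between encoder and decoder in the method proposed below.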
5.4 Proposed Methodology
Now that we understand the two main ideas presented in this chapter, we discuss an
ensemble method that is an amalgamation of the two. Exploiting the double over-
parameterization of the two low-dimensional structures in the image restoration ob-
jective, along with discrepant learning rates to regularize the optimization path, we can
forgo the need for early termination and parameter tuning, whereas adding additional
linear layers between the encoder and decoder ensures the minimum possible rank-
regularized solution, which ensures convergence.
With respect to network details, our method is broadly similar in structure to You et al.
[11], which, like DIP, uses a U-shaped architecture with skip connections; ours is a
U-shaped architecture with residual connections and some further modifications. We
change the constituents of the encoder-decoder blocks (shown in figure 5.1) from vanilla
ConvNet layers to ResNet blocks. Our method has multiple ResNet blocks, and each
block consists of three each of convolutional, batch normalization, and LeakyReLU
layers. This is in contrast to the original DIP, which did not have batch normalization
layers or LeakyReLU; You et al. [11] used ReLU activation instead. Also, we do not
over-parameterize our noise model as is done by You et al. [11]. Another change with
respect to DOP (we will sometimes refer to You et al.’s work as the ”Double Over-
Parameterized Prior” - DOP) is that we added three additional linear layers between
the encoder-decoder blocks, inspired by Jing et al. [141]. In the next chapter, we share
results with respect to different numbers of linear layers. In the default configuration,
we always use the l1 loss, as compared to DIP and DOP, which use the MSE loss.
Thus, towards finding an effective solution to our problem of single-image denoising,
we proposed a solution that ensures a minimum-rank regularized solution via double
over-parameterization of both the minimum-rank matrix signal and the sparse corruption
vector, leveraging implicit algorithmic bias. In this chapter, we laid the foundations of
this approach with theoretical arguments. In the next chapter, we detail the experimental
steps and results for our approach.
Chapter 6
Preliminary Experiments
In this chapter, we present results and analyze observations from some preliminary exper-
iments conducted during the course of this thesis work. The results stated below
are part of an ongoing effort towards solving some of the limitations of single-image
denoising (as discussed in chapters 4 and 5).
6.1 Dataset
In this thesis work, we have focused on image restoration, exclusively single-image de-
noising. For developing the algorithm, we use the popular set of images that is used
widely in almost all image denoising works - a set of 8 images: Lena, peppers, F16-GT,
Barbara, lake, Kodak, baboon, and snail. These are the standard benchmarking images,
and we have also tested our algorithm on a subset of this set - F16 and Lena.
6.2 System Configuration
We used Google Colab Pro with an NVIDIA Tesla P100 GPU, 16 GB of system RAM,
and 100 GB of Google cloud storage for development and experimentation. In our work,
we have used PyTorch 1.8.1.
6.3 Hyper-parameter tuning
We have tuned our models with all possible combinations of the values mentioned below
for the following parameters: learning rate, optimization function, kernel size, activa-
tion function, and input noise model. We varied the learning rates from 1e-5 to
0.1. The optimizers used are Adam [148] and SGD [149], and the activation functions
used are ReLU and Leaky-ReLU [150]. The input noise models are Gaussian and salt-
and-pepper, with corruption rates varying from 20% to 90%.
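As a reference for how such corrupted inputs can be generated (a generic sketch, not the thesis's actual data pipeline; the function name and defaults are our own):

```python
import numpy as np

def add_salt_and_pepper(img, rate, seed=None):
    """Corrupt a fraction `rate` of pixels in a float image scaled to [0, 1]:
    roughly half of the corrupted pixels become 1.0 (salt), half 0.0 (pepper)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    corrupt = rng.random(img.shape) < rate      # which pixels to corrupt
    salt = rng.random(img.shape) < 0.5          # salt vs. pepper, per pixel
    out[corrupt & salt] = 1.0
    out[corrupt & ~salt] = 0.0
    return out

noisy = add_salt_and_pepper(np.full((256, 256), 0.5), rate=0.6, seed=0)
```

At rate=0.6, roughly 60% of the pixels end up saturated, matching the corruption levels reported in the results below.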
6.4 Results and Observations
In the figures below, we show results for three different methods - DIP, You et al. [11],
and our proposed approach (outlined in the previous chapter). All results shared use
SGD as the optimizer and learning rates of 0.01, τ = 1, and 0.1. For DIP, we use a
variant of the original method where the loss function used is the l1 loss; DIP-l1 gives
far better results than MSE DIP. Similarly, for You et al. and our proposed approach,
we use the l1 loss. We report results using PSNR values, where for DIP-l1 we output
the last iterations averaged using an exponential sliding window (as reported in the
paper, the averaged output of the model gives excellent results in blind image denoising).
Below we show results for our architecture on the image Lena. The noise models used
are salt and pepper with various corruption levels (from 0.5 to 0.9) and Gaussian noise
with σ = 25.
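The two quantities used throughout this section can be computed as follows (a self-contained sketch; the peak value and the smoothing factor are illustrative assumptions, not the thesis's exact settings):

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between an estimate and a reference image."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ema_update(avg, out, beta=0.99):
    """Exponential sliding-window average of iterates, as used to report DIP-l1 output."""
    return out if avg is None else beta * avg + (1.0 - beta) * out

# For instance, an estimate off by a constant 0.1 from its reference scores ~20 dB:
print(psnr(np.full((8, 8), 0.6), np.full((8, 8), 0.5)))  # ≈ 20.0
```

Averaging the last iterates with `ema_update` smooths out iteration-to-iteration fluctuations before the PSNR is reported.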
In Figure 6.1, we depict Lena, its 60% corrupted version with salt and pepper noise,
and the reconstructed image using our proposed denoising algorithm. Figures 6.4, 6.5,
and 6.6 compare the performance of all three methods for the same corruption type
and level.
We performed experiments at various corruption levels for all three methods and
observed that as the corruption level increases (salt and pepper noise), the best PSNR
values decrease and the reconstruction becomes noisier. This is depicted in Figure 6.9.
Our method and You et al. [11] perform better at higher corruption levels than DIP,
while our method performs mostly better than the rest at corruption levels lower than
70%. At higher corruption levels You et al. [11] (or the DOP method) is slightly better,
but on average our method is more stable and consistent in performance than the rest. The
results reported in this chapter are all based on the l1 loss; even for DIP we use its l1 loss
variant, whereas the original paper [10] uses the MSE loss. Another important observation is
that You et al. [11]’s method, if used with the Adam optimizer, needs early termination to
stop the dip in PSNR values. The same trend is seen in our approach. Figures 6.7-6.8
depict this observation for the You et al. [11] (DOP) approach and show that the network
starts learning noise in the absence of early stopping: the reconstruction PSNR worsens
as training progresses. Regarding another novelty of our method - the linear layers
added between the encoder-decoder blocks - there are important observations on the
effect of increasing the number of linear layers. Jing et al. prove that adding more
linear layers increases the regularization effect; this claim was consistent with our
observations, along with another important result. On increasing the linear layers from 3
to 6, as the regularization effect increases, the time to reach the best PSNR decreases (as
shown in Figure 6.10): not only does the network become comparatively more stable,
the number of epochs needed to reach the best PSNR also decreases. Averaged over 3
separate runs, the numbers of epochs for 3 and 6 linear layers were 19625 and 15375,
respectively. For 9 linear layers, the average over 3 different trials was 16375, which is
fewer than for 3 layers but slightly more than for 6 layers. With a larger number of
trials, we might see a clearer trend. (Note: Figures 6.9-6.10 report results on the
F16-GT image.)
Figure 6.1: From top left to bottom right: (a) The images in top row show ground truth
image Lena and (b) its noisy counterpart using 60% corruption level for salt and pepper
noise. The bottom row images show (c) the real image, same as (a), (d) the noisy image,
same as (b), and (e) the reconstructed image using our approach.
Figure 6.2: From top to bottom: (a) the PSNR plot corresponding to Figure 6.1, and
(b) the loss plot (L1 loss) for the reconstruction process using our approach.
Figure 6.3: From top left to bottom: (a) The images in top row show ground truth image
F16-GT and (b) its noisy counterpart using 80% corruption level for salt and pepper
noise. The bottom row image shows (c) the reconstructed image using our approach.
The best PSNR achieved is 21.8671 dB. As we can see, for higher corruption levels we
get poorer performance.
Figure 6.4: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the
DIP-l1 approach [10]. The best PSNR achieved is 28.548 dB.
Figure 6.5: From top left to bottom right: (a) The images in top row show ground
truth image F16-GT and (b) its noisy counterpart using 50% corruption level for salt
and pepper noise. The bottom row shows (c) the original image, same as (a), (d) the
noisy image, same as (b), and (e) the reconstructed image using our approach.
The best PSNR achieved is 29.2449 dB.
Figure 6.6: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the You
et al. [11] approach (width = 128). The best PSNR achieved is 28.9 dB.
Figure 6.7: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the You
et al. [11] approach (width = 128) and the Adam optimiser (unless mentioned, the
optimizer is SGD). The reconstructed image becomes noisier as training continues
because the network starts learning noise in the absence of early termination.
Figure 6.8: (a) The PSNR plot corresponding to Figure 6.7. It clearly demonstrates
the need for early termination. The best PSNR achieved is 28.14 dB before the dip.
Figure 6.9: (a) PSNR plots for different corruption levels for each of the three methods
discussed in the last chapter. The line plots show the best PSNR levels across models
for different corruption rates. The plot supports our claim in the observations that with
increasing corruption rate, the best PSNR level reached by all three methods decreases.
Also, our method is the most consistent and stable in performance amongst the three.
Figure 6.10: The effect of the number of linear layers for our method as the number goes
from 3 to 9. For three independent trials, the average performance is depicted via the
dashed line; the number of epochs to reach the highest PSNR value decreases when we
increase the number of layers from 3 to 6, and from 6 to 9 layers we see a slight increase.
The corruption level for all three trials is 50%.
Chapter 7
Conclusion and Discussion
In this thesis work, we aimed to study the fundamental theory of image restoration
tasks, their causes, and artifacts. We started with the mathematical formulation of
basic image restoration tasks and dived into the conceptual theory of popular deep
learning methods such as CNNs, autoencoders, and GANs, which are building blocks of
many state-of-the-art denoising algorithms. We also reviewed different noise models, the
causal effect they have on restoration tasks, and the relevant perceptual quality measures
used to evaluate recovery performance, especially for image denoising. We attempted
to chronologically categorize the existing work in the image denoising field and analyze
the shortcomings of the classical spatial-domain methods that led to transform-domain
methods, followed by the current family of state-of-the-art NN-based methods built on
priors learned with deep neural networks. In chapter 4, we introduced the seminal work
by Ulyanov et al. [10] on deep image priors, which is the foundation of the current
learning-free untrained network methods. This work led to a shift in perspective from
conventional theory and motivated newer research ideas like [122] and [11]. We saw that
despite deep image priors’ competitive performance with non-local methods like BM3D,
there exist a number of non-trivial limitations that hinder a reliable adoption of this
technology in practical scenarios, for example, the need for early stopping. There was
also an incomplete understanding of the method’s generalization performance and its
quantification. Since 2017, we have seen growth in understanding of why CNNs and
overparameterized networks generalize so well, and of the hardness of training neural
networks. Since then, many have come up with interesting
ideas for improving the untrained-network paradigm and making it more robust to its
limitations. [11] proved that the algorithmic bias of discrepant learning rates in a doubly
over-parameterized network eliminates any need for tuning learnable parameters and
early stopping. We exploit this work and [141] to propose a new method for single-
image denoising, and argue why this methodology might work and be a step towards
building a generalized image denoising algorithm.
There are many limitations to this work. While we suggest a probable method-
ology exploiting the best of two ideas, as discussed in chapter 5, it is not without its
shortcomings. As we saw in [10], there is an explicit need for early stopping. This
problem does not exist for [11], but only if we use gradient descent as the optimizer;
for the Adam optimizer, the need for early stopping still exists, and this is also reflected
in our proposed method. Another major disadvantage is that [11] and [10] cited results
for impulse and Gaussian noise models, i.e., sparse corruptions (results can be extrapo-
lated to shot and speckle noise models), but these methods may not work for other
types of noise models (non-sparse corruptions) such as defocus blur, elastic noise, etc.
[151]. The DIP paper argues that its method can work for complex noise models in a
blind denoising process, but experimental analysis shows that the results are not com-
petitive compared to classical methods such as BM3D. Further, with more complex
noise models, SGD as an optimizer might not work, leaving us with a need for early
stopping yet again. For our proposed algorithm, we still need to explore its fallacies for
more noise models and corner cases, and we need to find a way to mitigate overfitting.
The results presented in the last chapter were part of a preliminary investigation and
did not demonstrate the generalization capacity of the algorithm.
Thus, we proposed a method that eliminates the need for learnable-parameter tun-
ing and early stopping, and gives better results compared to DIP or [11] for simple
noise models. But our method is still not robust to all 19 types of noise models in
[151]. Many recent works suggest interesting ideas like TV-regularized DIP [71] or
Bayesian DIP [122], and there is definite potential for combining these methods with
our approach of deep linear autoencoders with skip connections, algorithmic bias of
discrepant learning rates, and SGD. Thus, it is still an open challenge to develop a
generalized algorithm that handles multiple corruptions in a stable manner.
References
[1] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian.
Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE
Transactions on Image Processing, 16(8):2080–2095, 2007.
[2] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond
a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE
Transactions on Image Processing, PP, 08 2016.
[3] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and
Ronald M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and bench-
marks on weakly-supervised classification and localization of common thorax
diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Jul 2017.
[4] Fahad Shamshad, Muhammad Awais, Muhammad Asim, Zain ul Aabidin Lodhi,
Muhammad Umair, and Ali Ahmed. Leveraging deep stein’s unbiased risk esti-
mator for unsupervised x-ray denoising, 2018, 1811.12488.
[5] Wikipedia. Artificial neural network. https://en.wikipedia.org/wiki/
Artificial_neural_network#/media/File:Neuron3.png.
[6] Malcolm Sambridge. An introduction to Inverse Problems. http://web.gps.
caltech.edu/classes/ge193.old/lectures/Lecture1.pdf.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition, 2015, 1512.03385.
[8] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders, 2021, 2003.05991.
[9] Marc Lebrun. An analysis and implementation of the BM3D image denoising
method. Image Processing On Line, 2:175–213, 2012.
[10] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior.
International Journal of Computer Vision, 128(7):1867–1888, March 2020.
[11] Chong You, Zhihui Zhu, Qing Qu, and Yi Ma. Robust recovery via implicit bias
of discrepant learning rates for double over-parameterization, 2020. arXiv:2006.08857.
[12] Joanna J. Bryson. The past decade and future of AI's impact on society.
[13] Jack Copeland. The Cyc project.
[14] Rupali Ahuja Rajshree. A general review of image denoising techniques.
[15] Mukesh Motwani, Mukesh Gadiya, Rakhi Motwani, and Frederick Harris. Survey
of image denoising techniques. 2004.
[16] S M A Sharif, Rizwan Ali Naqvi, and Mithun Biswas. Learning medical image
denoising with deep dynamic residual attention network. Mathematics, 8(12),
2020.
[17] Dang Thanh, Surya Prasath, and Hieu Le Minh. A review on CT and X-ray image
denoising methods. Informatica, 43:151–159, 2019.
[18] Wikipedia. Deep learning. https://en.wikipedia.org/wiki/Deep_learning#
Deep_neural_networks.
[19] Awan-Ur-Rahman. What is artificial neural network and how it mimics the human
brain?
[20] MIT. Inverse problems. http://web.mit.edu/2.717/www/inverse.html.
[21] Encyclopedia of Mathematics. Ill-posed problems. https://encyclopediaofmath.
org/wiki/Ill-posed_problems.
[22] Wikipedia. Inverse problem. https://en.wikipedia.org/wiki/Inverse_
problem#:~:text=An%20inverse%20problem%20in%20science,measurements%
20of%20its%20gravity%20field.
[23] Tom Tirer and Raja Giryes. Image restoration by iterative denoising and backward
projections. IEEE Transactions on Image Processing, PP, 2017.
[24] Y. Bengio. Learning deep architectures for AI. Foundations, 2:1–55, 2009.
[25] Juergen Schmidhuber. Deep learning in neural networks: An overview. Neural
Networks, 61, 2014.
[26] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad
Alsaadi. A survey of deep neural network architectures and their applications.
Neurocomputing, 234, 2016.
[27] Kaiming He and Jian Sun. Convolutional neural networks at constrained time
cost, 2014. arXiv:1412.1710.
[28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway
networks, 2015. arXiv:1505.00387.
[29] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford
University Press, Inc., USA, 1995.
[30] W. Venables and B. Ripley. Modern Applied Statistics with S, fourth edition. 2002.
[31] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial
networks, 2014. arXiv:1406.2661.
[32] Wikipedia. Generative adversarial network. https://en.wikipedia.org/wiki/
Generative_adversarial_network.
[33] Jason Brownlee. A gentle introduction to generative adversarial networks
(GANs). https://machinelearningmastery.com/
what-are-generative-adversarial-networks-gans/.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford,
and Xi Chen. Improved techniques for training GANs, 2016. arXiv:1606.03498.
[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image
translation with conditional adversarial networks, 2018. arXiv:1611.07004.
Robustness in Deep Learning: Single Image Denoising using Untrained Networks

  • 1. Robustness in Deep Learning: Single Image Denoising using Untrained Networks A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Esha Singh IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science Ju Sun May, 2021
  • 2. © Esha Singh 2021 ALL RIGHTS RESERVED
  • 3. Acknowledgements I would first like to thank my advisor, Professor Ju Sun for providing me with an opportunity to be a part of his research lab and for his continuous support and guidance. This thesis work would not have been successful without his able advice, feedback and teachings. I would also like to thank Taihui Li for helping me with the experiments, his constant support, discussions, analysis and feedback, which helped me completion of my work and improving results. I would also like to thank the members of my thesis committee, Professor Hyun Soo Park and Professor Gilad Lerman. Finally, my deep and sincere gratitude to my family and friends for their uncondi- tional and unparalleled love and support. i
  • 4. Dedication To my mother and father, friends, and colleagues who have mentored and held me up along the way. ii
  • 5. Abstract Deep Learning has become one of the cornerstones of today’s AI advancement and research. Deep Learning models are used for achieving state-of-the-art results on a wide variety of tasks, including image restoration problems, specifically image denoising. Despite recent advances in applications of deep neural networks and the presence of a substantial amount of existing research work in the domain of image denoising, this task is still an open challenge. In this thesis work, we aim to summarize the study of image denoising research and its trend over the years, the fallacies, and the brilliance. We first visit the fundamental concepts of image restoration problems, their definition, and some common misconceptions. After that, we attempt to trace back where the study of image denoising began, attempt to categorize the work done till now into three main families with the main focus on the neural network family of methods, and discuss some popular ideas. Consequently, we also trace related concepts of over-parameterization, regularisation, low-rank minimization and discuss recent untrained networks approach for single image denoising, which is fundamental towards understanding why the current state-of-art methods are still not able to provide a generalized approach for stabilized image recovery from multiple perturbations. iii
  • 6. Contents Acknowledgements i Dedication ii Abstract iii List of Tables vii List of Figures viii 1 Introduction 1 1.1 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 Image Restoration Problem Formulation . . . . . . . . . . . . . . 7 2.3 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 ResNets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.1 Noise models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 iv
  • 7. 2.6 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.1 MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.2 PSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.3 SSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Image Denoising Algorithms: Review 15 3.1 Spatial domain methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Spatial domain filtering . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.2 Variational denoising methods . . . . . . . . . . . . . . . . . . . 16 3.2 Transform domain methods . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Data adaptive methods . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.2 Non Data adaptive methods . . . . . . . . . . . . . . . . . . . . . 18 3.2.3 Block-matching and 3D filtering: BM3D . . . . . . . . . . . . . . 19 3.3 Deep Neural Network methods . . . . . . . . . . . . . . . . . . . . . . . 20 4 Deep Image Prior 22 4.1 Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2 Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.2 Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4 Limitations of Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . 28 5 Rethinking Single Image Denoising 30 5.1 Over-parameterisation in deep learning . . . . . . . . . . . . . . . . . . . 30 5.1.1 Overparameterisation v/s over-fitting? . . . . . . . . . . . . . . . 31 5.1.2 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 Low-rank matrix recovery problem . . . . . . . . . . . . . . . . . . . . . 32 5.3 Rethinking Single Image denoising: Main Ideas . . . . . . . . . . . . . . 
33 5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates . 34 5.3.2 Implicit Rank-Minimizing Autoencoder . . . . . . . . . . . . . . 36 5.4 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 v
  • 8. 6 Preliminary Experiments 39 6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.3 Hyper-parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.4 Results and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 40 7 Conclusion and Discussion 52 References 54 Appendix A. Glossary and Acronyms 68 vi
  • 9. List of Tables A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 vii
  • 10. List of Figures 1.1 Performance of existing medical image denoising methods in removing image noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c)Result obtained by DnCNN [2]. Source by: (https://www.kaggle.com/mateuszbuda/ lgg-mri-segmentation). . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with standard deviation of 25. Left to Right (a) noisy image (b) denoisied image using an unsupervised cleaning. Source: Shamshad et al. [4]. . . . 4 2.1 ANN inspired by biological neural networks. The inputs are denoted by x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon terminal ends. Source: Wikipedia[5]. . . . . . . . . . . . . . . . . . . . . 6 2.2 Inverse problem. Source: caltech GE193[6] . . . . . . . . . . . . . . . . . 7 2.3 Residual learning: a building block. (Source: He et al. [7]) . . . . . . . . 8 2.4 An example of an autoencoder. The input image is encoded to a com- pressed representation and then decoded. (Source: Bank et al. [8]) . . . 11 3.1 Scheme of the BM3D algorithm. (credits: Marc Lebrun [9]) . . . . . . . . . . 19 4.1 Image space visualization for DIP. Assume the problem of reconstructing an image xgt from a degraded measurement x0. The image exemplified by denoising, the ground truth xgt has non-zero cost E(xgt, x0) > 0. Here, if run for long enough, fitting with DIP will acquire a solution with near zero cost quite distant from xgt. However, often the optimization path will pass close to xgt, and an early stopping (here at time t3) will recover good solution. Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . 25 viii
  • 11. 4.2 Figure depicting image restoration process using DIP. Starting from a random weight θ0, one must iteratively update them in order to minimize the data term eq. (4.3). At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a fixed tensor and the mapping f is a neural network with parameters θ. The image x is used to calculate the task-dependent loss E(x, x0). The loss gradient w.r.t. the weights θ is then calculated and used to update the parameters. Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1 Architecture used in [10] and also the base architecture for You et al. [11]. The hourglass (also known as decoder-encoder architecture. It sometimes has skip connections represented in yellow. nu[i], nd[i], ns[i] correspond to the number of filters at depth i for the upsampling, downsampling, and skip-connections respectively. The values ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al. [10] . . . . . . . . . 36 6.1 From top left to bottom right: (a) The images in top row show ground truth image Lena and (b) its noisy counterpart using 60% corruption level for salt and pepper noise. The bottom row images show (c) Real image same as (a), (d) noisy image same as (b) and the, (e) reconstructed image using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.2 From top to bottom: (a) The image in top shows PSNR plot for cor- responding Figure 6.1, (b) loss plot (L1 loss) for reconstruction process using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 6.3 From top left to bottom: (a) The images in top row show ground truth image F16-GT and (b) its noisy counterpart using 80% corruption level for salt and pepper noise. The bottom row image shows (c) the recon- structed image using our approach. The best PSNR achieved is 21.8671. As we can see for higher corruptions we get poorer performance. . . . . 
6.4 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the DIP-l1 approach of [10]. The best PSNR achieved is 28.548 dB.
6.5 From top left to bottom right: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the original image, same as (a), (d) the noisy image, same as (b), and (e) the reconstructed image using our approach. The best PSNR achieved is 29.2449 dB.
6.6 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the approach of You et al. [11] (width = 128). The best PSNR achieved is 28.9 dB.
6.7 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the approach of You et al. [11] (width = 128) and the ADAM optimiser (unless mentioned, the optimizer is SGD). The reconstructed image becomes noisier as training continues because, in the absence of early termination, the network starts learning the noise.
6.8 (a) The PSNR plot corresponding to Figure 6.7. It clearly demonstrates the need for early termination. The best PSNR achieved is 28.14 dB before the dip.
6.9 (a) PSNR plots for different corruption levels for each of the three methods discussed in the last chapter. The line plots show the best PSNR levels across models for different corruption rates. The plot supports our observation that with increasing corruption rate the best PSNR level reached by all three methods decreases.
Also, our method is the most consistent and stable in performance among the three methods.
6.10 The plot depicts the effect of the number of linear layers for our method as the number goes from 3 to 9. For three independent trials, the average performance is depicted via the dashed line; the number of epochs to reach the highest PSNR value decreases when we increase the number of layers from 3 to 6, and from 6 to 9 layers we see a slight increase. The corruption level for all three trials is 50%.
Chapter 1

Introduction

Artificial Intelligence (AI) is the demonstration of intelligence by machines, in contrast to the natural intelligence exhibited by humans, which involves consciousness and emotionality. For two decades, AI has been at the helm of revolutionary changes in industrial and academic research and development. The past decade, and notably the past few years, has been transformative for AI, not so much in terms of what we can do with this technology (theoretical) as what we are doing with it (applied) [12]. The main ideology behind AI has been the faithful emulation of human intelligence and an attempt to endow machines with coherent reasoning. Can machines gain common sense? There are two fundamental paradigms for this problem. One is to provide a comprehensive set of facts and rules encoding human knowledge (an undertaking by the Cyc Project since 1984 [13]). The other is to facilitate the self-learning process of machines, similar to how humans develop common sense. The latter approach has shown great promise, with Deep Learning (DL) being the dominant tool for helping machines gain perception of the world around them. With the abundance of work that exists in this sphere, it is not difficult to experience the power as well as the various limitations of this technology. The lack of generality, data bias, fairness, and robustness to unforeseen situations are some of the well-known challenges in this field. The aim of this thesis is to focus on one such challenge: robust image recovery. It is a relevant issue that is still an open problem and is omnipresent in real-life situations. Self-driving vehicles, digital photography, medical image analysis, remote sensing, surveillance, and
digital entertainment are a few of the applications where, due to unprecedented susceptibilities, existing solutions might not perform as expected. Robustness against natural corruptions and robustness in medical problems are some of the non-trivial open challenges in the sphere of robustness in deep learning, and to tackle such big issues it is advantageous to break them into smaller sub-tasks. Thus, a small step towards handling those situations is to solve the classical yet active problem of robust image recovery under synthetic noise models. If one can solve this problem reliably, it can give us insight into how to work our way towards the more significant hurdles.

One of the rudimentary challenges in the field of image processing and computer vision is image denoising, where the underlying goal is to approximate the actual image by suppressing noise from a noise-contaminated version of the image. Image noise may be caused by several intrinsic (i.e., sensor) and extrinsic (i.e., environment) conditions which are often impossible to avoid in practical situations. Image denoising is a fundamental yet active problem and remains unsolved because noise removal introduces artifacts and unwanted effects such as blurring of the images. The focus of this thesis is to summarize the fundamental concepts behind image denoising tasks, existing work in the field, and qualitative analysis of state-of-the-art methods for this task, and finally to present a probable approach with supportive arguments and preliminary experiments undertaken during the course of this research work.

1.1 Application

Digital images play an essential role both in daily-life applications such as satellite television, medical imaging, and computer tomography, as well as in areas of research and technology such as geographical information systems and astronomy. So it is not difficult to gauge the importance of recovering precise images.
Image recovery is the first and crucial step before images can be analyzed or used further. Thus, image denoising plays a vital role in a wide range of applications such as image restoration, image registration, visual tracking, image segmentation, and image classification, where obtaining the original image content is crucial for strong performance. It is important to develop effective denoising techniques in order to compensate for data corruption, which is introduced when data is collected by imperfect instruments that are generally contaminated by
noise, issues with the data acquisition process [14], and interceding natural phenomena [15].

An important practical application of image denoising is in the medical sciences. Medical images obtained from MRI are the most common tool for diagnosis in medicine and are often influenced by random noise arising in the image acquisition process. Hence, noise removal is essential in medical imaging applications in order to enhance and recover fine-grained details that may be hidden in the data. Medical imaging modalities, including X-rays, Magnetic Resonance Imaging (MRI), Computer Tomography (CT), ultrasound, etc., are susceptible to noise for the reasons discussed in the last section. Hence, it is important to recover the original, high-quality, noiseless images. Image denoising in the field of medicine is referred to as medical image denoising (MID) and is the process of improving the perceptual quality of degraded noisy images captured with specialized medical image acquisition devices [16]. Figure 1.1 is an example of how existing MID methods exhibit deficiencies in large-scale noise removal from medical images and fail in numerous cases [16]. Another use case for medical imaging applications concerns X-ray images. X-ray images provide crucial support for diagnosis and decision-making in several diverse clinical applications. However, X-ray images may be corrupted by statistical noise, gravely deteriorating their quality and raising the difficulty of diagnosis [17][4]. Therefore, X-ray denoising is necessary for improving the quality of raw X-ray images and their relevant clinical information content and analysis.

Figure 1.1: Performance of existing medical image denoising methods in removing image noise. (a) Noisy input, (b) result obtained by BM3D [1], (c) result obtained by DnCNN [2]. Source: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation.
Figure 1.2: Denoising results for the Chest X-Ray Dataset [3] for Gaussian noise with standard deviation of 25. Left to right: (a) noisy image, (b) denoised image using unsupervised cleaning. Source: Shamshad et al. [4].

1.2 Thesis Overview

The rest of the thesis is organized as follows:

• Chapter 2 briefly presents the basic concepts and terminologies used in image restoration studies, which are used throughout the thesis.
• Chapter 3 presents a comprehensive survey of denoising algorithms developed up to 2017.
• Chapter 4 describes the Deep Image Prior (DIP) concept, its limitations, and related work developed on the DIP ideology.
• Chapter 5 presents an alternative perspective on image denoising with the help of two important ideas, which are discussed in detail. Finally, the proposed methodology is introduced.
• Chapter 6 details the experimental setup and analysis for the proposed single-image denoising methodology.
• Chapter 7 presents the conclusion and discusses some future work directions.
Chapter 2

Background

In this chapter, we briefly summarize the fundamental concepts and definitions pivotal to understanding the rest of the thesis, which we will revisit frequently in further chapters.

2.1 Deep Learning

Deep Learning is a sub-domain under the umbrella of machine learning concerned with algorithms that aim to imitate the structure and functionality of the human brain, called artificial neural networks (ANNs), combined with representation learning [18]. More specifically, a neural network is inspired by a neuron; in machine learning, it is an information processing technique that uses the same concept as biological neural networks but is not identical to it [19], the analogy being shown in Figure 2.1. There are multiple deep learning architectures, such as deep neural networks (DNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), that have been applied to various fields, including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bioinformatics, where they have produced remarkable results comparable to or surpassing human expert performance [18]. The "learning" in the terminology of "deep learning" can be either supervised, unsupervised, or semi-supervised, whereas the term "deep" refers to the number of layers through which the input is transformed.
Figure 2.1: ANN inspired by biological neural networks. The inputs are denoted by x1, x2...xn at the dendrites and the outputs by y1, y2...yn at the axon terminal ends. Source: Wikipedia [5].

2.2 Inverse Problems

An inverse problem is the procedure of calculating, from a set of observations, the causal factors that produced them: for example, calculating the density of the Earth from measurements of its gravity field [20]. If the information given by a measurement is incomplete (incorrect or improper), then a problem is ill-posed [21]. Thus, inverse problem theory tries to quantify when a problem is ill-posed and to what degree, and to extract maximum information under practical circumstances. Inverse problems occur in many applications, such as image denoising, image deblurring, inpainting, super-resolution, etc. [22]. Figure 2.2 depicts a forward problem with respect to an inverse problem.

2.2.1 Image Denoising

Image denoising refers to the removal of noise from a noisy image, so as to restore the true image. The aim is to recover meaningful information from noisy images and obtain high-quality images in the process of noise removal, which is an important open research problem. The primary reason is that, from a mathematical perspective, image denoising is an inverse problem and its solution is not unique.

Also, image restoration and image denoising are different terminologies, where image
Figure 2.2: Inverse problem. Source: Caltech GE193 [6].

denoising is a type of image restoration problem. There are several types of image restoration problems, for example super-resolution and inpainting, and image denoising is one of them. Therefore, algorithms that solve image restoration problems are also applicable to image denoising problems. The work in this thesis is centered around image denoising.

2.2.2 Image Restoration Problem Formulation

The problem of image restoration can be traditionally formulated as [23]

y = Hx + c    (2.1)

where x ∈ R^n represents the unknown original image, y ∈ R^m represents the observations, H is an m×n degradation matrix, and c ∈ R^m is a vector of i.i.d. (independent and identically distributed) Gaussian random variables with zero mean and standard deviation σc. Thus, Equation 2.1 can represent different image restoration problems: it represents an image denoising problem when H is the n×n identity matrix In, image inpainting when H is a selection of m rows of In, and image deblurring when H is a blurring operator [23].
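As a minimal NumPy sketch of eq. (2.1), the three special cases above can be instantiated on a toy flattened image (the variable names, mask, and blur kernel are illustrative assumptions, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 16                                   # pixels in a (flattened) toy image
x = rng.uniform(0.0, 1.0, size=n)        # unknown original image x in R^n
sigma_c = 0.05                           # noise standard deviation

# Denoising: H is the n x n identity matrix I_n (m = n).
H_denoise = np.eye(n)

# Inpainting: H is a selection of m rows of I_n (here, 6 observed pixels).
observed = np.array([0, 2, 3, 5, 8, 13])     # hypothetical observation mask
H_inpaint = np.eye(n)[observed]              # m x n with m = 6

# Deblurring: H is a blurring operator (here a circulant 3-tap average).
kernel = np.array([0.25, 0.5, 0.25])
H_blur = np.array([np.roll(np.pad(kernel, (0, n - 3)), i - 1) for i in range(n)])

def degrade(H, x, sigma, rng=rng):
    """y = Hx + c with c ~ N(0, sigma^2 I_m), as in eq. (2.1)."""
    return H @ x + sigma * rng.standard_normal(H.shape[0])

y_denoise = degrade(H_denoise, x, sigma_c)   # same size as x
y_inpaint = degrade(H_inpaint, x, sigma_c)   # only the 6 observed values
y_blur = degrade(H_blur, x, sigma_c)         # blurred and noisy
```

Recovering x from any of these y vectors is the corresponding inverse problem; only H changes between the three tasks.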
2.3 Deep Neural Networks

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers [24][25][18]. DNNs, which employ deep architectures, can represent functions of higher complexity as the number of units in a layer and the number of layers are increased [26]. Given enough labeled training data and suitable models, deep learning approaches can help humans establish mapping functions for operational convenience. In this work, which pivots around understanding the theoretical foundations of one such deep learning network, we will mention the use of CNNs and ResNets. The latter is detailed in the sections below.

2.3.1 ResNets

ResNet, short for Residual Network, is a specific type of neural network introduced in 2015 by He et al. [7]. They introduced a residual learning framework to ease the training of networks substantially deeper than those used prior to 2015, explicitly reformulating the layers as learning residual functions with respect to the layer inputs instead of learning unreferenced functions [7].

Figure 2.3: Residual learning: a building block. (Source: He et al. [7])

The popularity of ResNets stems from the fact that they solved a big open challenge. When deeper networks are able to start converging, a degradation problem is exposed: with increasing network depth, accuracy saturates and then drops rapidly [7]. Surprisingly, such degradation is not caused by overfitting. Adding more layers to an appropriately deep model leads to higher training error, as reported in
[27][28]. In 2015, He et al. [7] empirically showed that there is a maximum threshold for depth with the traditional CNN model. Hence, [7] solved the degradation problem by introducing a deep residual learning framework whose basic unit, the residual block, is depicted in Figure 2.3. Instead of assuming that each set of a few stacked layers directly fits a desired underlying mapping, they explicitly let these layers fit a residual mapping. Practically, this idea is realised by feed-forward neural networks with connections that skip one or two layers [7][29][30], called "shortcut connections". These shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Figure 2.3). One of the biggest advantages of this approach is that these identity shortcut connections add neither extra parameters nor computational complexity. Thus, ResNets are heavily used to solve a variety of modern-day tasks.

2.3.2 GANs

Goodfellow et al., 2014 [31] proposed a new generative model estimation procedure, the adversarial nets framework, in which a generative model is pitted against an adversary: a discriminative model that learns to ascertain whether a sample comes from the model distribution or the data distribution. This is analogous to a team of counterfeiters trying to produce fake items and police trying to detect the counterfeits. The competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles. Thus, two neural networks compete with each other (in the form of a zero-sum game, where one network's gain is another network's loss) [32], and this model architecture or framework is called a Generative Adversarial Network (GAN).
More formally, the GAN architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether examples are real, from the domain, or fake, produced by the generator [33]. The generator is used to generate new plausible examples from the problem domain; the discriminator classifies examples as real (from the domain) or fake (generated). Although originally proposed as a form of generative model for unsupervised learning, GANs have also proven effective for semi-supervised learning [34], fully supervised learning [35], and reinforcement learning [36]. A more standardized approach for the GAN
framework, called Deep Convolutional Generative Adversarial Networks (DCGAN), which led to more stable models, was later formalized by Alec Radford et al. [37] in 2015.

2.4 Autoencoder

An autoencoder is a special type of neural network (NN) designed mainly to encode the input into a compressed, meaningful representation and then decode it back such that the reconstructed input is as similar as possible to the original [8]. Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, one designs a neural network architecture that imposes a bottleneck in the network, forcing a compressed knowledge representation of the original input. If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (i.e., correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network's bottleneck [38]. Autoencoders were first introduced in the 1980s by Hinton and the PDP group [39] as an NN trained to reconstruct its input. Mathematically, their main task of learning an "informative" representation of the data that can be used for various purposes can be formally defined [8][40] as learning functions A : R^n → R^p and B : R^p → R^n that satisfy:

argmin_{A,B} E[Δ(x, B ∘ A(x))]    (2.2)

where E is the expectation over the distribution of x, and Δ is the reconstruction loss function, which measures the distance between the output of the decoder and the input. The loss function is usually set to be the l2-norm [40]. Usually, A and B are neural networks [41]. For the special case where A and B are linear operations, the model is called a linear autoencoder [42][40].
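The objective in eq. (2.2) can be sketched for the linear case with plain NumPy: A and B are matrices, Δ is the squared l2-norm, and both maps are fitted by gradient descent on toy data (all names, dimensions, and hyperparameters here are illustrative assumptions, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, p = 200, 8, 2                # samples, input dim, bottleneck dim (p < n)
# Toy data with structure: points near a 2-D subspace of R^8, plus small noise.
X = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))
X += 0.01 * rng.standard_normal((m, n))

W_enc = 0.1 * rng.standard_normal((n, p))   # A : R^n -> R^p
W_dec = 0.1 * rng.standard_normal((p, n))   # B : R^p -> R^n

def recon_loss(X, W_enc, W_dec):
    """Empirical version of E[||x - B(A(x))||^2] from eq. (2.2)."""
    X_hat = X @ W_enc @ W_dec
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

lr = 0.01
loss_start = recon_loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                   # encode
    X_hat = Z @ W_dec               # decode
    G = 2.0 * (X_hat - X) / m       # d(loss)/d(X_hat)
    W_dec -= lr * Z.T @ G           # gradient step on the decoder
    W_enc -= lr * X.T @ (G @ W_dec.T)   # gradient step on the encoder
loss_end = recon_loss(X, W_enc, W_dec)  # should be well below loss_start
```

Because A and B are linear and the loss is squared error, the learned bottleneck spans the same subspace PCA would find on this data.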
A linear autoencoder (i.e., one with no non-linear operations) attains the same latent representation as Principal Component Analysis (PCA) [43]. Therefore, an autoencoder is a generalization of PCA: instead of finding a low-dimensional hyperplane in which the data lie, it is able to learn a non-linear manifold [44]. Thus, while conceptually simple, autoencoders are quite popular and play an important role in machine
learning. Autoencoders can be trained gradually, layer by layer, or end-to-end. In the layer-by-layer case, the trained layers are "stacked" together, which leads to a deeper encoder. In [45], this is done with convolutional autoencoders, and in [46] with denoising autoencoders. We will revisit autoencoders in chapters 5 and 6.

Figure 2.4: An example of an autoencoder. The input image is encoded to a compressed representation and then decoded. (Source: Bank et al. [8])

2.5 Noise

Image noise is a random variation of color information or brightness in images and is generally an aspect of electronic noise [47]. It adds unwanted information to digital images and obscures the desired information. Noise produces undesirable effects such as artifacts, unrealistic edges, unseen lines, corners, and blurred objects. There are multiple sources of noise in images, arising from various stages such as image acquisition, transmission, and compression [48].

2.5.1 Noise models

There are different types of noise models; we mention four popular ones: Gaussian, salt-and-pepper, shot, and speckle noise. Different processing algorithms exist for different types of noise models. For any input image, the noisy image under additive noise is modeled as

g(x) = I(x) + v(x)    (2.3)
where I(x) is the original image without any noise, v(x) is the additive noise model, g(x) is the noisy input image, and x is the set of pixels in the input image.

1. Gaussian noise: Gaussian noise generally arises in the analog signal in the electronics of the camera. It can be modeled as additive noise acting on the input image I to produce a degraded image y:

y = I + ση,  η ∼ N(0, 1)    (2.4)

where σ is the standard deviation [49][50][51]. An example of a denoising algorithm for this type of noise is Gaussian filtering.

2. Salt-and-pepper noise: this impulse noise corresponds to random pixels which are either saturated or turned off. It can arise in equipment with electronic spikes, and we can model it as:

y = I with probability p,  b with probability 1 − p    (2.5)

where b ∼ Ber(0.5) is a Bernoulli variable with parameter 0.5 (corrupted pixels are set to the minimum or maximum value with equal probability). Algorithms used for image recovery from this type of noise include median filtering and mean filtering [52][47].

3. Shot noise: photon shot noise is the dominant noise in the brighter parts of an image from an image sensor and is typically caused by statistical quantum fluctuations, i.e., variation in the number of photons observed at a given exposure level [47]. The root-mean-square value of shot noise is proportional to the square root of the image intensity, and the noise at different pixels is independent. This noise model follows a Poisson distribution, which, except at very low intensity levels, approximates a Gaussian distribution.

4. Speckle noise: a granular noise that exists inherently in an image and corrupts its quality. This noise can be generated by multiplying random values with the pixel intensities of an image [48]. A fundamental challenge in optical and digital holography is the presence of speckle noise in the image reconstruction process.
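The additive models in eqs. (2.3)–(2.5) can be sketched as follows; the function names and the assumption of intensities in [0, 1] are illustrative choices, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

def add_gaussian_noise(I, sigma):
    """y = I + sigma * eta with eta ~ N(0, 1), as in eq. (2.4)."""
    return I + sigma * rng.standard_normal(I.shape)

def add_salt_pepper_noise(I, p):
    """Keep each pixel with probability p; otherwise replace it with
    0 (pepper) or 1 (salt) via a Bernoulli(0.5) draw, as in eq. (2.5)."""
    keep = rng.random(I.shape) < p
    b = (rng.random(I.shape) < 0.5).astype(I.dtype)   # b ~ Ber(0.5), values {0, 1}
    return np.where(keep, I, b)

I = rng.uniform(0.2, 0.8, size=(64, 64))   # toy image, intensities in [0, 1]
y_gauss = add_gaussian_noise(I, sigma=0.1)
y_sp = add_salt_pepper_noise(I, p=0.5)     # about half the pixels corrupted
```

Note that with p = 0.5 roughly half the pixels survive untouched, which is why median-type filters (which ignore extreme values in a neighborhood) work well on this noise.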
2.6 Evaluation metrics

The aim of a denoising algorithm is to recover the original image as faithfully as possible from its noise-corrupted version. To evaluate denoising algorithms, different image quality assessment measurements have been adopted to compare the denoised estimate against the ground-truth high-quality image. Below, three popular quantitative measurements are discussed, among which PSNR is the most commonly used metric.

2.6.1 MSE

The Mean Squared Error (MSE) of an estimator (of a process for estimating an unobserved quantity) measures the average squared difference between the actual and estimated values. MSE is equivalent to the expected value of the squared error loss and signifies the quality of an estimator. It is always non-negative, and values closer to zero are better. For a given noise-free m × n monochrome image I and its noisy approximation K, MSE is mathematically defined as [53]

MSE = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]²    (2.6)

2.6.2 PSNR

Peak signal-to-noise ratio (PSNR) is the ratio between the maximum possible power of a signal and the power of the contaminating noise that affects the fidelity of its representation [53]. PSNR is defined via the mean squared error (MSE). Given the ground-truth image I and denoised estimate K, based on MSE, the definition of PSNR is:

PSNR = 10 log10(MAX_I² / MSE)    (2.7)

In the above equation, MAX_I is the maximum possible pixel value of the image. This value is 255 when the pixels are represented using 8 bits per sample. While both MSE and PSNR are well accepted and heavily used in several applications, they do not correlate well with the visual perception of the human visual system, which is highly non-linear and complex [54][55][56]. Thus, they are not
a good fit to measure the perceptual similarity between two images. Yet, PSNR is still the most commonly used index to compare two images.

2.6.3 SSIM

Besides MSE and PSNR, perceptual quality measurements have also been proposed to evaluate denoising algorithms. One representative measurement is the structural similarity (SSIM) index [57]. SSIM is a procedure for estimating the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos, and is used for measuring the similarity between two images [58]. The SSIM index can be calculated on various windows of an image. The measure between two windows x and y of common size N × N is [58]:

SSIM(x, y) = [(2μxμy + c1)(2σxy + c2)] / [(μx² + μy² + c1)(σx² + σy² + c2)]    (2.8)

where μx, μy are the means of x and y, σx², σy² are the variances of x and y, σxy is the covariance, c1 = (k1 L)² and c2 = (k2 L)² are two variables to stabilize the division with a weak denominator, L is the dynamic range of the pixel values, and k1 = 0.01 and k2 = 0.03 by default [57][58]. The SSIM formula above is based on three comparison measurements between the samples x and y: luminance (l), contrast (c), and structure (s).

l(x, y) = (2μxμy + c1) / (μx² + μy² + c1)    (2.9)
c(x, y) = (2σxσy + c2) / (σx² + σy² + c2)    (2.10)
s(x, y) = (σxy + c3) / (σxσy + c3)    (2.11)

where c3 = c2/2. Thus, using the above three definitions, equation 2.8 can be rewritten as [58]:

SSIM(x, y) = l(x, y)^α · c(x, y)^β · s(x, y)^γ    (2.12)

and setting the weights α = β = γ = 1 recovers the form of equation 2.8.
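Eqs. (2.6)–(2.8) can be sketched directly in NumPy; the SSIM here is evaluated on a single window for simplicity (real implementations, as noted above, slide a window over the image and average the local scores), and all function names are illustrative:

```python
import numpy as np

def mse(I, K):
    """Eq. (2.6): mean squared error between images I and K."""
    return np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)

def psnr(I, K, max_i=255.0):
    """Eq. (2.7): PSNR in dB (infinite when the images are identical)."""
    m = mse(I, K)
    return np.inf if m == 0 else 10.0 * np.log10(max_i ** 2 / m)

def ssim_window(x, y, L=255.0, k1=0.01, k2=0.03):
    """Eq. (2.8) on one window of pixels x, y."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Worked example: an all-black vs. an all-white 8-bit image.
I = np.zeros((8, 8))
K = np.full((8, 8), 255.0)
# mse(I, K) = 255^2, so psnr(I, K) = 10 * log10(255^2 / 255^2) = 0 dB.
```

An identical pair gives SSIM = 1 and PSNR = ∞, which makes these metrics convenient sanity checks for a denoising pipeline.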
Chapter 3

Image Denoising Algorithms: Review

In this chapter we attempt to capture the research work and methods developed to date for the open challenge of image denoising. There exist several ways to classify existing image denoising algorithms. Three popular approaches are:

• Inspired by image processing concepts [59]: spatial domain, transform domain, and neural network (NN) based methods
• Based on popular families in image restoration methods [10]: learning-based methods and learning-free methods
• Based on how an image prior is exploited to generate a high-quality estimate with respect to an input image [57]: implicit and explicit methods

Selecting the most intuitive categorization of the three, we classify existing denoising algorithms into spatial domain, transform domain, and NN based methods. Furthermore, we discuss the prior work in chronological order; as the classical spatial and transform domain algorithms have been thoroughly reviewed in previous papers [15], [60], we focus more on recently proposed NN based algorithms.
3.1 Spatial domain methods

Spatial domain techniques are traditional denoising methods applied directly to images in the form of spatial filters for noise removal [61]. Spatial domain methods can be further sub-categorized into spatial domain filtering (SDF) and variational denoising methods [59].

3.1.1 Spatial domain filtering

Spatial domain filtering methods can also be grouped as implicit methods as per point (3) above, and they can be divided into two classes: linear and non-linear filtering. Linear filters tend to blur sharp edges, destroy lines and other fine image details, and perform poorly in the presence of signal-dependent noise [59]. For example, a mean filter (a linear filter) is optimal for Gaussian noise in the sense of mean squared error, but it tends to over-smooth images with high noise. The Wiener filter was introduced to combat this disadvantage, but it too can easily blur sharp edges. By contrast, with non-linear filters such as median filtering [62][63] and weighted median filtering [64], noise can be suppressed without explicit identification. For example, bilateral filtering [65] is widely used for image denoising, as it is a non-linear, edge-preserving, noise-reducing smoothing filter. SDF methods, in general, adopt priors of high-quality images implicitly, where the priors are ingrained into specific restoration operations. Such an implicit modeling strategy was used in most early image denoising algorithms, some of which are discussed above [65][66][67][57]. Based on assumptions about high-quality images, heuristic operations have been designed to generate estimates directly from the degraded images; filtering-based methods, for example, rely on the smoothness assumption.
3.1.2 Variational denoising methods

Besides implicitly embedding priors into restoration operations, variational denoising methods explicitly characterize image priors and subsequently use the Bayesian framework to produce high-quality reconstruction results. Given the degradation model p(y|x) and a specific prior model p(x), different estimators can be used to estimate the latent image
x. One popular approach is the maximum a posteriori (MAP) estimator, where

x̂ = arg max_x p(x|y) = arg max_x p(y|x)p(x)    (3.1)

(using Bayes' theorem), with which we seek, given the corrupted observation and prior, the most probable estimate of x. In the case of AWGN (Additive White Gaussian Noise), equation 3.1 can be reformulated as an objective function comprising a data fidelity term (least squares) and a regularizer (discussed in more detail in later chapters). Thus, for variational denoising methods, the key is to find a suitable image prior; some successful prior models include gradient priors, non-local self-similarity (NSS) priors, sparse priors, and low-rank priors [59].

Total variation (TV) regularization

TV regularization is based on the statistical fact that natural images are locally smooth and the pixel intensity varies gradually in most regions [59]. TV regularization uses a Laplacian distribution to model image gradients, resulting in an l1-norm penalty on the gradients of the estimated image. Mathematically, it can be defined as [68]:

R_TV(x) = ||∇x||_1

where ∇x is the gradient of x. Total variation is one of the most extensively used image priors: it promotes sparsity in image gradients, allows effective calculation of the optimal solution, and retains sharp edges. Although it has been shown to be beneficial in a number of applications [69][70][71] and is one of the most notable methods for image denoising, it has a few limitations. The three main disadvantages are that 1) textures tend to be over-smoothed, 2) flat areas are approximated by a piecewise-constant surface, resulting in a stair-casing effect, and 3) the resultant image suffers from losses of contrast [72][68][73][59].
Extensive studies have sought to improve the performance of the TV regularizer in image smoothing by adopting partial differential equations, while others have proposed wavelet filters for analysis sparsity [74][75]. Beck et al. [76] proposed a fast gradient-based method for constrained TV that also provides a generic framework covering other types of non-smooth regularizers. [77] proposed a statistical model that captures the heavy-tailed distribution of coefficients via a robust penalty function (the lp norm), and [78] introduced a normalized sparsity measure.
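To make the TV-regularized objective concrete, here is a minimal sketch (NumPy, restricted to a 1-D signal for readability) of plain gradient descent on the least-squares-plus-TV objective that arises from the MAP reformulation above. The smoothing constant eps, step size, and weight lam are illustrative choices, not values from the cited works.

```python
import numpy as np

def tv_denoise_1d(y, lam=1.0, step=0.004, iters=5000, eps=1e-4):
    """Gradient descent on  0.5*||x - y||^2 + lam * sum_i |x[i+1] - x[i]|,
    with |t| smoothed as sqrt(t^2 + eps) so the gradient exists everywhere."""
    x = y.copy()
    for _ in range(iters):
        d = np.diff(x)
        w = d / np.sqrt(d * d + eps)   # derivative of the smoothed |.|
        g = np.zeros_like(x)
        g[1:] += w                     # +phi'(x[k] - x[k-1])
        g[:-1] -= w                    # -phi'(x[k+1] - x[k])
        x -= step * ((x - y) + lam * g)
    return x

# Piecewise-constant signal plus noise: TV should recover the flat pieces.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.2 * rng.standard_normal(100)
denoised = tv_denoise_1d(noisy)
```

On this toy signal the output is nearly piecewise constant, and the mild shrinkage of the jump illustrates the loss of contrast listed among TV's disadvantages.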
3.2 Transform domain methods

In contrast to spatial-domain filtering methods, transform-domain (TD) filtering methods first transform the given noisy image to another domain and then apply a denoising procedure to the transformed image based on the differing characteristics of image and noise (larger coefficients represent the high-frequency part, i.e., the details and edges of the image, whereas smaller coefficients represent the noise). These methods rest on the key observation that image information and noise have different characteristics in the transform domain. Transform-domain filtering methods can be further subdivided according to the chosen basis transform functions, which may be data-adaptive or non-data-adaptive [79].

3.2.1 Data-adaptive methods

Examples of data-adaptive transforms are independent component analysis (ICA) [80][86] and PCA [81][82], both of which are applied as transform tools to the given noisy images. A main disadvantage of data-adaptive methods is their high computational cost, since they use sliding windows and need a sample of noise-free data, or at least two image frames of the same scene [59]. In some applications, however, it can be challenging to obtain noise-free training data.

3.2.2 Non-data-adaptive methods

Non-data-adaptive TD filtering methods can be further subdivided into two domains: the spatial-frequency domain and the wavelet domain. We will not discuss the spatial-frequency domain in this work, but we briefly mention the wavelet transform because it is one of the most researched transform techniques. The wavelet transform [83] decomposes the input data into a scale-space representation, and it has been shown that wavelets can successfully remove noise while preserving image characteristics, regardless of frequency content [84][85][86][87][59].
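The transform-domain principle — signal energy concentrates in a few large coefficients while noise spreads thinly over all of them — can be illustrated with a deliberately simple sketch: a 1-D FFT with hard thresholding. This illustrates the principle only, not any of the cited methods (which use wavelets or other bases and more careful shrinkage rules); keep_ratio is an illustrative knob.

```python
import numpy as np

def fft_threshold_denoise(y, keep_ratio=0.02):
    """Transform, keep only the largest coefficients (assumed to carry the
    signal), zero out the rest (assumed to be mostly noise), then invert."""
    Y = np.fft.fft(y)
    thresh = np.quantile(np.abs(Y), 1.0 - keep_ratio)
    Y[np.abs(Y) < thresh] = 0.0
    return np.real(np.fft.ifft(Y))

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 512, endpoint=False)
clean = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)  # 4 nonzero bins
noisy = clean + 0.3 * rng.standard_normal(512)
denoised = fft_threshold_denoise(noisy)
```

Because the two sinusoids occupy only four frequency bins, almost all of the discarded coefficients are pure noise, and the reconstruction error drops well below that of the noisy input.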
3.2.3 Block-matching and 3D filtering: BM3D

BM3D [1] is a non-local, adaptive, non-parametric image denoising strategy based on an enhanced sparse representation in the transform domain. The enhanced sparsity is attained by grouping similar 2D fragments of the image into 3D data arrays, called "groups". These 3D groups are processed by a special procedure called collaborative filtering, which comprises three consecutive steps: a 3D transform of the group, shrinkage of the transform spectrum, and an inverse 3D transform. This yields a 3D estimate of the group, consisting of an array of jointly filtered 2D fragments. Because of the similarity between the grouped blocks, the transform achieves a highly sparse representation of the true signal, so the noise can be well separated by shrinkage. Collaborative filtering thus exposes even the finest details shared by the grouped fragments while preserving the important unique features of each individual fragment [1][88].

Figure 3.1: Scheme of the BM3D algorithm. (credits: Marc Lebrun [9])

This is one of the most popular, powerful, and effective denoising methods and was state of the art until recently. After the original work of [1], many improved versions of BM3D were developed [89][90]. [90] proposed the block-matching and 4D filtering (BM4D) method. Dabov et al., 2007 [91] proposed an improvement of BM3D for color image denoising that filters in a highly sparse local 3D transform domain in each channel of a luminance-chrominance color space. Many follow-up works combined the sparse prior with the NSS prior [92]. [93] collected non-local similar patches and solved a group-sparsity problem to achieve better denoising results. [94][57] proposed a non-local centralized sparse representation model in which
the mean value of the representation coefficients is pre-estimated based on the patch groups.

3.3 Deep Neural Network methods

Deep learning techniques were first used in image processing in the 1980s [95] and were first applied to image denoising by Zhou et al. [96][97][98]. After that, a feed-forward network was used to reduce the high computational cost and trade off denoising efficiency against performance [99]. The feed-forward network could smooth a corrupted image with Kuwahara filters [100], which are similar to convolutions. Although these techniques were effective, the networks did not allow the addition of new plug-and-play units, which restricted their generalization ability and practical use. To overcome these limitations, convolutional neural networks (CNNs) [101] were proposed. They had a slow start after their introduction, owing to a number of then-existing issues: the vanishing-gradient problem, the high computational cost of activation functions such as sigmoid and tanh, the lack of appropriate hardware for efficient computation, and so on. After the inception of AlexNet in 2012, however, things changed, and deep architectures were widely applied to video, natural language processing, speech processing, and other fields. In recent years, several CNN-based denoising methods have been proposed [102][103][104][2][105][59], whose performance has greatly improved compared to that of [106]. Neural-network-based denoising methods can be divided into two categories: multilayer perceptron (MLP) models and deep learning methods [59]. We discuss both briefly.

A multilayer perceptron (MLP) is a class of feed-forward artificial neural network (ANN) [107]. Popular MLP-based image denoising models include the auto-encoders proposed by Vincent et al. [103] and [104]. Chen et al.
[102] introduced a feed-forward deep network called the trainable non-linear reaction diffusion (TNRD) model, which achieved a better denoising effect. In general, MLP-based methods are efficient because they involve fewer inference steps. Moreover, because optimization algorithms [108] can derive the discriminative architecture, these methods have better interpretability. On the other hand, interpretability can
increase the cost of performance [2]. [109] presented a patch-based denoising algorithm learned on a large dataset with an MLP. Its results on additive white Gaussian (AWG) noise were competitive, but the method had its limitations: generalization to other noise models was not competitive. Deep networks were first applied to image denoising tasks in 2015 [110][111]. The authors of [110] pretrained a stacked denoising auto-encoder and applied dropout to prevent co-adaptation between units; the combination of dropout and stacked auto-encoders improved performance and reduced time in the fine-tuning phase. To address multiple low-level tasks with one model, a denoising CNN (DnCNN) [2] consisting of convolutions, batch normalization (BN) [112], rectified linear units (ReLU) [113], and residual learning (RL) [7] was proposed to handle image denoising and other image restoration tasks. CNN-based denoising methods were thus a success, attributed to their large modeling capacity and to tremendous advances in network training and design. However, discriminative denoising methods at that time (2018) were limited in flexibility: the learned model was usually tailored to a specific noise level, and methods like [2] did not generalize well to noise models other than the AWGN they were trained on. To mitigate these limitations, [114] introduced a fast and flexible denoising CNN (FFDNet), which takes different noise levels together with the noisy image patch as input to the denoising network, improving denoising speed and enabling blind denoising. A generative adversarial network (GAN) CNN blind denoiser (GCBD) [115] addressed the problem of unpaired noisy images by first generating ground truth and then feeding it into the GAN to train the denoiser.
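The residual-learning idea behind DnCNN — predict the noise and subtract it from the input, rather than predict the clean image directly — can be sketched with a linear model standing in for the CNN. The toy patch distribution, ridge penalty, and dimensions below are illustrative assumptions, not details from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, dim, sigma = 2000, 16, 0.3

# Toy "training set": smooth 1-D patches (random sinusoids) plus Gaussian noise.
t = np.linspace(0, 1, dim)
freqs = rng.uniform(0.5, 2.0, size=(n_train, 1))
phases = rng.uniform(0, 2 * np.pi, size=(n_train, 1))
clean = np.sin(2 * np.pi * freqs * t + phases)
noise = sigma * rng.standard_normal((n_train, dim))
noisy = clean + noise

# Residual learning: fit a map W that predicts the NOISE from the noisy patch
# (ridge-regularized least squares stands in for the trained network).
lam = 1e-2
W = np.linalg.solve(noisy.T @ noisy + lam * np.eye(dim), noisy.T @ noise)

def denoise(y):
    return y - y @ W   # subtract the predicted residual, as in DnCNN

# Evaluate on fresh patches from the same distribution.
f_te = rng.uniform(0.5, 2.0, size=(200, 1))
p_te = rng.uniform(0, 2 * np.pi, size=(200, 1))
clean_te = np.sin(2 * np.pi * f_te * t + p_te)
noisy_te = clean_te + sigma * rng.standard_normal((200, dim))
out = denoise(noisy_te)
```

Even this linear residual predictor beats the noisy input on held-out patches; the CNN in DnCNN makes the residual predictor far more expressive but keeps this subtract-the-noise structure.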
For more complex corrupted images, a deep plug-and-play super-resolution (DPSR) method [114] was developed to estimate the blur kernel and noise and recover a high-resolution image. In this section, we discussed existing neural-network-based methods for image denoising, covering both CNN- and MLP-based methods, all of which are learning-based approaches. In the next chapter, we focus on learning-free approaches like "Deep Image Prior", whose intuition directly contradicts learning-based methods and which are the foundation of most current state-of-the-art image denoising methods.
Chapter 4

Deep Image Prior

In this chapter, we introduce the concept of Deep Image Prior (DIP) and discuss its model architecture and usage in detail. After that, we introduce a few other works built upon this concept.

4.1 Image Priors

Before discussing the seminal work of [10], it is crucial to understand the concept of image priors. Image priors are prior information about a set of images [110] that one can use in image processing (or computer vision) problems to enhance results, ease the choice of processing parameters, resolve indeterminacies, etc. These priors, or their approximations, can be converted into mathematical formulations and used as part of some central mechanism or procedure (or algorithm).

4.2 Deep Image Priors

Ulyanov et al. [10] show that the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. What makes this idea outstanding is that it directly contradicts the traditional understanding that the excellent denoising performance of deep convolutional networks is due to their ability to learn realistic image priors from large numbers of example images. With this novel concept, there is no need to train a network on a dataset, or even
perform any training at all. To show this, we apply untrained convolutional neural networks (ConvNets): instead of training a ConvNet on a large dataset of sample images, we fit a generator network to a single corrupted image. The network weights serve as a parameterization of the restored image; they are randomly initialized and fitted to a specific degraded image under a task-dependent observation model. In this way, the only information used to perform image reconstruction is contained in the single noisy input image and in the handcrafted structure of the network used for reconstruction [10]. The four important empirical results contributed by [10] are:

• Low-level statistics can be captured by an untrained network.
• The model parameterization presents a high impedance to image noise and hence can naturally be used to filter noise out of a given image. The DIP method can also work under a blindness assumption.
• The choice of deep generator ConvNet architecture does have an impact on results, as different architectures impose rather different priors.
• DIP is similar to BM3D, one of the most popular transform techniques, in that both exploit self-structure and similarity.

4.2.1 Method

First, it is important to see the mathematical formulation of deep image priors, on which some important later concepts pivot. A function with a one-dimensional input and a multidimensional output can be thought of as drawing a curve in space; such a function is called a parametric function (its input is called a parameter) [116]. A deep generator network is an example of a parametric function that maps a code vector z to an image x [31].
x = fθ(z)   (4.1)

If we interpret the neural network as the parameterization of the image x ∈ R3×H×W (channels, height, width) given in equation 4.1, then the code z is a fixed random tensor z ∈ RC0×H0×W0. The neural
network can then be viewed as mapping the parameters θ (the weights and biases of the filters in the network's layers) to the image x. To model the conditional image distribution p(x|x0), where x is a natural image and x0 its corrupted version, image restoration (denoising) can be viewed as an energy minimization problem [10][117]:

x∗ = arg min_x E(x; x0) + R(x)   (4.2)

where E(x; x0) is a task-dependent data term and R(x) is a regularizer that is not tied to a specific application and captures the generic regularity of natural images. Instead of using an explicit regularizer in equation 4.2, the work [10] shows that the implicit prior captured by the neural network parameterization performs better for all image restoration tasks. The formulation can thus be rewritten as [10]:

θ∗ = arg min_θ E(fθ(z); x0),   x∗ = fθ∗(z)   (4.3)

The local minimizer θ∗ can be obtained with an optimizer such as gradient descent, starting from a random initialization of the parameters θ (Figure 4.1). As the only information available to the restoration task is the noisy image x0, given equation 4.3 the denoised result is obtained as x∗ = fθ∗(z). Another important fact demonstrated by [10] is that the choice of network architecture has a major impact on how the solution space is searched by methods such as gradient descent. This is an important observation: even though almost any image can be fitted by the model, they show empirically that the choice of architecture affects performance differently across image restoration tasks, and that the network resists "bad" solutions and descends much more quickly towards natural-looking images. As a result, minimizing equation 4.3 either yields a good-looking local optimum or at least an optimization trajectory that passes near one, as shown in Figure 4.1 [10].
To understand this idea better, consider it mathematically: given a basic reconstruction task in which the target image is x0 and we want to find parameter values θ∗ that reproduce the original image, the E(x; x0) term in equation 4.3 can
Figure 4.1: Image space visualization for DIP. Assume the problem of reconstructing an image xgt from a degraded measurement x0, exemplified here by denoising; the ground truth xgt has non-zero cost E(xgt, x0) > 0. If run for long enough, fitting with DIP will reach a solution with near-zero cost quite distant from xgt. However, the optimization path often passes close to xgt, and early stopping (here at time t3) recovers a good solution. Source: Ulyanov et al. [10]

be modelled as the L2 distance that compares the generated image x with x0:

E(x; x0) = ||x − x0||²   (4.4)

⇒ min_θ ||fθ(z) − x0||²   (4.5)

fθ(z) is (typically) a deep CNN with a U-shaped architecture [118]. One reason such networks are preferred is that one can draw samples from a DIP by taking random values of the parameters θ and looking at the generated images fθ(z); equivalently, we can visualize the starting points of the optimization process (eq. 4.3) before even fitting the parameters to the noisy input image. [10] shows empirically that these samples exhibit spatial structures and self-similarities whose scale depends on the network depth. Adding skip connections therefore yields images containing structures of several characteristic scales, as is desirable for modeling natural images, and it is natural that such architectures are the most popular choice for generative ConvNets. This U-shaped architecture is an encoder-decoder ("hourglass") network with skip connections; Leaky-ReLU is used as the activation function and ADAM as the optimizer.
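Equations 4.3–4.5 can be played out end to end at toy scale. Below, a two-layer fully connected "generator" with a fixed random code z is fitted by plain gradient descent to a single noisy 1-D signal x0; the sizes, initialization, and learning rate are stand-ins chosen only so that the mechanics of θ∗ = arg min_θ ||fθ(z) − x0||² are visible (real DIP uses the convolutional hourglass network and ADAM described above).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 128, 32      # signal length, hidden width, code size

# The single corrupted observation x0 -- the only data the method ever sees.
t = np.linspace(0, 1, n)
x0 = np.sin(2 * np.pi * 2 * t) + 0.3 * rng.standard_normal(n)

# Fixed random code z and random weights theta = (W1, W2):
# the "image" is parameterized as x = f_theta(z) = W2 relu(W1 z)   (eq. 4.1).
z = rng.standard_normal(k)
W1 = rng.standard_normal((m, k)) * np.sqrt(2 / k)
W2 = rng.standard_normal((n, m)) * np.sqrt(2 / m)

lr, losses = 1e-3, []
for _ in range(500):
    h = W1 @ z
    a = np.maximum(h, 0.0)
    out = W2 @ a
    dout = out - x0                   # gradient of 0.5*||f(z) - x0||^2 w.r.t. out
    losses.append(0.5 * float(dout @ dout))
    dW2 = np.outer(dout, a)           # backprop through out = W2 a
    dh = (W2.T @ dout) * (h > 0)      # backprop through relu
    dW1 = np.outer(dh, z)
    W2 -= lr * dW2
    W1 -= lr * dW1
```

The loss drives f_theta(z) onto x0, noise included — which is exactly why real DIP must stop early, as Figure 4.1 illustrates.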
Figure 4.2: Image restoration with DIP. Starting from random weights θ0, the weights are iteratively updated to minimize the data term in eq. 4.3. At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a fixed tensor and the mapping f is a neural network with parameters θ. The image x is used to compute the task-dependent loss E(x, x0); the gradient of the loss with respect to the weights θ is then computed and used to update the parameters. Source: Ulyanov et al. [10]

Figure 5.1 in the next chapter depicts the hourglass architecture. All works that improve on DIP share the architectural details described here.

4.2.2 Important Results

The DIP paper shows experimental results for various image restoration tasks, including single image denoising. The authors also claim that their model works under the blind denoising assumption, where the noise model is unknown, and can successfully recover images from complex corruptions. The paper shows results for the Gaussian noise model but not for impulse noise, shot noise, or other noise models. One important observation is that the DIP method performs similarly to non-local learning-free approaches like CBM3D [91] on the Gaussian noise model but outperforms them on non-Gaussian noise models. A hand-crafted prior method is one in which we embed hard constraints and teach what types of images are faces, natural scenes, etc., from synthesized data. As no part of the
neural network fθ is learned from a dataset beforehand, such a deep image prior is effectively hand-crafted, and it is empirically shown to outperform many standard non-learned priors such as TV [119] and BM3D [90], as well as a few learning-based approaches.

4.3 Related Work

The DIP method is akin to works that exploit the self-similarity of natural images and do not train on a hold-out set. In that regard, the approach is similar to BM3D (section 3.2.3) and the non-local means algorithm [120]; these methods avoid any training and rely on hand-crafted priors. The DIP work [10] demonstrated the remarkable phenomenon that CNNs can solve image restoration problems without any offline training or external data, and many algorithms have since been developed that improve upon this idea. [121] describes an improvement using "backprojection" (BP). Image restoration tasks can be formulated as minimization of a cost function composed of a fidelity term and a prior term; we saw this type of formulation in Section 4.2.1, and it can be generalized as follows:

min_x l(x, y) + β s(x)   (4.6)

where l is the fidelity term, s is the prior term, and β is a positive parameter that controls the level of regularization [121]. The backprojection fidelity term was first introduced in 2018 by [23] as an alternative to the widely used least-squares (LS) fidelity term l(x, y) = ½||y − Ax||² [121]. Empirically, for the priors we have discussed so far (e.g., TV, BM3D, and pre-trained CNNs), the BP fidelity term yields better recoveries than LS for badly conditioned A and requires fewer iterations of the optimization algorithms. In [121], the BP fidelity term is used to improve the performance of standard DIP (which uses the LS fidelity term as its loss function).
The paper, however, only examines performance on image deblurring; it remains to evaluate the method on the remaining image restoration tasks, such as image denoising and super-resolution. In another line of work, Cheng et al. [122] show that by conducting posterior inference using stochastic gradient Langevin dynamics, one
can avoid the need for early stopping, a major limitation of the current DIP approach, and improve results on image restoration tasks. They prove that DIP is asymptotically equivalent to a stationary Gaussian-process prior as the number of channels in each layer of the network goes to infinity, and they derive the corresponding kernel [122]. A Gaussian process is an infinite collection of random variables, any finite subset of which is jointly Gaussian distributed [122][123]. In another training-free approach, [124] presents a self-supervised learning method for single-image denoising: the network is trained with dropout on pairs of Bernoulli-sampled instances of the input image, and the result is estimated by averaging the predictions generated from multiple instances of the trained model with dropout. The authors show empirically that the proposed method not only significantly outperforms existing single-image non-learning methods but is also competitive with denoising networks trained on external datasets, although it still requires dropout and can be unstable without early stopping.

4.4 Limitations of Deep Image Priors

DIP is one of the most popular methods for reconstruction tasks and was state of the art until recently. Though the method has many merits, it also has limitations. One of the most significant unsolved issues is the need for early stopping: unless early stopping is employed, the PSNR (or SSIM) values drop after some number of iterations. [10] employs early stopping, and nearly all research after DIP uses some form of it. Another issue is the lack of a clear explanation of why the image prior emerges and why these priors fit the structure of natural images so well; there are no definitive answers to these questions that explain their effectiveness.
With respect to DIP's practical applications, there are two main problems: it is extremely slow, and it is unable to match or exceed the results of problem-specific methods [10]. For example, we saw in subsection 4.2.2 that its performance on the Gaussian noise model did not significantly outperform non-local state-of-the-art methods like CBM3D [91] or NLM [120]. In the work of Ulyanov et al. [10], the image corruption rate used for the denoising task is less than 0.5, so it remains
to be seen whether this method is efficient for higher corruption levels with respect to the trade-off between execution time and performance.
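The bookkeeping that early stopping forces on practitioners can be sketched generically: keep the best iterate seen so far and halt once `patience` iterations pass without improvement. In real DIP runs the ground-truth PSNR is unavailable, so the monitored score must be a proxy (e.g., the fitting loss on held-out pixels); the class below and the synthetic rise-then-fall score curve are illustrative only.

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    return 10 * np.log10(peak ** 2 / np.mean((x - ref) ** 2))

class EarlyStopper:
    """Track the best-scoring iterate; report a stop after `patience`
    consecutive non-improving iterations."""
    def __init__(self, patience=20):
        self.patience, self.best, self.bad, self.best_iter = patience, -np.inf, 0, None
    def update(self, score, iterate):
        if score > self.best:
            self.best, self.best_iter, self.bad = score, iterate.copy(), 0
        else:
            self.bad += 1
        return self.bad >= self.patience   # True -> stop now

# Synthetic score trajectory shaped like DIP's PSNR: rise, peak, decline.
scores = np.concatenate([np.linspace(10, 30, 100), np.linspace(30, 20, 200)])
stopper, stopped_at = EarlyStopper(patience=20), None
for i, s in enumerate(scores):
    if stopper.update(s, np.array([i])):
        stopped_at = i
        break
```

Here the monitor stops 20 iterations after the peak at iteration 99 and retains the peak iterate, so the later decline never contaminates the returned solution.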
Chapter 5

Rethinking Single Image Denoising

So far, we have discussed background concepts related to the image denoising task and a few of the earlier methods that attempt to tackle it, followed by some deep learning approaches, specifically deep image priors. In this chapter we introduce a few potential approaches for addressing the limitations of DIP mentioned in the last chapter and describe the experiments we conducted as part of this thesis work.1

5.1 Over-parameterization in deep learning

Although DIP has been an extremely popular method since 2017, it has its limitations. As we gain better insight into the generalizing abilities of DNNs, a new concept of learning over-parameterized models has emerged; it has been a crucial topic in machine learning since 2017 [125][11][126][127]. Over-parameterization occurs when the number of learnable parameters is much larger than the number of training samples (or, equivalently, when we fit a richer model than necessary). Deep artificial neural networks operate in this regime: they have far more trainable parameters than training examples. Nevertheless, some of these models exhibit remarkably small generalization error, i.e., the difference between "training error" and "test error". The traditional learning

1 Work done under project investigator Taihui Li and as part of the Sun research group.
theory suggests that when the number of parameters is large, some form of regularization is needed to ensure small generalization error [128]. But recent research has contradicted traditional learning theory, finding that over-parameterization empirically improves both optimization and generalization.

5.1.1 Over-parameterization vs. over-fitting

It is important to remark that over-parameterization and over-fitting are two different phenomena, and over-parameterization does not lead to over-fitting. When conventional learning theory says that over-parameterization leads to over-fitting, the parameters concerned belong to the hypothesis space from which the classifiers are constructed, whereas in deep neural networks such parameters belong to the classifier-construction part (the fully connected layers). Learning theory thus mostly concerns the training of a classifier (learner) from a feature space, but says little about the construction of the feature space itself. Though we can use the conventional theory to reason about generalization, we must be cautious when applying it to representation learning; this is demonstrated and debated convincingly in [129].

5.1.2 Regularization

Regularization is a collective group of strategies explicitly designed to reduce test error, so that an algorithm performs well not only on training data but also on unseen inputs. Many forms of regularization are available to the deep learning practitioner; in fact, developing more effective regularization strategies has been one of the major research efforts in the field. Two major types of regularization are relevant to our discussion: implicit and explicit regularization.
• Regularization introduced either as an explicit penalty term or by modifying optimization through, e.g., dropout, weight decay, or one-pass stochastic methods is referred to as explicit regularization. A lot of work has been done on understanding the effects of explicit regularization on training data and on the performance of deep learning models.
• Implicit regularization means that some form of regularization is introduced into a model implicitly. For example, Neyshabur et al., 2014 [130] argue that the low generalization error seen with over-parameterized models is caused by an implicit regularization introduced by the optimization of the network: the objectives for learning high-capacity (over-parameterized) models have many global minima that fit the training data perfectly. Implicit regularization was adopted in DIP as well.

5.2 Low-rank matrix recovery problem

In this section, we develop the intuition behind the low-rank matrix recovery problem and review prior work that attempts to solve it. This problem is important to study because the network fθ in DIP [10] (chapter 4) has a U-shaped architecture and can be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization X = UUT. DIP therefore inherits the drawbacks of the exact-parameterization approach to low-rank matrix recovery: it requires either a meticulous choice of network width or early stopping of the training process [11]. Low-rank matrices play an essential role in modeling and computational methods for machine learning. They lay the foundation for classical techniques such as principal component analysis [131][132][133] as well as for modern approaches to multi-task learning [134][135] and natural language processing. Specifically, they have broad applications in face recognition [136] (where saturation in brightness, self-shadowing, or specularity can be modeled as outliers), video surveillance [136][11] (where foreground objects are usually modeled as outliers), and beyond. However, the matrices we are ultimately interested in can be extremely large; although memory (and data acquisition) costs are getting cheaper, this will only encourage bigger matrix sizes.
This causes a number of issues, primarily that fully observing the matrix of interest can prove impossible; observations can also be corrupted with large errors. In such cases we are left with a highly incomplete set of observations, and unfortunately many of the most popular approaches to processing the data in low-rank matrix applications assume that a fully sampled data set is
available. These approaches are also generally not robust to missing or incomplete data. Thus, we face an inverse problem of retrieving the full matrix from incomplete observations. While such recovery is not possible in general, when the matrix has low rank it is possible to exploit this structure and execute the recovery in an astonishingly efficient manner [133]. Low-rank matrix recovery is therefore an essential step towards many of the actual applications discussed above. There are several methods for solving low-rank matrix recovery problems, of which the most commonly used in practice are low-rank approximation, nuclear norm minimization, iterative hard thresholding, and alternating projections. A long-established method for low-rank matrix recovery is nuclear norm minimization, which is provably accurate under certain incoherence conditions [136][137]. However, minimizing the nuclear norm involves expensive singular value decompositions (SVD) of large matrices [133] (when n is large), which forbids its application to problems of practical size. These issues have been mitigated by the recent development of matrix factorization methods [138][139], which rely on parameterizing the signal X ∈ Rn×n via the factorization X = UUT. This gives rise to a non-convex optimization problem with respect to U ∈ Rn×r, where r is the rank of X∗ [140].
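A stripped-down instance of the factorization approach makes the implicit bias visible. Here gradient descent runs on the factor U of X = UUT, with the measurement operator taken, for simplicity, to be full observation of X∗ (compressed or corrupted measurements are what make the real problem hard, so this is an illustrative assumption). Note the over-parameterized choice r0 = n together with small initialization: no explicit rank constraint appears anywhere, yet the recovered matrix ends up low-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3

# Ground-truth rank-r PSD matrix X* = U* U*^T; the solver never sees r.
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T

# Over-parameterized factor (r0 = n columns); small init drives the implicit bias.
U = 0.01 * rng.standard_normal((n, n))
lr = 0.002
for _ in range(5000):
    R = U @ U.T - X_star           # residual under full observation
    U -= lr * 2.0 * (R @ U)        # gradient of 0.5 * ||U U^T - X*||_F^2

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
eigs = np.linalg.eigvalsh(U @ U.T)  # ascending; only ~r of them end up significant
```

Directions outside the true signal subspace start at scale 0.01 and are never amplified, so after convergence only about r eigenvalues of UUT are non-negligible — the algorithmic low-rank regularization the text describes.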
Finally, we will present an ensemble approach that combines concepts from these two papers and argue why their combination might provide insight for solving current DIP limitations and be a step towards robust image recovery.
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates

In [11], the authors show that the challenges associated with exact-parameterization methods can be dealt with simply and effectively via over-parameterization and discrepant learning rates, supported by recent results in [126][142] for low-rank matrix recovery. The key to the success of this method is the notion of the implicit bias of discrepant learning rates: the algorithmic low-rank and sparse regularizations need to be balanced in order to discern the underlying rank and sparsity. In the absence of a means for tuning a regularization parameter [11], the authors show that the desired balance can be acquired by using different learning rates for different optimization parameters. The following four subsections summarize the main results and algorithms discussed in the paper.

Double Over-Parameterization Formulation

In [11], the aim is to learn an unknown signal X∗ ∈ Rn×n from its grossly corrupted linear measurements:

y = A(X∗) + s∗   (5.1)

where A(·) : Rn×n → Rm is the measurement operator and s∗ ∈ Rm is a sparse corruption vector (this formulation is similar to the one discussed in the section on noise models). Equivalently, it is the problem of recovering a rank-r (r ≪ n) positive semi-definite matrix X∗ from the grossly corrupted linear measurements in equation 5.1. The work introduces a double over-parameterization approach for robust matrix recovery, with X = UUT and s = g ◦ g − h ◦ h:

min_{U∈Rn×r0, {g,h}⊆Rm} f(U, g, h) := ¼ ||A(UUT) + (g ◦ g − h ◦ h) − y||²   (5.2)

where the dimensional parameter r0 ≥ r. In practice, the choice of r0 depends on how much prior information we have about X∗: it can either be taken as an estimated upper bound on r or set to r0 = n with no prior knowledge.
Thus, the authors introduce a method based on over-parameterizing both the low-rank matrix X∗ and the outliers s∗, thereby leveraging implicit algorithmic bias to find the correct solution (X∗, s∗).
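The doubly over-parameterized objective of equation 5.2 is straightforward to express in code. The sketch below is illustrative only: it assumes the measurement operator A is simple flattening (pure denoising), and the problem sizes, corruption rate, and initialization scale are hypothetical choices, not the paper's settings.

```python
import torch

torch.manual_seed(0)
n, r, r0 = 16, 2, 16          # true rank r; over-parameterized width r0 (here r0 = n)
m = n * n

L = torch.randn(n, r)
X_star = L @ L.T              # rank-r positive semi-definite ground truth

def A(X):
    # Measurement operator: identity/flattening, an assumption for this sketch.
    return X.reshape(-1)

s_star = torch.zeros(m)
idx = torch.randperm(m)[: m // 10]       # ~10% sparse corruption
s_star[idx] = 5.0 * torch.randn(m // 10)
y = A(X_star) + s_star                   # corrupted measurements, as in Eq. (5.1)

# Double over-parameterization: X = U U^T and s = g∘g - h∘h, small random init.
U = (0.1 * torch.randn(n, r0)).requires_grad_()
g = (0.1 * torch.randn(m)).requires_grad_()
h = (0.1 * torch.randn(m)).requires_grad_()

def f(U, g, h):
    # Objective of Eq. (5.2): (1/4) || A(UU^T) + (g∘g - h∘h) - y ||_2^2
    residual = A(U @ U.T) + (g * g - h * h) - y
    return 0.25 * (residual ** 2).sum()

loss = f(U, g, h)
loss.backward()               # gradients flow to all three parameter blocks
```

Note that the sparse term g ◦ g − h ◦ h can represent any real vector (positive parts via g, negative parts via h), so the parameterization loses no generality.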
Algorithmic Regularization via Gradient Descent

In general, over-parameterization leads to under-determined problems that can have an infinite number of solutions (analogous to linear algebra, where the number of parameters exceeds the number of equations). Thus, not all solutions of the doubly over-parameterized equation 5.2 correspond to the desired (X∗, s∗). The paper shows, empirically and theoretically, that gradient descent iteration on equation 5.2 with properly selected learning rates enforces an implicit bias on the solution path and thereby automatically identifies the desired, regularized solution (X∗, s∗). Proofs of these results are beyond the scope of this work.

Implicit Bias with Discrepant Learning Rates

Optimizing a linear multi-layer neural network via gradient descent leads to a low-rank solution; this phenomenon is known as implicit regularization. It has been extensively studied in the context of matrix factorization [126][143][144], linear regression [145][146], logistic regression [125], and linear convolutional neural networks [127]. It is well known that optimization algorithms like gradient descent introduce implicit biases (without early stopping) [125] that play a crucial role in the generalization ability of learned models, but how to control the implicit regularization of gradient descent remains an open challenge. The authors also derive theoretically the value of the penalty λ as the algorithm approaches convergence for the unconstrained Lagrangian formulation of the rank-r matrix X∗. They further prove that the implicit regularization can be controlled, without explicitly adding any regularization term to equation 5.2, by adapting the ratio of the learning rates. This observation directly contradicts conventional optimization theory [147], which holds that learning rates affect only the convergence rate of an algorithm, not the quality of its solution.
Extension to Natural Image Denoising

Finally, combining the ideas from the above subsections to solve the image restoration problem, the approach in [11] is inspired by [10]'s DIP method: the formulation of equation 5.1 is used with X = φ(θ), where φ is a deep convolutional network and θ ∈ R^c represents the network parameters.
Figure 5.1: Architecture used in [10], which is also the base architecture for You et al. [11]. The hourglass (also known as encoder-decoder) architecture sometimes has skip connections, represented in yellow. nu[i], nd[i], ns[i] correspond to the number of filters at depth i for the upsampling, downsampling, and skip connections respectively; ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al. [10]

With respect to implementation details, the network φ(θ) is similar to that of the original DIP work [10]. It has the same U-shaped architecture with skip connections, where each layer contains a convolutional layer, a LeakyReLU layer, and a batch normalization layer. The noise model for the images in You et al. is salt-and-pepper noise. The ideas presented in [11] are therefore promising: due to the algorithmic bias of discrepant learning rates, the need to tune the network width or terminate early is eliminated, because the method alleviates the problem of over-fitting in robust image recovery. This advantage also enables the method to recover different image types at varying corruption levels without tuning the network's learning parameters; that is, for different noise models one would not need to change the network width or other learning parameters.

5.3.2 Implicit Rank-Minimizing Autoencoder

Autoencoders (AE) are a popular category of methods for learning representations without requiring labeled data; we discussed them in more detail in Section 2.4. An essential component of autoencoder methods is the mechanism by which the information capacity
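The fitting loop shared by DIP and You et al. can be sketched compactly. This is a minimal stand-in, not the full hourglass of Figure 5.1: the network below is a hypothetical three-layer ConvNet with the conv / batch-norm / LeakyReLU constituents described above, the image is a random tensor, and the layer widths and iteration count are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = W = 32
y = torch.rand(1, 3, H, W)            # noisy observation (random stand-in image)
z = 0.1 * torch.randn(1, 32, H, W)    # fixed random input code; never updated

# Tiny stand-in for phi(theta): conv -> BN -> LeakyReLU blocks, sigmoid output.
phi = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

with torch.no_grad():
    init_loss = torch.abs(phi(z) - y).mean().item()

# Fit theta so that phi_theta(z) matches y; only the network weights move.
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = torch.abs(phi(z) - y).mean()   # l1 fidelity, as in our DIP-l1 variant
    loss.backward()
    opt.step()
```

The denoised estimate is phi(z) at a well-chosen iteration; as discussed below, without further regularization the loop eventually fits the noise in y, which is exactly the early-stopping problem.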
of the latent representation is minimized or limited. In [141], the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent in multi-layer linear networks leads to minimum-rank solutions.

5.4 Proposed Methodology

Now that we understand the two main ideas presented in this chapter, we discuss an ensemble method that is an amalgamation of them. By exploiting the double over-parameterization of the two low-dimensional structures in the image restoration objective, along with discrepant learning rates to regularize the optimization path, we can forgo the need for early termination and parameter tuning, while adding additional linear layers between the encoder and decoder yields the minimum-rank regularized solution and thereby ensures convergence.

With respect to network details, our method is not identical in structure to You et al. [11], which uses a U-shaped architecture with skip connections like DIP. Our method also has a U-shaped architecture with residual connections, but with some additional changes. We change the constituents of the encoder-decoder blocks (shown in figure 5.1) from vanilla ConvNet layers to ResNet blocks: our method uses multiple ResNet blocks, each consisting of three convolutional, batch normalization, and LeakyReLU layers. This contrasts with the original DIP, which did not have batch normalization layers or LeakyReLU, and with You et al. [11], who used ReLU activations instead. Also, unlike You et al. [11], we do not over-parameterize our noise model. Another change with respect to DOP (we will sometimes refer to You et al.'s work as the "Double Over-Parameterized Prior", or DOP) is that we add three additional linear layers between the encoder-decoder blocks, inspired by Jing et al. [141]. In the next chapter, we share results for different numbers of linear layers.
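The linear-bottleneck idea from Jing et al. [141] amounts to stacking extra linear maps, with no nonlinearity between them, at the encoder-decoder junction; gradient descent then implicitly biases their product towards low rank. A minimal sketch, where the 1×1 convolutions, channel count, and surrounding encoder/decoder are hypothetical stand-ins for our actual blocks:

```python
import torch
import torch.nn as nn

def linear_bottleneck(channels: int, k: int) -> nn.Sequential:
    # k purely linear layers (1x1 convs, no activation) between them;
    # their composition is a single linear map whose effective rank is
    # implicitly minimized by gradient descent.
    return nn.Sequential(*[nn.Conv2d(channels, channels, 1) for _ in range(k)])

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU())
bottleneck = linear_bottleneck(32, k=3)   # we experiment with k = 3, 6, 9
decoder = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1))

model = nn.Sequential(encoder, bottleneck, decoder)
out = model(torch.randn(1, 3, 16, 16))    # shape preserved end to end
```

Because the bottleneck is linear, it adds regularization without adding representational power, which is the point: capacity of the code is limited implicitly rather than by shrinking the architecture.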
In the default configuration we always use the l1 loss, in contrast to DIP and DOP, which use the MSE loss. Thus, towards an effective solution to our problem of single-image denoising, we have proposed a method that ensures a minimum-rank regularized solution via double over-parameterization of both the minimum-rank matrix signal and the sparse corruption vector, leveraging implicit algorithmic bias. In this chapter, we laid the foundations of this
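The choice of the l1 loss over MSE is motivated by robustness to sparse corruption: a squared loss is dominated by large outliers, while the l1 loss grows only linearly in their magnitude. A tiny worked example (values chosen for illustration):

```python
import torch

clean = torch.zeros(100)
noisy = clean.clone()
noisy[0] = 10.0                       # one salt-and-pepper-style outlier

mse = ((noisy - clean) ** 2).mean()   # 10^2 / 100 = 1.0: outlier dominates
l1 = (noisy - clean).abs().mean()     # 10 / 100 = 0.1: linear in the outlier
```

Under MSE, the single corrupted pixel contributes as much error as 100 pixels each off by 1; under l1 its influence is ten times smaller, so the fit is pulled less strongly towards reproducing the corruption.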
approach with theoretical arguments. In the next chapter, we detail the experimental steps and results for our approach.
Chapter 6

Preliminary Experiments

In this chapter, we present results and analyze observations from preliminary experiments conducted during the course of this thesis work. The results stated below are part of an ongoing effort towards solving some of the limitations of single-image denoising (as discussed in chapters 4 and 5).

6.1 Dataset

In this thesis work, we have focused on image restoration, specifically single-image denoising. For developing the algorithm we use the popular set of 8 images employed widely in almost all image denoising works: Lena, peppers, F16-GT, Barbara, Lake, Kodak Inc., baboon, and snail. These are standard benchmarking images, and we have also tested our algorithm on a subset of this set: F16 and Lena.

6.2 System Configuration

We used Google Colab Pro with an NVIDIA Tesla P100 GPU, 16 GB of system RAM, and 100 GB of Google cloud storage for development and experimentation. In our work, we used PyTorch 1.8.1.
6.3 Hyper-parameter Tuning

We tuned our models over all combinations of the following parameters: learning rate, optimizer, kernel size, activation function, and input noise model. We varied the learning rates from 1e-5 to 0.1. The optimizers used are Adam [148] and SGD [149], and the activation functions used are ReLU and Leaky-ReLU [150]. The input noise models are Gaussian and salt-and-pepper, with corruption rates varying from 20% to 90%.

6.4 Results and Observations

In the figures below, we show results for three different methods: DIP, You et al. [11], and our proposed approach (outlined in the previous chapter). All results shared use SGD as the optimizer with learning rates of 0.01, τ = 1, and 0.1. For DIP, we use a variant of the original method with the l1 loss; DIP-l1 gives far better results than MSE-based DIP. Similarly, for You et al. and our proposed approach, we use the l1 loss. We report results as PSNR values, where for DIP-l1 we output the last iterations averaged with an exponential sliding window (as reported in the paper, the averaged output of the model gives excellent results in blind image denoising). Below we show results of our architecture for the image Lena. The noise models used are salt-and-pepper at various corruption levels (from 0.5 to 0.9) and Gaussian noise with σ = 25. In Figure 6.1, we depict Lena, its 60%-corrupted version with salt-and-pepper noise, and the image reconstructed using our proposed denoising algorithm. Figures 6.4, 6.5, and 6.6 compare the performance of all three methods for the same corruption type and level. We performed experiments at various corruption levels for all three methods and found that as the corruption level increases (salt-and-pepper noise), the best PSNR values decrease and the reconstruction becomes noisier. This is depicted in Figure 6.9. Our method and You et al.
[11] perform better than DIP at higher corruption levels, while our method mostly performs best of all for corruption levels below 70%. At higher corruption levels, You et al. [11] (the DOP method) is slightly better, but on average our method is the most stable and consistent performer. The results reported in this chapter are all based on the l1 loss; even for DIP we use its l1-loss
variant, whereas the original paper [10] uses the MSE loss. Another important observation is that You et al. [11]'s method, if used with the Adam optimizer, needs early termination to avoid the dip in PSNR values; the same trend is seen in our approach. Figures 6.7-6.8 depict this observation for the You et al. [11] (DOP) approach and show that the network starts learning noise in the absence of early stopping: the reconstruction PSNR worsens as training progresses. Regarding another novelty of our method, the linear layers added between the encoder-decoder blocks, there are important observations on the effect of increasing their number. Jing et al. prove that adding more linear layers increases the regularization effect, and this claim was consistent with our observations, along with another important result: on increasing the number of linear layers from 3 to 6, as the regularization effect increases, the time to reach the best PSNR decreases (as shown in Figure 6.10). Not only does the network become comparatively more stable, the number of epochs needed to reach the best PSNR also decreases. Averaged over 3 separate runs, the number of epochs for 3 and 6 linear layers was 19625 and 15375 respectively. For 9 linear layers, the average over 3 trials was 16375, which is fewer than for 3 layers but slightly more than for 6. With more trials, we might see a clearer trend. (Note: Figures 6.9-6.10 report results on the F16-GT image.)
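For reproducibility of the setup above, the salt-and-pepper corruption and the PSNR metric can be sketched as follows. These are generic helper implementations under our assumptions (images in [0, 1], corruption split evenly between salt and pepper), not code extracted from our experiment scripts.

```python
import torch

def salt_and_pepper(img: torch.Tensor, rate: float) -> torch.Tensor:
    # Corrupt a fraction `rate` of pixels: half forced to 0 (pepper),
    # half forced to 1 (salt). Assumes img values lie in [0, 1].
    out = img.clone()
    mask = torch.rand_like(img)
    out[mask < rate / 2] = 0.0
    out[(mask >= rate / 2) & (mask < rate)] = 1.0
    return out

def psnr(x: torch.Tensor, ref: torch.Tensor, peak: float = 1.0) -> float:
    # Peak signal-to-noise ratio in dB against a reference image.
    mse = ((x - ref) ** 2).mean()
    return float(10 * torch.log10(peak ** 2 / mse))

torch.manual_seed(0)
img = torch.rand(3, 64, 64)
noisy = salt_and_pepper(img, rate=0.6)   # 60% corruption, as in Figure 6.1
```

PSNR of a reconstruction is then simply `psnr(reconstruction, img)`; higher is better, and the best value over the training run is what the plots in this chapter report.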
Figure 6.1: From top left to bottom right: the top row shows (a) the ground-truth image Lena and (b) its noisy counterpart at a 60% corruption level with salt-and-pepper noise. The bottom row shows (c) the real image, same as (a), (d) the noisy image, same as (b), and (e) the image reconstructed using our approach.
Figure 6.2: From top to bottom: (a) the PSNR plot corresponding to Figure 6.1, and (b) the loss plot (l1 loss) for the reconstruction process using our approach.
Figure 6.3: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at an 80% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using our approach. The best PSNR achieved is 21.8671 dB. As we can see, performance degrades at higher corruption levels.
Figure 6.4: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the DIP-l1 approach [10]. The best PSNR achieved is 28.548 dB.
Figure 6.5: From top left to bottom right: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the original image, same as (a), (d) the noisy image, same as (b), and (e) the image reconstructed using our approach. The best PSNR achieved is 29.2449 dB.
Figure 6.6: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the You et al. [11] approach (width = 128). The best PSNR achieved is 28.9 dB.
Figure 6.7: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the You et al. [11] approach (width = 128) with the Adam optimizer (unless mentioned, the optimizer is SGD). The reconstructed image becomes noisier as training continues, because the network starts learning noise in the absence of early termination.
Figure 6.8: The PSNR plot corresponding to Figure 6.7. It clearly demonstrates the need for early termination; the best PSNR achieved is 28.14 dB before the dip.
Figure 6.9: Best-PSNR plots at different corruption levels for each of the three methods discussed in the last chapter. The line plots show the best PSNR reached by each method at each corruption rate. The plot supports our observation that with increasing corruption rate, the best PSNR reached by all three methods decreases. Our method is also the most consistent and stable performer of the three.
Figure 6.10: The effect of the number of linear layers on our method as the number goes from 3 to 9. Over three independent trials, the average performance is depicted by the dashed line; the number of epochs needed to reach the highest PSNR decreases when the number of layers increases from 3 to 6, and increases slightly from 6 to 9. The corruption level in all three trials is 50%.
Chapter 7

Conclusion and Discussion

In this thesis work, we aimed to study the fundamental theory of image restoration tasks, their causes, and their artifacts. We started with the mathematical formulation of basic image restoration tasks and dived into the conceptual theory of popular deep learning methods such as CNNs, autoencoders, and GANs, which are the building blocks of many state-of-the-art denoising algorithms. We also reviewed different noise models, their causal effect on restoration tasks, and the relevant perceptual quality measures used to assess recovery performance, especially for image denoising. We chronologically categorized the existing work in the image denoising field, analyzing the shortcomings of classical spatial-domain methods that led to transform-domain methods, followed by the current family of state-of-the-art methods based on learned priors with deep neural networks. In chapter 4, we introduced the seminal work by Ulyanov et al. [10] on deep image priors, which is the foundation of current learning-free untrained-network methods. This work led to a shift in perspective from conventional theory and motivated newer research ideas such as [122] and [11]. We saw that despite deep image priors' performance being competitive with non-local methods like BM3D, a number of non-trivial limitations hinder reliable adoption of this technology in practical scenarios, for example the need for early stopping. There was also an incomplete understanding of the method's generalization performance and its quantification. Since 2017, we have seen growing understanding of why CNNs and over-parameterized networks generalize so well, and of the hardness of neural networks. Since then, many have come up with interesting
ideas for improving the untrained-network paradigm and making it more robust to its limitations. [11] proved that the algorithmic bias of discrepant learning rates in a doubly over-parameterized network eliminates any need for tuning learnable parameters or early stopping. We combined this work with [141] to propose a new method for single-image denoising, and argued why this methodology might work and be a step towards a generalized image denoising algorithm.

There are many limitations to this work. While we suggest a methodology exploiting the best of the two ideas discussed in chapter 5, it is not without shortcomings. As we saw in [10], there is an explicit need for early stopping. This problem does not exist for [11], but only if we use gradient descent as the optimizer; with the Adam optimizer the need for early stopping remains, and this is also reflected in our proposed method. Another major disadvantage is that [11] and [10] report results for impulse and Gaussian noise models, i.e., sparse corruptions (the results can be extrapolated to shot and speckle noise models), but these methods may not work for other, non-sparse corruption types such as defocus blur, elastic noise, etc. [151]. The DIP paper argues that its method can handle complex noise models in a blind denoising setting, but experimental analysis shows the results are not competitive with classical methods such as BM3D. Further, with more complex noise models, SGD as an optimizer might not work, leaving us with a need for early stopping yet again. For our proposed algorithm, we still need to explore its failure modes under more noise models and corner cases, and to find a way to mitigate overfitting. The results presented in the last chapter were part of a preliminary investigation and do not demonstrate the generalization capacity of the algorithm.
Thus, we proposed a method that eliminates the need for learnable-parameter tuning and early stopping, and gives better results than DIP or [11] for simple noise models. However, our method is still not robust to all 19 types of noise models in [151]. Many recent works suggest interesting ideas, such as TV-regularized DIP [71] or Bayesian DIP [122], and there is definite potential in combining these methods with our approach of deep linear autoencoders with skip connections, the algorithmic bias of discrepant learning rates, and SGD. It thus remains an open challenge to develop a generalized algorithm that handles multiple corruptions in a stable manner.
References

[1] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[2] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, PP, 08 2016.
[3] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[4] Fahad Shamshad, Muhammad Awais, Muhammad Asim, Zain ul Aabidin Lodhi, Muhammad Umair, and Ali Ahmed. Leveraging deep Stein's unbiased risk estimator for unsupervised X-ray denoising, 2018, 1811.12488.
[5] Wikipedia. Artificial neural network. https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Neuron3.png.
[6] Malcolm Sambridge. An introduction to inverse problems. http://web.gps.caltech.edu/classes/ge193.old/lectures/Lecture1.pdf.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015, 1512.03385.
[8] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders, 2021, 2003.05991.
[9] Marc Lebrun. An analysis and implementation of the BM3D image denoising method. Image Processing On Line, 2:175–213, 08 2012.
[10] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. International Journal of Computer Vision, 128(7):1867–1888, Mar 2020.
[11] Chong You, Zhihui Zhu, Qing Qu, and Yi Ma. Robust recovery via implicit bias of discrepant learning rates for double over-parameterization, 2020, 2006.08857.
[12] Joanna J. Bryson. The past decade and future of AI's impact on society.
[13] Jack Copeland. The Cyc project.
[14] Rupali Ahuja Rajshree. A general review of image denoising techniques.
[15] Mukesh Motwani, Mukesh Gadiya, Rakhi Motwani, and Frederick Harris. Survey of image denoising techniques. 01 2004.
[16] S M A Sharif, Rizwan Ali Naqvi, and Mithun Biswas. Learning medical image denoising with deep dynamic residual attention network. Mathematics, 8(12), 2020.
[17] Dang Thanh, Surya Prasath, and Hieu Le Minh. A review on CT and X-ray images denoising methods. Informatica, 43:151–159, 06 2019.
[18] Wikipedia. Deep learning. https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks.
[19] Awan-Ur-Rahman. What is artificial neural network and how it mimics the human brain?
[20] MIT. Inverse problems. http://web.mit.edu/2.717/www/inverse.html.
[21] Encyclopedia of Mathematics. Ill-posed problems. https://encyclopediaofmath.org/wiki/Ill-posed_problems.
[22] Wikipedia. Inverse problem. https://en.wikipedia.org/wiki/Inverse_problem#:~:text=An%20inverse%20problem%20in%20science,measurements%20of%20its%20gravity%20field.
[23] Tom Tirer and Raja Giryes. Image restoration by iterative denoising and backward projections. IEEE Transactions on Image Processing, PP, 10 2017.
[24] Y. Bengio. Learning deep architectures for AI. Foundations, 2:1–55, 01 2009.
[25] Juergen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61, 04 2014.
[26] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad Alsaadi. A survey of deep neural network architectures and their applications. Neurocomputing, 234, 12 2016.
[27] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost, 2014, 1412.1710.
[28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks, 2015, 1505.00387.
[29] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., USA, 1995.
[30] W. Venables and B. Ripley. Modern Applied Statistics with S, fourth edition. 2002.
[31] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014, 1406.2661.
[32] Wikipedia. Generative adversarial network. https://en.wikipedia.org/wiki/Generative_adversarial_network.
[33] Jason Brownlee. A gentle introduction to generative adversarial networks (GANs). https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs, 2016, 1606.03498.
[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018, 1611.07004.