Robustness in Deep Learning: Single Image Denoising
using Untrained Networks
A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Esha Singh
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
Master of Science
Ju Sun
May, 2021
© Esha Singh 2021
ALL RIGHTS RESERVED
Acknowledgements
I would first like to thank my advisor, Professor Ju Sun for providing me with an
opportunity to be a part of his research lab and for his continuous support and guidance.
This thesis work would not have been successful without his able advice, feedback and
teachings.
I would also like to thank Taihui Li for helping me with the experiments and for his
constant support, discussions, analysis, and feedback, which helped me complete my
work and improve the results.
I would also like to thank the members of my thesis committee, Professor Hyun Soo
Park and Professor Gilad Lerman.
Finally, my deep and sincere gratitude to my family and friends for their uncondi-
tional and unparalleled love and support.
Dedication
To my mother and father, friends, and colleagues who have mentored and held me up
along the way.
Abstract
Deep Learning has become one of the cornerstones of today’s AI advancement and
research. Deep Learning models are used for achieving state-of-the-art results on a wide
variety of tasks, including image restoration problems, specifically image denoising.
Despite recent advances in applications of deep neural networks and the presence of a
substantial amount of existing research work in the domain of image denoising, this task
is still an open challenge. In this thesis work, we aim to summarize the study of image
denoising research and its trend over the years, the fallacies, and the brilliance. We
first visit the fundamental concepts of image restoration problems, their definition, and
some common misconceptions. After that, we trace back to where the study
of image denoising began, categorize the work done so far into three main
families, with the main focus on the neural-network family of methods, and discuss some
popular ideas. We also trace the related concepts of over-parameterization,
regularisation, and low-rank minimization, and discuss the recent untrained-networks
approach to single image denoising, which is fundamental to understanding why current
state-of-the-art methods still cannot provide a generalized approach for stable
image recovery under multiple perturbations.
Contents
Acknowledgements i
Dedication ii
Abstract iii
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Image Restoration Problem Formulation . . . . . . . . . . . . . . 7
2.3 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 ResNets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Noise models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.1 MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.2 PSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.3 SSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Image Denoising Algorithms: Review 15
3.1 Spatial domain methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Spatial domain filtering . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Variational denoising methods . . . . . . . . . . . . . . . . . . . 16
3.2 Transform domain methods . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Data adaptive methods . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Non Data adaptive methods . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 Block-matching and 3D filtering: BM3D . . . . . . . . . . . . . . 19
3.3 Deep Neural Network methods . . . . . . . . . . . . . . . . . . . . . . . 20
4 Deep Image Prior 22
4.1 Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Limitations of Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . 28
5 Rethinking Single Image Denoising 30
5.1 Over-parameterisation in deep learning . . . . . . . . . . . . . . . . . . . 30
5.1.1 Overparameterisation v/s over-fitting? . . . . . . . . . . . . . . . 31
5.1.2 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Low-rank matrix recovery problem . . . . . . . . . . . . . . . . . . . . . 32
5.3 Rethinking Single Image denoising: Main Ideas . . . . . . . . . . . . . . 33
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates . 34
5.3.2 Implicit Rank-Minimizing Autoencoder . . . . . . . . . . . . . . 36
5.4 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Preliminary Experiments 39
6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3 Hyper-parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.4 Results and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Conclusion and Discussion 52
References 54
Appendix A. Glossary and Acronyms 68
List of Tables
A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Figures
1.1 Performance of existing medical image denoising methods in removing
image noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c) Result
obtained by DnCNN [2]. Source: (https://www.kaggle.com/mateuszbuda/
lgg-mri-segmentation). . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with
standard deviation of 25. Left to right: (a) noisy image, (b) denoised
image using an unsupervised cleaning. Source: Shamshad et al. [4]. . . . 4
2.1 ANN inspired by biological neural networks. The inputs are denoted
by x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon
terminal ends. Source: Wikipedia[5]. . . . . . . . . . . . . . . . . . . . . 6
2.2 Inverse problem. Source: caltech GE193[6] . . . . . . . . . . . . . . . . . 7
2.3 Residual learning: a building block. (Source: He et al. [7]) . . . . . . . . 8
2.4 An example of an autoencoder. The input image is encoded to a com-
pressed representation and then decoded. (Source: Bank et al. [8]) . . . 11
3.1 Scheme of the BM3D algorithm. (credits: Marc Lebrun [9]) . . . . . . . . . . 19
4.1 Image space visualization for DIP. Assume the problem of reconstructing
an image xgt from a degraded measurement x0, exemplified here by denoising;
the ground truth xgt has non-zero cost E(xgt, x0) > 0. If run for long
enough, fitting with DIP will acquire a solution with near-zero cost quite
distant from xgt. However, the optimization path often passes close to xgt,
and early stopping (here at time t3) will recover a good solution.
Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Figure depicting image restoration process using DIP. Starting from a
random weight θ0, one must iteratively update them in order to minimize
the data term eq. (4.3). At every iteration t the weights θ are mapped
to an image x = fθ(z), where z is a fixed tensor and the mapping f is a
neural network with parameters θ. The image x is used to calculate the
task-dependent loss E(x, x0). The loss gradient w.r.t. the weights θ is
then calculated and used to update the parameters. Source: Ulyanov et
al. [10] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Architecture used in [10] and also the base architecture for You et al. [11].
The hourglass (also known as encoder-decoder) architecture. It sometimes
has skip connections represented in yellow. nu[i], nd[i], ns[i] correspond
to the number of filters at depth i for the upsampling, downsampling,
and skip-connections respectively. The values ku[i], kd[i], ks[i] correspond
to the respective kernel sizes. Source: Ulyanov et al. [10] . . . . . . . . . 36
6.1 From top left to bottom right: (a) The images in top row show ground
truth image Lena and (b) its noisy counterpart using 60% corruption
level for salt and pepper noise. The bottom row images show (c) Real
image same as (a), (d) noisy image same as (b), and (e) the reconstructed
image using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 From top to bottom: (a) The image in top shows PSNR plot for cor-
responding Figure 6.1, (b) loss plot (L1 loss) for reconstruction process
using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 80% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using our approach. The best PSNR achieved is 21.8671.
As we can see for higher corruptions we get poorer performance. . . . . 44
6.4 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using [10] DIP-l1 approach. The best PSNR achieved is
28.548 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 From top left to bottom right: (a) The images in top row show ground
truth image F16-GT and (b) its noisy counterpart using 50% corruption
level for salt and pepper noise. The bottom row image left shows (c)
original image same as (a), (d) shows noisy image same as (b) and, (e) is
the reconstructed image using our approach. The best PSNR achieved is
29.2449 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.6 From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level
for salt and pepper noise. The bottom row image shows (c) the recon-
structed image using You et al. [11] approach (width = 128). The best
PSNR achieved is 28.9 dB. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.7 From top left to bottom: (a) The images in top row show ground truth im-
age F16-GT and (b) its noisy counterpart using 50% corruption level for
salt and pepper noise. The bottom row image shows (c) the reconstructed
image using You et al. [11] approach (width = 128) and ADAM optimiser
(unless mentioned otherwise, the optimizer is SGD). The reconstructed image becomes
noisier as the training continues because the network starts learning noise
in absence of early termination. . . . . . . . . . . . . . . . . . . . . . . . 48
6.8 (a) The image shows PSNR plot for corresponding Figure in 6.7. It clearly
demonstrates the need for early termination. The best PSNR achieved
is 28.14 dB before the dip. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.9 (a) The image shows PSNR plots for different corruption levels for each
of the three methods discussed in last chapter. The line plots show best
PSNR levels across models for different corruption rates. The plot sup-
ports our claim in observations that with increasing corruption rate the
best PSNR level reached by all three methods decreases. Also, our
method is the most consistent and stable in performance amongst all
three methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.10 The plot depicts the effect of number of linear layers for our method when
the number goes from 3 to 9. For three independent trials, the average
performance is depicted via the dashed line and we see that the number
of epochs to reach the highest PSNR value decreases when we increase
the number of layers from 3 to 6. From 6 to 9 layers we see a slight
increase. The corruption level for all three trials is 50%. . . . . . . . . . 51
Chapter 1
Introduction
Artificial Intelligence (AI) is intelligence demonstrated by machines, in contrast to the
natural intelligence exhibited by humans, which involves consciousness and emotionality.
For two decades, AI has been at the helm of revolutionary changes in industrial and
academic research and development. The past decade, and notably the past few years,
has been transformative for AI, not so much in terms of what we can do with this
technology (the theory) as what we are doing with it (the practice) [12].
The main ideology behind AI has been the faithful emulation of human intelligence
and an attempt to give machines coherent reasoning. Can machines gain common sense?
There are two fundamental paradigms for this problem. One is to provide a compre-
hensive set of facts and rules encoding human knowledge (an undertaking of the Cyc
Project since 1984 [13]). The other is to facilitate the self-learning process of ma-
chines, similar to how humans develop common sense. The latter approach has shown
perception of the world around them. With the abundance of work that exists in this
sphere, it is not difficult to experience the power as well as various limitations of this
technology. The lack of generality, data bias, fairness, and robustness to unforeseen
situations are some of the well-known challenges in this field. The aim of this thesis
work is to focus on one such challenge: robust image recovery. It is a relevant
issue that remains an open problem and is omnipresent in real-life situations. Self-driving
vehicles, digital photography, medical image analysis, remote sensing, surveillance, and
digital entertainment are a few of the applications where due to unprecedented suscep-
tibilities, existing solutions might not perform as expected. Robustness against natural
corruptions or robustness in medical problems are some of the non-trivial open chal-
lenges in the sphere of robustness in deep learning, and to tackle such big issues it is
advantageous to break them into smaller sub-tasks. Thus, a small step towards handling
those situations is to solve the classical yet active problem of robust image recovery un-
der synthetic noise models. If one can solve this problem reliably, it can give us insight
as to how to work our way towards the more significant hurdles.
One of the rudimentary challenges in the field of image processing and computer
vision is image denoising, where the underlying goal is to approximate the actual image
by suppressing noise from a noise-contaminated version of the image. Image noise may
be caused by several intrinsic (e.g., sensor) and extrinsic (e.g., environment) conditions
which are often not possible to avoid in practical situations. Image denoising is a funda-
mental yet active problem and still remains unsolved because noise removal introduces
artifacts and unwanted effects such as blurring of the images.
The focus of this thesis work is to summarize the fundamental concepts behind image
denoising, survey the existing work in the field, provide a qualitative analysis of
state-of-the-art methods for this task, and finally present a probable approach with
supporting arguments and the preliminary experiments undertaken during the course of
this research work.
1.1 Application
Digital images play an essential role in daily-life applications such as satellite
television, medical imaging, and computed tomography, as well as in areas of
research and technology such as geographical information systems and astronomy. So it
is not difficult to gauge the importance of recovering precise images. It is the first and
vital step before images can be analyzed or used further. Thus, image denoising plays a
vital role in a wide range of applications such as image restoration, image registration,
visual tracking, image segmentation, and image classification, where obtaining the orig-
inal image content is crucial for strong performance. It is important to develop effective
denoising techniques in order to compensate for the data corruption introduced
when data is collected by imperfect instruments, which are generally contaminated by
noise, issues with the data acquisition process [14], and interceding natural phenomena
[15].
An important practical application of image denoising is in the medical sciences. Medical
images obtained from MRI are among the most common tools for diagnosis in medicine and
are often affected by random noise arising in the image acquisition process. Hence,
noise removal is essential in medical imaging applications in order to enhance and recover
fine-grained details that may be hidden in the data.
Medical imaging modalities, including X-rays, Magnetic Resonance Imaging (MRI), Computed
Tomography (CT), and ultrasound, are susceptible to noise for the reasons discussed
above. Hence, it is important to recover the original, high-quality, noiseless images.
Image denoising in the field of medicine is referred to as medical image denoising (MID):
the process of improving the perceptual quality of degraded noisy images captured with
specialized medical image acquisition devices [16]. Figure 1.1 is an example of how
existing MID methods exhibit deficiencies in large-scale noise removal from medical
images and fail in numerous cases [16].
Another use case for medical imaging applications is with respect to X-ray images.
X-ray images provide crucial support for diagnosis and decision-making in several diverse
clinical applications. However, X-ray images may be corrupted by statistical noise, thus
gravely deteriorating the quality and raising the difficulty of diagnosis [17][4]. Therefore,
X-ray denoising is mandatory for improving the quality of raw X-ray images and their
relevant clinical information content and analysis.
Figure 1.1: Performance of existing medical image denoising methods in removing image
noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c) Result obtained by DnCNN
[2]. Source: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation.
Figure 1.2: Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with
standard deviation of 25. Left to right: (a) noisy image, (b) denoised image using an
unsupervised cleaning approach. Source: Shamshad et al. [4].
1.2 Thesis Overview
The rest of the thesis is organized as follows:
• Chapter 2 briefly presents the basic concepts and terminologies used in image
restoration studies which are used throughout the thesis.
• Chapter 3 presents a comprehensive survey of denoising algorithms developed till
2017.
• Chapter 4 describes the Deep Image Prior concept, its limitations, and related
work developed over the DIP ideology.
• Chapter 5 presents an alternative perspective towards image denoising with the
help of two important ideas which are discussed in detail. Finally, the proposed
methodology is introduced.
• Chapter 6 hashes out the experimental setup and analysis for the proposed single
image denoising methodology.
• Chapter 7 presents the conclusion and discusses some future work directions.
Chapter 2
Background
In this chapter, we briefly summarize the fundamental concepts and definitions pivotal
to understanding the rest of the thesis, which we will revisit frequently in later
chapters.
2.1 Deep Learning
Deep Learning is a sub-domain of machine learning concerned with algorithms that aim
to imitate the structure and functionality of the human brain, known as artificial
neural networks (ANNs), combined with representation learning [18]. More specifically,
a neural network is inspired by the biological neuron: in machine learning, it is an
information processing technique that uses the same concept as biological neural
networks without being identical to them [19]; the analogy is shown in Figure 2.1.
There are multiple deep learning architectures, such as deep neural networks (DNNs),
recurrent neural networks (RNNs), and convolutional neural networks (CNNs), that have
been applied to various fields, including computer vision, machine vision, speech
recognition, natural language processing, audio recognition, social network filtering,
machine translation, and bioinformatics, where they have produced remarkable results
comparable to or surpassing human expert performance [18]. The "learning" in "deep
learning" can be supervised, unsupervised, or semi-supervised, whereas the term "deep"
refers to the number of layers through which the input is transformed.
Figure 2.1: ANN inspired by biological neural networks. The inputs are denoted by
x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon terminal ends. Source:
Wikipedia[5].
2.2 Inverse Problems
An inverse problem is the process of calculating, from a set of observations, the causal
factors that produced them: for example, calculating the density of the Earth from
measurements of its gravity field [20]. If the information given by a measurement is
incomplete, incorrect, or improper, the problem is ill-posed [21]. The study of inverse
problems tries to quantify when and to what degree a problem is ill-posed, and to
extract maximum information under practical circumstances. Inverse problems occur in
many applications, such as image denoising, image deblurring, inpainting, and
super-resolution [22]. Figure 2.2 depicts a forward problem with respect to an inverse
problem.
2.2.1 Image Denoising
Image denoising refers to the removal of noise from a noisy image, so as to restore the
true image. The aim is to recover meaningful information from noisy images in the
process of noise removal and obtain high-quality images, which is an important open
research problem. The primary reason is that, from a mathematical perspective, image
denoising is an inverse problem and its solution is not unique.
Note that image restoration and image denoising are different terminologies: image
Figure 2.2: Inverse problem. Source: caltech GE193[6]
denoising is a type of image restoration problem. There are several types of image
restoration problems, for example super-resolution and inpainting, and image denoising
is one of them. Therefore, algorithms that solve image restoration problems are also
applicable to image denoising. The work in this thesis is centered around image
denoising.
2.2.2 Image Restoration Problem Formulation
The problem of image restoration can be traditionally formulated as [23]

y = Hx + c    (2.1)

where x ∈ R^n represents the unknown original image, y ∈ R^m represents the
observations, H is an m × n degradation matrix, and c ∈ R^m is a vector of i.i.d.
(independent and identically distributed) Gaussian random variables with zero mean and
standard deviation σ_c. Equation 2.1 can thus represent different image restoration
problems: image denoising when H is the n × n identity matrix I_n, image inpainting
when H is a selection of m rows of I_n, and image deblurring when H is a blurring
operator [23].
2.3 Deep Neural Networks
A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers
between the input and output layers [24][25][18]. DNNs, which employ deep architec-
tures, can represent functions of higher complexity as the number of units per layer
and the number of layers increase [26]. Given enough labeled training data and
suitable models, deep learning approaches can learn the desired mapping functions.
In this work, which pivots around understanding the theoretical foundations of one
such deep learning network, we will make use of CNNs and ResNets; the latter is
detailed in the section below.
2.3.1 ResNets
ResNet, short for Residual Network, is a specific type of neural network introduced
by He et al. in 2015 [7]. They introduced a residual learning framework to ease the
training of networks substantially deeper than those used previously, explicitly
reformulating the layers as learning residual functions with respect to the layer
inputs instead of learning unreferenced functions [7].
Figure 2.3: Residual learning: a building block. (Source: He et al. [7])
The popularity of ResNets stems from the fact that they solved a big open challenge.
When deeper networks start converging, a degradation problem is exposed: as network
depth increases, accuracy saturates and then drops rapidly [7]. Surprisingly, such
degradation is not caused by overfitting: adding more layers to an appropriately deep
model leads to higher training error, as reported in
[27][28]. In 2015, He et al. [7] empirically showed that there is a maximum threshold
for depth with the traditional CNN model.
Hence, [7] solved the degradation problem by introducing a deep residual learning
framework whose basic unit is the residual block depicted in Figure 2.3. Instead of
assuming that each set of a few stacked layers directly fits a desired underlying
mapping, they explicitly let these layers fit a residual mapping. Practically, this
idea is realised by feed-forward neural networks with connections that skip one or
more layers [7][29][30], called "shortcut connections". These shortcut connections
simply perform identity mapping, and their outputs are added to the outputs of the
stacked layers (Figure 2.3). One of the biggest advantages of this approach is that
these identity shortcut connections add neither extra parameters nor computational
complexity. Thus, ResNets are heavily used in a variety of modern tasks.
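As a toy illustration of the residual block in Figure 2.3 (not code from the thesis), the sketch below adds the identity shortcut to the output of two stacked layers in NumPy; the particular form F(x) = W2 relu(W1 x) and the dimensions are illustrative assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, W1, W2):
    """Two-layer residual block: output = relu(F(x) + x), where
    F(x) = W2 @ relu(W1 @ x) and the identity shortcut adds no
    extra parameters (Figure 2.3)."""
    return relu(W2 @ relu(W1 @ x) + x)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1

out = residual_block(x, W1, W2)
assert out.shape == x.shape       # identity shortcut requires matching dimensions

# With zero weights, F(x) = 0 and the block reduces to relu(x):
zero = np.zeros((d, d))
assert np.allclose(residual_block(x, zero, zero), relu(x))
```

The last check illustrates why residual learning eases optimization: driving the residual branch toward zero recovers (near-)identity behaviour, which plain stacked layers struggle to learn.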
2.3.2 GANs
Ian J. Goodfellow et al., 2014 [31] proposed a new generative model estimation
procedure, the adversarial nets framework, where the generative model is pitted against
an adversary: a discriminative model that learns to ascertain whether a sample came
from the model distribution or the data distribution. This is analogous to a team of
counterfeiters trying to produce fake items and police trying to detect them. The
competition in this game drives both teams to improve their methods until the
counterfeits are indistinguishable from the genuine articles. Thus, two neural
networks compete with each other (in the form of a zero-sum game, where one network's
gain is another network's loss) [32], and this model architecture is called a
Generative Adversarial Network (GAN). More formally, the GAN architecture involves two
sub-models: a generator model, used to generate new plausible examples from the
problem domain, and a discriminator model, used to classify examples as real (from the
domain) or fake (produced by the generator) [33].
Although originally proposed as a form of generative model for unsupervised learn-
ing, GANs have also proven effective for semi-supervised learning [34], fully supervised
learning [35], and reinforcement learning [36]. A more standardized approach for the GAN
framework called Deep Convolutional Generative Adversarial Networks, or DCGAN,
that led to more stable models was later formalized by Alec Radford, et al. [37] in 2015.
2.4 Autoencoder
An autoencoder is a special type of neural network (NN), mainly designed to encode
the input into a compressed, meaningful representation and then decode it back such
that the reconstructed input is as similar as possible to the original one [8].
Autoencoders are an unsupervised learning technique in which neural networks are
leveraged for representation learning. Specifically, one designs a network
architecture that imposes a bottleneck, forcing a compressed knowledge representation
of the original input. If the input features were each independent of one another,
this compression and subsequent reconstruction would be a very difficult task;
however, if some structure exists in the data (i.e., correlations between input
features), it can be learned and leveraged when forcing the input through the
bottleneck [38]. Autoencoders were first introduced in the 1980s by Hinton and the
PDP group [39] as a NN trained to reconstruct its input. Mathematically, their main
task of learning an "informative" representation of the data can be formally defined
[8][40] as learning functions A : R^n → R^p and B : R^p → R^n that satisfy:
arg min_{A,B} E[∆(x, B ◦ A(x))]    (2.2)
where E is the expectation over the distribution of x, and ∆ is the reconstruction loss
function, that measures the distance between the output of the decoder and the input.
The loss function is usually set to be the l2-norm [40].
Usually, A and B are neural networks [41]. In the special case where A and B are
linear operations, the model is called a linear autoencoder [42][40]; a linear
autoencoder attains the same latent representation as Principal Component Analysis
(PCA) [43]. Therefore, an autoencoder is a generalization of PCA: instead of finding
a low-dimensional hyperplane in which the data lies, it is able to learn a non-linear
manifold [44]. Thus, while conceptually simple, autoencoders are quite popular and
play an important role in machine
learning. Autoencoders can be trained gradually layer by layer or end-to-end. In the
layer-by-layer case, the trained layers are "stacked" together, leading to a deeper
encoder. In [45], this is done with convolutional autoencoders, and in [46] with
denoising autoencoders. We will revisit autoencoders in chapters 5 and 6.
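The objective in Equation 2.2 and the linear-autoencoder/PCA connection can be sketched numerically. The following NumPy snippet (illustrative only, not thesis code) evaluates the empirical ℓ2 reconstruction loss for a linear encoder A and decoder B, and checks that the top-p principal directions reconstruct data lying in a rank-p subspace with essentially zero loss; the data dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data x in R^n drawn exactly from a rank-p linear subspace.
n, p, N = 20, 3, 500
basis = rng.standard_normal((n, p))
X = (basis @ rng.standard_normal((p, N))).T       # N samples, shape (N, n)

def recon_loss(A, B, X):
    """Empirical version of Eq. (2.2) with the l2 loss: mean squared
    distance between x and B(A(x)), for linear A (p x n) and B (n x p)."""
    return np.mean(np.sum((X - X @ A.T @ B.T) ** 2, axis=1))

# The optimal linear encoder/decoder pair is given by the top-p
# principal directions of the data (PCA).
U, S, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
A_opt = Vt[:p]       # encoder: project onto the top-p components
B_opt = Vt[:p].T     # decoder: map the p-dimensional code back to R^n

assert recon_loss(A_opt, B_opt, X) < 1e-9   # rank-p data reconstructs exactly
```

A non-linear autoencoder replaces A and B with neural networks, which is what allows it to learn a curved manifold rather than the flat subspace PCA recovers.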
Figure 2.4: An example of an autoencoder. The input image is encoded to a compressed
representation and then decoded. (Source: Bank et al. [8])
2.5 Noise
Image noise is a random variation of brightness or color information in images, and is
generally an aspect of electronic noise [47]. It introduces unwanted information into
digital images and obscures the desired information. Noise produces undesirable effects
such as artifacts, unrealistic edges, unseen lines, corners, and blurred objects. There
are multiple sources of noise in images, arising from aspects such as image acquisition,
transmission, and compression [48].
2.5.1 Noise models
There are different types of noise models; we mention four popular ones here: Gaussian,
salt-and-pepper, shot, and speckle noise. Different processing algorithms exist for
different noise models. For any input image, we model the noisy image under additive
noise as

g(x) = I(x) + v(x)    (2.3)

where I(x) is the original image without any noise, v(x) is the additive noise, g(x)
is the input image with noise, and x is the set of pixels in the input image.
1. Gaussian Noise: Gaussian noise generally happens in the analog signal in the
electronics of the camera. It can be modeled as additive noise and acts on the
input image I to produce a degraded image y :
y = I + ση,    η ∼ N(0, 1)    (2.4)
where σ is standard deviation [49][50][51]. Example of denoising algorithm for this
type of noise: Gaussian filtering.
2. Salt and pepper Noise: this impulse noise corresponds to random pixels which are
either saturated or turned off. It can happen in equipment with electronic spikes,
and we can model this as:
y = { I with probability p ; b with probability 1 − p }    (2.5)
where b ∼ Ber(0.5) is a Bernoulli variable of parameter 0.5. Algorithms used for
image recovery from this type of noise - median filtering, mean filtering [52][47].
3. Shot noise: or the photon shot noise is the dominant noise in the brighter parts
of an image from an image sensor and is typically caused by statistical quantum
fluctuations, i.e., disparity in the number of photons observed at a given exposure
level [47]. The root-mean-square value of shot noise is proportional to the square
root of the image intensity, and the noises at different pixels are not related to one
another. This noise model follows a Poisson distribution, which, except at very low
intensity levels, approximates a Gaussian distribution.
4. Speckle noise: is a granular noise that exists inherently in an image and corrupts
its quality. This noise can be generated by multiplying random pixel values with
different pixels of an image [48]. A fundamental challenge in optical and digital
holography is the presence of speckle noise in the image reconstruction process.
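The four noise models above can be simulated in a few lines of NumPy. This is an illustrative sketch only; the scale parameters and the toy image are arbitrary choices, not values used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=0.1):
    # y = I + sigma * eta, eta ~ N(0, 1)  (equation 2.4)
    return img + sigma * rng.standard_normal(img.shape)

def add_salt_pepper_noise(img, p=0.9):
    # Keep each pixel with probability p; otherwise replace it with a
    # Bernoulli(0.5) value b in {0, 1}  (equation 2.5).
    keep = rng.random(img.shape) < p
    b = rng.integers(0, 2, img.shape).astype(img.dtype)
    return np.where(keep, img, b)

def add_shot_noise(img, peak=255.0):
    # Photon shot noise: each pixel is a Poisson draw whose mean is the
    # scaled clean intensity, so the noise std grows like sqrt(intensity).
    return rng.poisson(img * peak) / peak

def add_speckle_noise(img, sigma=0.1):
    # Multiplicative speckle: pixels are scaled by random factors.
    return img * (1.0 + sigma * rng.standard_normal(img.shape))

clean = rng.random((64, 64))   # toy "image" with values in [0, 1)
noisy = add_gaussian_noise(clean, sigma=0.1)
```

Note how the additive models (Gaussian) fit equation 2.3 directly, while shot and speckle noise are signal-dependent and cannot be written as I(x) + v(x) with v independent of I.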
2.6 Evaluation metrics
The aim of a denoising algorithm is to recover the original image as much as possible from
its noise-corrupted version. To evaluate denoising algorithms, different image quality
assessment measurements have been adopted to compare the denoised estimation and
ground truth high-quality images. Below, three representative quantitative
measurements are discussed, amongst which PSNR is the most commonly used metric.
2.6.1 MSE
Mean Squared Error (of a process for estimating an unobserved quantity) of an estimator
measures the average squared difference between the actual and estimated values. MSE
is equivalent to the expected value of the squared error loss and signifies the quality of
an estimator. It is always non-negative, and values closer to zero are better.
For a given noise-free m × n monochrome image I and its noisy approximation say
K, mathematically MSE can be defined as [53]:

MSE = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]²    (2.6)
2.6.2 PSNR
Peak signal-to-noise ratio (PSNR) is a term that signifies the ratio between the max-
imum power of a signal and the power of contaminating noise that affects the fidelity
of its representation [53]. PSNR is defined via the mean squared error (MSE). Given
the ground truth image I and denoised estimation K, based on MSE, the definition of
PSNR is:
PSNR = 10 log10 (MAX_I² / MSE)    (2.7)

In the above equation, MAX_I is the maximum possible pixel value of the image.
This value is 255 when the pixels are represented using 8 bits per sample.
While both MSE and PSNR are well accepted and are heavily used in several applications,
they do not correlate well with the visual perception of the human visual system,
which is highly non-linear and complex [54][55][56]. Thus, they are not
a good fit to measure the perceptual similarity between two images. Yet, PSNR is still
the most commonly used index to compare two images.
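Equations 2.6 and 2.7 translate directly into NumPy. A minimal sketch, assuming floating-point images with a known peak value MAX_I:

```python
import numpy as np

def mse(ref, est):
    # Mean squared error over all pixels (equation 2.6).
    return np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)

def psnr(ref, est, max_val=255.0):
    # Peak signal-to-noise ratio in dB (equation 2.7); max_val is 255
    # when pixels are represented with 8 bits per sample.
    err = mse(ref, est)
    if err == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / err)

a = np.full((4, 4), 100.0)
b = a + 10.0            # constant error of 10 per pixel -> MSE = 100
print(mse(a, b))        # 100.0
print(psnr(a, b))       # 10 * log10(255^2 / 100), about 28.13 dB
```

Because PSNR is a log of a ratio, halving the MSE raises the PSNR by roughly 3 dB regardless of the image content.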
2.6.3 SSIM
Besides the MSE and PSNR, perceptual quality measurements have also been proposed
to evaluate denoising algorithms. One of the representative measurements is the struc-
tural similarity (SSIM) index [57]. The SSIM is a procedure for estimating the perceived
quality of digital television and cinematic pictures, as well as other kinds of digital im-
ages and videos. This metric is used for measuring the similarity between two images
[58].
The SSIM index can be calculated on various windows of an image. The measure
between two windows x and y of common size N × N is [58]:
SSIM(x, y) = [(2µxµy + c1)(2σxy + c2)] / [(µx² + µy² + c1)(σx² + σy² + c2)]    (2.8)

where µx and µy are the averages of x and y, σx² and σy² are the variances of x and y,
σxy is the co-variance of x and y, and c1 = (k1L)², c2 = (k2L)² are two variables to
stabilize the division with a weak denominator. L is the dynamic range of the pixel
values, with k1 = 0.01 and k2 = 0.03 by default [57][58].
The above SSIM formula is based on three comparison measurements between the
samples x and y: luminance (l), contrast (c), and structure (s).
l(x, y) = (2µxµy + c1) / (µx² + µy² + c1)    (2.9)

c(x, y) = (2σxσy + c2) / (σx² + σy² + c2)    (2.10)

s(x, y) = (σxy + c3) / (σxσy + c3)    (2.11)
where c3 = c2/2. Thus, using the above three definitions, equation 2.8 can be rewritten
as [58]:

SSIM(x, y) = [l(x, y)^α · c(x, y)^β · s(x, y)^γ]    (2.12)

with weights α = β = γ = 1 to obtain the same form as equation 2.8.
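A single-window version of equation 2.8 can be sketched in NumPy. Note that practical implementations (e.g., scikit-image's) average the index over many local windows; this global variant treats the whole image as one window and is only illustrative:

```python
import numpy as np

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    # Single-window SSIM (equation 2.8): the whole image is treated as
    # one window; L is the dynamic range of the pixel values.
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

img = np.arange(64, dtype=np.float64).reshape(8, 8)
print(ssim_global(img, img))          # 1.0 for identical images
print(ssim_global(img, img + 50.0))   # below 1.0 once the mean shifts
```

Unlike MSE, the index is bounded above by 1 and is attained exactly when the two windows are identical.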
Chapter 3
Image Denoising Algorithms:
Review
In this chapter we attempt to capture the research work and methods developed till
now for the open challenge of image denoising.
There exist several ways to classify existing image denoising algorithms. The three
popular approaches to classify them are:
• Inspired by image processing field concepts [59] - spatial domain, transform
domain and neural network (NN) based methods
• based on popular families in image restoration methods [10] - learning-based meth-
ods and learning-free methods
• based on how an image prior is exploited to generate a high-quality estimation
with respect to an input image [57] - implicit and explicit methods
Selecting the most intuitive categorization of the three, we will classify existing
denoising algorithms into spatial domain, transform domain and NN based methods.
Furthermore, we discuss the prior work in chronological order; as the classical spatial
and transform domain algorithms have been thoroughly reviewed in previous papers
[15], [60], we focus more on recently proposed NN based algorithms.
3.1 Spatial domain methods
Spatial domain technique is a traditional denoising method. It is a technique that is
directly applied to images in the form of spatial filters for noise removal [61]. Spatial
domain methods can be further sub-categorized into - spatial domain filtering (SDF)
and variational denoising methods [59].
3.1.1 Spatial domain filtering
Spatial domain filtering methods can also be grouped as implicit methods as per point
(2) above, and they can be divided into further two classes - linear and non-linear
filtering. Linear filters tend to blur sharp edges, destroy lines and other fine image
details, and perform poorly in the presence of signal-dependent noise [59]. For example,
a mean filter (linear filter) is optimal for Gaussian noise in the sense of mean square
error, but it tends to over-smooth images with high noise. The Wiener filter was introduced
to combat this disadvantage, but it also can easily blur sharp edges.
Whereas by using non-linear filters, such as median filtering [62][63] and weighted
median filtering [64], noise can be suppressed without explicitly identifying it. For example,
bilateral filtering [65] is widely used for image denoising, as it is a non-linear, edge-
preserving, and noise-reducing smoothing filter.
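As a concrete illustration of impulse-noise removal, a 3×3 median filter (here via SciPy's `scipy.ndimage.median_filter`) discards isolated salt impulses while leaving flat regions untouched. This is a toy sketch, not an experiment from this thesis:

```python
import numpy as np
from scipy.ndimage import median_filter

# Flat gray image corrupted by isolated "salt" impulses on a sparse grid,
# so no 3x3 window ever contains more than a couple of impulse samples.
clean = np.full((16, 16), 0.5)
noisy = clean.copy()
noisy[::5, ::5] = 1.0

# The median filter replaces each pixel by the median of its 3x3
# neighborhood; isolated impulses never dominate a window and vanish.
denoised = median_filter(noisy, size=3)
print(np.abs(denoised - clean).max())   # 0.0 -- all impulses removed
```

A linear mean filter on the same input would instead smear each impulse over its neighborhood, which is exactly the over-smoothing behavior discussed above.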
SDF methods, in general, adopt priors of high-quality images implicitly, where the
priors are ingrained into specific restoration operations. Such an implicit modeling
strategy was used in most of the early years’ image denoising algorithms, some of which
are discussed above [65][66][67][57].
Based on assumptions about superior-quality images, heuristic operations have been
designed to generate estimations directly from the degraded images; for example,
filtering-based methods are built on the smoothness assumption.
3.1.2 Variational denoising methods
Besides implicitly embedding priors into restoration operations, variational denoising
methods explicitly characterize image priors and subsequently use the Bayesian method
to produce high quality reconstruction results. Having the degradation model p(y|x)
and specific prior model p(x), different estimators can be used to estimate latent image
x. One popular approach is the maximum a posteriori (MAP) estimator, where

x̂ = arg max_x p(x|y) = arg max_x p(y|x)p(x)    (3.1)
(using Bayes’ theorem), with which we seek, given the corrupted observation and the
prior, the most probable estimation of x. In the case of AWGN (additive white Gaussian
noise), the above equation 3.1 can be reformulated as an objective function that is the
sum of a data fidelity term (least squares) and a regularizer (discussed in more detail in
the chapters ahead).
Thus, for the variational denoising methods, the key is to find a suitable image
prior; for example, some successful prior models include gradient priors, non-local self-
similarity (NSS) priors, sparse priors, and low-rank priors [59].
Total variation (TV) regularization
TV regularization is defined on the statistical fact that natural images are locally
smooth, and the pixel intensity gradually varies in most regions [59]. TV regularization
uses a Laplacian distribution to model image gradients, resulting in an l1 norm penalty
on the gradients of the estimated image. Mathematically, it can be defined as [68]:
RTV (x) = ||∇x||1, where ∇x is the gradient of x.
Total variation is one of the most extensively used image priors: it promotes sparsity
in image gradients, allows the optimal solution to be computed effectively, and can also
retain sharp edges. Although it has been shown to be beneficial in a number of applications
[69][70][71] and is one of the most notable methods for image denoising, it has a few
limitations. The three main disadvantages are that 1) textures tend to be over-smoothed,
2) flat areas are approximated by a piece-wise constant surface, resulting in a stair-casing
effect, and 3) the resultant image suffers from losses of contrast [72][68][73][59].
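The anisotropic TV penalty RTV(x) = ||∇x||1 above is simply the sum of absolute forward differences, which a short NumPy sketch makes concrete (the ramp image and the noise level are illustrative choices):

```python
import numpy as np

def tv_norm(x):
    # Anisotropic total variation: l1 norm of horizontal and vertical
    # forward differences, i.e. ||grad x||_1.
    dh = np.abs(np.diff(x, axis=1)).sum()
    dv = np.abs(np.diff(x, axis=0)).sum()
    return dh + dv

rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))   # gentle ramp image
noisy = smooth + 0.1 * rng.standard_normal(smooth.shape)
print(tv_norm(smooth) < tv_norm(noisy))   # True -- noise inflates TV
```

This is exactly why TV works as a regularizer: noise adds many small oscillations, each of which contributes to the l1 gradient sum, so penalizing TV pushes the estimate toward locally smooth images.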
Extensive studies have been conducted to improve on the TV regularizer’s performance in
image smoothing by adopting partial differential equations, while some have also proposed
wavelet filters as analysis sparse filters [74][75]. Beck et al. [76] proposed a fast gradient-
based method for constrained TV, which is also a generic framework covering other
types of non-smooth regularizers. [77] proposed a statistical model to capture the
heavy-tailed distribution of coefficients via a robust penalty function - the lp norm - and
[78] introduced a normalised sparsity measure.
3.2 Transform domain methods
In contrast to spatial domain filtering methods, transform domain (TD) filtering
methods first transform the given noisy image to another domain and then apply a
denoising procedure on the transformed image based on the different image and noise
characteristics (larger coefficients denote the high-frequency part, i.e., the details or
edges of the image, whereas smaller coefficients denote the noise). These methods
operate on an important observation that the characteristics of image information and
noise are different in the transform domain. Furthermore, the transform domain filtering
methods can be subdivided based on the chosen basis transform functions - it may be
data-adaptive or non-data adaptive [79].
3.2.1 Data adaptive methods
Examples of data-adaptive methods are Independent Component Analysis (ICA)
[80][86] and PCA [81][82] functions. Both of them are adopted as the transform tools on the
given noisy images. One of the main disadvantages of data-adaptive methods is that
they have high computational costs because they use sliding windows and need a sample
of noise-free data or at least two image frames from the same scene [59]. However, it is
entirely possible that in some applications, it might be challenging to obtain noise-free
training data.
3.2.2 Non Data adaptive methods
The non-data adaptive TD filtering methods can additionally be subdivided into two
domains - the spatial-frequency domain and the wavelet domain. We will not discuss the
spatial-frequency domain in this work, but we will briefly mention the wavelet transform,
as it is one of the most researched transform techniques. The wavelet transform [83]
breaks down the input data into a scale-space representation. It has been shown that
wavelets can successfully remove noise while preserving the image characteristics,
regardless of its frequency content [84][85][86][87][59].
3.2.3 Block-matching and 3D filtering: BM3D
BM3D [1] is a non-local, adaptive, non-parametric filtering strategy for image denoising,
based on an enhanced sparse representation in the transform domain. The enhancement
of the sparsity is attained by grouping similar 2D fragments of the image into 3D
data arrays, called ”groups”. To deal with these 3D groups, a special procedure
called collaborative filtering is used. This procedure includes three consecutive steps:
the 3D transformation of a group, shrinkage of transform spectrum, and inverse 3D
transformation. Thus, the 3D estimate of the group is obtained, which consists of an
array of jointly filtered 2D fragments. Due to the similarities between the grouped
blocks, the transform can achieve a highly sparse representation of the true signal so
that the noise can be well separated by shrinkage. Thus, collaborative filtering exposes
even the finest details shared by grouped fragments, and at the same time, it preserves
the important unique features of each individual fragment [1][88].
Figure 3.1: Scheme of the BM3D algorithm. (credits: Marc Lebrun [9])
This is one of the most popular, powerful, and effective denoising methods and
had been state-of-the-art until recently. After the original work of [1], many improved
versions of BM3D were also developed [89][90]. [90] proposed the block-matching and
4D filtering (BM4D) method. Dabov et al., 2007 [91] proposed an improvement of the
BM3D method for color image denoising that exploits filtering in a highly sparse local
3D transform domain in each channel of a luminance-chrominance color space. Further,
many follow-up works combined the sparse prior and NSS prior [92]. [93] collected non-
local similar patches to solve the group-sparsity problem to achieve better denoising
results. [94][57] proposed a non-local centralized sparse representation model in which
the mean value of the representation coefficients is pre-estimated based on the patch
groups.
3.3 Deep Neural Network methods
The original deep learning technologies were first used in image processing in the 1980s
[95] and were first used in image denoising by Zhou et al. [96][97][98]. After that, a
feed forward network was used to reduce the high computational costs and to make a
trade-off between denoising efficiency and performance [99]. The feed-forward network
can smooth the given corrupted image by Kuwahara filters [100], which were similar to
convolutions. Although these techniques were effective, these networks did not allow
the addition of new plug-and-play units, which restricted their generalization abilities
and usage in practical applications.
To overcome these limitations, convolutional neural networks (CNNs) [101] were
proposed. They had a slow start after their introduction due to a number of then-existing
issues, such as the vanishing gradients problem, the high computation cost of activation
functions such as sigmoid and tanh, and the lack of appropriate hardware for efficient
computation. But after the inception of AlexNet in 2012, things changed, and deep
network architectures were widely applied in fields such as video, natural language
processing, and speech processing. In recent years, several CNN-based denoising methods
have been proposed [102][103][104][2][105][59]. Compared to that of [106], the performance
of these methods has been greatly improved. Furthermore, neural network based (or CNN-
based) denoising methods can be divided into two categories: multilayer perceptron
(MLP) models and deep learning methods [59]. We discuss both categories briefly.
A multilayer perceptron (MLP) is a class of feed-forward artificial neural networks
(ANN) [107]. Some popular MLP-based image denoising models include auto-encoders
proposed by Vincent et al. [103] and [104]. Chen et al. [102] introduced a feed-forward
deep network called the trainable non-linear reaction diffusion (TNRD) model, which
achieved a better denoising effect. In general, MLP-based methods are beneficial as they
work efficiently owing to fewer inference procedure steps. Moreover, because optimiza-
tion algorithms [108] have the ability to derive the discriminative architecture, these
methods have better interpretability. On the other hand, interpretability can
increase the cost of performance [2]. [109] presented a patch-based denoising algorithm
that is learned on a large dataset with an MLP. Results on additive white Gaussian
(AWG) noise were competitive, but the method had its limitations, as generalization
performance for other noise models was not competitive.
Deep networks were first applied to image denoising tasks in 2015 [110][111]. The
authors of [110] pretrained a stacked denoising auto-encoder and applied dropout to
prevent co-adaptation between units. The combination of dropout and the stacked
auto-encoder enhanced performance and reduced time in the fine-tuning phase. For addressing multiple
low-level tasks via a model, a denoising CNN (DnCNN) [2] consisting of convolutions,
batch normalization (BN) [112], rectified linear unit (ReLU) [113] and residual learning
(RL) [7] was proposed to deal with image denoising and, other image restoration tasks.
CNN-based denoising methods were thus a success, which is attributed to their large
modeling capacity and to tremendous advances in network training and design. However,
discriminative denoising methods at that time (2018) were limited in flexibility: the
learned model was usually tailored to a specific noise level, and methods like [2] did not
generalize well to noise models other than the AWGN on which they were trained.
To mitigate the above-said limitations, [114] introduced a fast and flexible denoising
CNN (FFDNet), which takes the noise level together with the noisy image patch as input
to the denoising network, improving denoising speed and handling blind denoising.
A generative adversarial network (GAN) CNN blind denoiser (GCBD) [115] resolved the
problem of handling unpaired noisy images by first generating the ground truth, then
using the obtained ground truth as input into the GAN to train the denoiser. For more
complex corrupted images, a deep plug-and-play super-resolution (DPSR) method [114]
was developed to estimate blur kernel and noise and recover a high-resolution image.
Thus, in this section, we discussed existing neural network based methods for image
denoising, including both CNN and MLP based methods, both of which are learning-based
approaches. In the next chapter, we will focus on learning-free approaches like ”Deep
Image Prior”, whose intuition directly contradicts that of learning-based methods and
which are the foundation of most current state-of-the-art image denoising methods.
Chapter 4
Deep Image Prior
In this chapter, we introduce the concept of Deep Image Prior (DIP) and discuss in
detail the model architecture and usage. After that, we introduce a few other works
that are built upon this concept.
4.1 Image Priors
Before discussing the seminal work of [10], it is crucial to understand the concept of
image priors. Image priors are prior information on any set of images [110] that one
can use in image processing (or computer vision) problems to enhance results, ease the
choice of processing parameters, resolve indeterminacies, etc. These priors, or their
approximations, can be converted into mathematical formulations and used as part
of some central mechanism or procedure (or algorithm).
4.2 Deep Image Priors
Ulyanov et al. [10] show that the structure of a generator network is sufficient to
capture a great deal of low-level image statistics prior to any learning. What makes
this idea outstanding is that it directly contradicts the traditional understanding
that the excellent performance of deep convolutional networks on denoising tasks is
imputed to their ability to learn realistic image priors from a large number of example
images. With their novel concept, there is no need to train a network on a dataset, or
even to perform any training at all.
To show this, we apply untrained Convolutional Neural Networks (ConvNets), and
instead of training a ConvNet on a large dataset of sample images, we fit a generator
network to a single corrupted image. This way, the network weights serve as a parame-
terization of the restored image, and the weights are randomly initialized and fitted to a
specific degraded image under a task-dependent observation model. In such a way, the
only information used to perform image reconstruction is contained in the single noisy
input image and the handcrafted structure of the network used for reconstruction [10].
The paper [10] contributes the following four important empirical results:
• Low-level statistics can be captured by an untrained network.
• The model parameterization presents a high impedance to image noise, and hence,
it can be naturally used to filter out noise from a given image. Also, the DIP
method can work under blindness assumption.
• Choice of deep generator ConvNet architecture does have an impact on results as
different architectures impose rather different priors.
• DIP is similar to BM3D, one of the most popular transform-domain techniques, in
the respect that they both exploit self-structure and similarity.
4.2.1 Method
First, it is important to see the mathematical formulation of the idea of deep image
priors, upon which some important concepts ahead pivot.
A function with a one-dimensional input and a multidimensional output can be
thought of as drawing a curve in space, and such a function is called a parametric
function (its input is called a parameter) [116]. Deep generator network is an example
of a parametric function that maps a code vector z to an image x [31].
x = fθ(z) (4.1)
If we interpret the neural network as a parameterization, given in equation 4.1, of the
image x ∈ R^(3×H×W) (channels, height, width), then in this perspective the code z
is a fixed randomized tensor z ∈ R^(C0×H0×W0). The neural
network then can be viewed as mapping the parameters θ (weights from different layers,
bias of the filters in the networks) to the input image x.
To model conditional image distributions p(x|x0), where x is a natural image and x0
its corrupted version, for image restoration problems (denoising) we can view such tasks
as an energy minimization problem [10][117]:

x∗ = arg min_x E(x; x0) + R(x)    (4.2)
where E(x; x0) is a data term (that is task-dependent), and R(x) is a regularizer
which is not tied to a specific application and captures the generic regularity of natural
images. Instead of using an explicit regularizer term in Equation 4.2, the work
[10] shows that using the implicit prior captured by the neural network parameterization
performs better for all image restoration tasks. Thus, the formulation can be rewritten
as [10]:
θ∗ = arg min_θ E(fθ(z); x0),    x∗ = fθ∗(z)    (4.3)
The local minimizer θ∗ can be obtained using an optimizer such as gradient descent,
starting from a random initialization of the parameters θ (Figure 4.1). As the
only information available to the restoration task is the noisy image x0, given the above
equation 4.3, the denoised result is obtained as x∗ = fθ∗(z).
Another important fact demonstrated by [10] was that the choice of network
architecture has a major impact on how the solution space is searched by methods such
as gradient descent. This was an important observation because, even though almost
any image can be fitted by the model, they empirically show that the choice of
architecture has a different impact on performance for different image restoration tasks,
and that the network resists “bad” solutions and descends much more quickly towards
naturally-looking images. The result is that minimizing Equation 4.3 either results in
a good-looking local optimum or, at least, the optimization trajectory passes near
one, as shown in Figure 4.1 [10].
Therefore, to understand this idea better, we can view it mathematically: given
a basic reconstruction task where the target image is x0 and we want to find the values of
parameters θ∗ that reproduce the original image, the E(x; x0) term in equation 4.3 can
Figure 4.1: Image space visualization for DIP. Assume the problem of reconstructing an
image xgt from a degraded measurement x0. In the example of denoising, the
ground truth xgt has non-zero cost E(xgt, x0) > 0. Here, if run for long enough, fitting
with DIP will acquire a solution with near-zero cost quite distant from xgt. However,
often the optimization path will pass close to xgt, and early stopping (here at time
t3) will recover a good solution. Source: Ulyanov et al. [10]
be modelled as the L2 distance that compares the generated image x with x0:

E(x; x0) = ||x − x0||²    (4.4)

⇒ min_θ ||fθ(z) − x0||²    (4.5)
fθ(z) is (typically) a deep CNN with a U-shaped architecture [118]. One reason such
architectures are preferred is that one can draw samples from a DIP by taking random
values of the parameters θ and looking at the generated images fθ(z). Equivalently, this
means we can visualize the starting points of the optimization process (Eq. 4.3) before
even fitting the parameters to the noisy input image. Also, [10] empirically shows
that the samples exhibit spatial structures and self-similarities, and that the scale of these
structures depends on the network depth. Therefore, adding skip connections results
in images that contain structures of different characteristic scales, as is desirable for
modeling natural images. It is then natural that such architectures are the most popular
choice for generative ConvNets.
This U-shaped architecture is an encoder-decoder (”hourglass”) network with
skip connections. LeakyReLU is used as the activation function, and the ADAM
optimizer is used for fitting.
Figure 4.2: Figure depicting the image restoration process using DIP. Starting from
random weights θ0, one iteratively updates them in order to minimize the data term, eq.
(4.3). At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a
fixed tensor and the mapping f is a neural network with parameters θ. The image x is
used to calculate the task-dependent loss E(x, x0). The loss gradient w.r.t. the weights
θ is then calculated and used to update the parameters. Source: Ulyanov et al. [10]
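The loop described in the caption above can be sketched in PyTorch. This is a toy stand-in, not the authors' implementation: a small ConvNet replaces the hourglass fθ, the "noisy observation" is random data, and a fixed iteration budget stands in for early stopping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the hourglass generator f_theta: a tiny ConvNet
# (the real DIP uses a deep encoder-decoder with skip connections).
f = nn.Sequential(
    nn.Conv2d(8, 16, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

x0 = torch.rand(1, 1, 32, 32)   # "noisy observation" (toy data here)
z = torch.randn(1, 8, 32, 32)   # fixed random code tensor z

opt = torch.optim.Adam(f.parameters(), lr=1e-2)
losses = []
for t in range(200):            # fixed budget stands in for early stopping
    opt.zero_grad()
    loss = ((f(z) - x0) ** 2).mean()   # E(f_theta(z); x0) from eq. (4.3)
    loss.backward()
    opt.step()
    losses.append(loss.item())

x_star = f(z).detach()          # restored image x* = f_theta*(z)
```

Only the weights θ are updated; z never changes, so all information about the output image is absorbed into the network parameters, exactly as in equation 4.3.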
Figure 5.1 in the next chapter depicts the hourglass architecture. All works that improve
over DIP have architectural details similar to those described here.
4.2.2 Important Results
The DIP paper shows experimental results for various image restoration tasks, including
single image denoising. They also claim that their model can work under the assumption
of blind denoising - where we do not know the noise model and can successfully recover
images from complex corruptions. The paper has also shown results for the Gaussian
noise model, but not for impulse noise, shot noise, or other noise models. One important
observation is that the DIP method’s performance for the Gaussian noise model is
similar to that of non-local learning-free approaches like CBM3D [91], but it outperforms
them for non-Gaussian noise models.
A hand-crafted prior is one in which we embed hard constraints and teach, from
synthesized data, what types of images are faces, natural images, etc. As no part of the
neural network fθ is learned from a dataset prior to this, such a deep image prior
is effectively hand-crafted, and empirically it is shown to outperform many standard
non-learning priors such as TV [119] and BM3D [90], as well as a few learning-based
approaches.
4.3 Related Work
The DIP method is similar to works that exploit the self-similarity properties of natural
images and do not undergo training on a hold-out set. In that regard, this approach
is similar to the BM3D approach (Section 3.2.3) and the Non-Local Means algorithm
[120]. These methods avoid any training and use hand-crafted priors.
The DIP work [10] has demonstrated a remarkable phenomenon that CNNs can
be used for solving image restoration problems without any offline training and exter-
nal data. Since then, many algorithms have been developed that improve upon this
extraordinary idea. [121] describes an improvement using ”backprojection” (BP).
As we know, image restoration tasks can be formulated as the minimization of a cost
function composed of a fidelity term and a prior term. We saw this type of formulation
in Section 4.2.1, and it can be further generalized as follows:
min_x l(x, y) + βs(x)    (4.6)
where l is the fidelity term, s is the prior term, and β is a positive parameter that
controls the level of regularization [121].
The backprojection fidelity term was first introduced in 2018 by [23] as an alternative to
the widely used least squares (LS) fidelity term [121], l(x, y) = (1/2)||y − Ax||²₂, and
empirically it has been shown that this fidelity term, for the different priors we have
discussed so far (e.g., TV, BM3D and pre-trained CNNs), yields better recoveries than
LS for badly conditioned A and requires fewer iterations of optimization algorithms. In
[121], they demonstrate the use of the BP fidelity term to improve the performance
of standard DIP (which uses the LS fidelity term as the loss function). However, the
paper only examines performance for image deblurring tasks; it still remains to evaluate
this method’s performance on the remaining image restoration tasks such as image
denoising and super-resolution. In another line of work, Cheng et al. [122] show that
by conducting posterior inference using stochastic gradient Langevin dynamics, one
can avoid the need for early stopping, which is a major limitation of the current DIP
approach, and improve results for image restoration tasks. They prove that the DIP is
asymptotically similar to a stationary Gaussian process prior as the number of channels
in each layer of the network goes to infinity (in the limit), and they derive the corresponding
kernel [122]. A Gaussian process is an infinite collection of random variables for which
any finite subset is jointly Gaussian distributed [122][123]. In another training-free ap-
proach, [124] presents a self-supervised learning method for single-image denoising. In
the introduced method, the network is trained with dropout on the pairs of Bernoulli-
sampled instances of the input image. The result is then estimated by averaging the
predictions generated from multiple instances of the trained model with dropout. The
authors empirically show that the proposed method not only significantly outperforms
existing single-image non-learning methods but is also competitive with denoising
networks trained on external datasets. However, it still requires dropout and might be
unstable without early stopping.
4.4 Limitations of Deep Image Priors
DIP is one of the most popular methods for reconstruction tasks and was state-of-the-art
until recently. Though the method has many merits, it has many limitations as
well. One of the most significant unsolved issues for the DIP method is the need
for early stopping: unless we employ early stopping, the PSNR (or SSIM) values will
drop after some number of iterations. [10] also employs early stopping, and nearly all
of the research that came after DIP uses some form of early stopping.
Another issue is that there is no clear explanation of why the image
prior emerges, or why these priors fit the structure of natural images so well;
there are no definitive answers to these questions that explain their effectiveness. With
respect to DIP’s practical applications, there are two main problems: it is extremely slow
and is unable to match or exceed the results of problem-specific methods [10]. For
example, we saw in subsection 4.2.2 that the performance for the Gaussian noise
model in image denoising did not significantly outperform non-local state-of-the-art
methods like CBM3D [91] or NLM [120]. In the work by Ulyanov et al. [10], the image
corruption rate used for the image denoising task is less than 0.5, so it remains
to be seen whether this method is efficient for higher corruption rates with respect to the
execution-time and performance trade-off.
Chapter 5
Rethinking Single Image
Denoising
So far, we have discussed background concepts related to the image denoising task and a
few of the earlier methods that attempt to tackle it. We then discussed some deep
learning approaches to this open task, specifically deep image priors. In this chapter we
will introduce a few potential approaches to address the limitations of DIP mentioned in
the last chapter and describe the experiments we conducted as a part of this thesis work.¹
5.1 Over-parameterisation in deep learning
Although DIP is an extremely popular method, it has its limitations. As we gain
better insight and understanding regarding the generalizing abilities of DNNs, a new
concept of learning over-parameterized models has emerged; it has become a crucial
topic in machine learning since 2017 [125][11][126][127].
Over-parameterization occurs when the number of learnable parameters is much
larger than the number of the training samples (or equivalently when we fit a richer
model than necessary). Deep artificial neural networks operate in this regime where
they have far more trainable model parameters than the number of training exam-
ples. Nevertheless, some of these models exhibit remarkably small generalization error,
i.e., the difference between “training error” and “test error”. Traditional learning
theory suggests that when the number of parameters is large, some form of regularization
is needed to ensure a small generalization error [128]. But recent research has shown
results that contradict traditional learning theory, finding that over-parameterization
empirically improves both optimization and generalization.
¹ Work done under project investigator Taihui Li and as a part of the Sun research group.
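As a concrete illustration (a hypothetical sketch of our own, not a network from this thesis), the parameter count of even a modest fully connected network can dwarf a typical training-set size:

```python
def count_params(layer_widths):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))

# A modest MLP on 32x32 RGB inputs (3072 features), two hidden layers of 512 units:
n_params = count_params([3072, 512, 512, 10])
n_train = 50_000  # a CIFAR-10-sized training set, for comparison
print(n_params)   # 1841162 -- far more parameters than training samples
```

Such a network operates squarely in the over-parameterized regime described above, yet networks like it routinely exhibit small generalization error.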
5.1.1 Over-parameterisation vs. over-fitting
It is important to remark that over-parameterization and over-fitting are two different
phenomena, and over-parameterization does not necessarily lead to over-fitting. When
conventional learning theory states that over-parameterization leads to over-fitting, the
parameters concerned belong to the hypothesis space from which the classifiers are
constructed, whereas in deep neural networks such parameters include those of the
classifier-construction part (fully connected layers). Thus, learning theory is mostly
concerned with the training of a classifier (learner) on a classification task from a given
feature space, but it says little about the construction of the feature space itself. So,
though we can use the conventional theory to reason about generalization, we must be
cautious when this theory is applied to representation learning. This point is demon-
strated and debated convincingly in [129].
5.1.2 Regularisation
Regularisation is a collective group of strategies explicitly designed to reduce test
error so that an algorithm performs well not only on training data but also on unseen
inputs. Many forms of regularization are available to the deep learning practitioner;
in fact, developing more effective regularization strategies has been one of the major
research efforts in the field. There are two major types of regularisation relevant to
our thesis discussion - implicit and explicit regularisation.
• Regularization introduced either as an explicit penalty term or by modifying opti-
mization through, e.g., dropout, weight decay, or one-pass stochastic methods can
be referred to as explicit regularisation. A lot of work has been done on under-
standing the effects of explicit regularisation on training data and deep learning
models’ performance.
• Implicit regularization means that some form of regularization is being intro-
duced implicitly by the model or its training procedure. For example, Neyshabur
et al., 2014 [130] reason that the low generalization error seen with overparame-
terized models is caused by an implicit regularization introduced by the optimiza-
tion of the network. The optimization objectives for learning high-capacity (over-
parameterized) models have many global minima that fit the training data
perfectly. Implicit regularization was adopted in DIP as well.
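A minimal, self-contained illustration of implicit regularization (a sketch of our own, not an experiment from this thesis): on an underdetermined least-squares problem, plain gradient descent initialized at zero converges to the minimum-norm solution among the infinitely many solutions that fit the data perfectly, without any explicit penalty term:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 50                       # far fewer equations than unknowns
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = np.zeros(n)                     # zero initialization is essential here
lr = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(5000):
    x -= lr * A.T @ (A @ x - b)     # gradient of 0.5 * ||Ax - b||^2

# Gradient descent never leaves the row space of A (each update is A^T times
# something), so it lands on the minimum-Euclidean-norm interpolant,
# i.e. the pseudoinverse solution:
x_min_norm = np.linalg.pinv(A) @ b
print(np.allclose(x, x_min_norm))
```

The optimizer itself, not any penalty term, selects one particular global minimum; this is the sense in which optimization regularizes implicitly.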
5.2 Low-rank matrix recovery problem
In this section, we will see the intuition behind the low-rank matrix recovery problem
and prior work that attempts to solve this issue. It is important to study this problem
because the network f(θ) in DIP [10] (chapter 4) has a U-shaped architecture and can
be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization
X = UU^T. Therefore, DIP also inherits the drawbacks of the exact-parameterization
approach for low-rank matrix recovery: namely, it requires either a meticulous choice
of network width or early stopping of the training process [11].
Low-rank matrices play an essential role in modeling and computational methods for
machine learning. They lay the foundation for both classical techniques such as principal
component analysis [131][132][133] as well as modern approaches to multi-task learning
[134][135] and natural language processing. Specifically, they have broad applications
in face recognition [136] (where saturation in brightness, self-shadowing, or specularity
can be modeled as outliers), video surveillance [136][11] (where the foreground objects
are usually modeled as outliers), and beyond.
However, the matrices we are interested in can be extremely large. Although memory
costs (and costs for acquiring data) are getting cheaper, this only encourages bigger
matrix sizes. This causes a number of issues, primarily that fully observing the matrix
of interest can prove to be an impossible task; the observations can also be corrupted
with large errors. In such a case, we are left with a highly incomplete set of observations,
and unfortunately, many of the most popular approaches to processing the data in
low-rank matrix applications assume that a fully sampled data set is available and are
generally not robust to missing or incomplete data. Thus, we have an inverse problem
of retrieving the full matrix from these incomplete observations. While such a recovery
is not possible in general, when the matrix is of low rank it is possible to exploit this
structure and execute the recovery in an astonishingly efficient manner [133]. Therefore,
low-rank matrix recovery is an essential step towards solving many of the actual
applications discussed above.
There are several methods to solve low-rank matrix recovery problems, of which the
most commonly used in practice are low-rank approximation, nuclear norm minimization,
iterative hard thresholding, and alternating projections.
A long-established method for low-rank matrix recovery is nuclear norm mini-
mization. Such a method is provably accurate under certain incoherence conditions
[136][137]. However, minimizing the nuclear norm involves expensive computations of
the singular value decomposition (SVD) of large matrices [133] (when n is large), which
forbids its application to problem sizes of practical interest. But these issues have been
mitigated by the recent development of matrix factorization methods [138][139]. These
methods rely on parameterizing the signal X ∈ R^(n×n) via the factorization X = UU^T.
This gives rise to a non-convex optimization problem with respect to U ∈ R^(n×r), where
r is the rank of X∗ [140].
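As a toy sketch of the factorization approach (illustrative only; the dimensions, step size, and iteration count are arbitrary choices of ours, not values from the cited works), gradient descent on the over-parameterized factorization X = UU^T can recover a low-rank PSD matrix without ever computing an SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3
U_true = rng.standard_normal((n, r))
X_star = U_true @ U_true.T              # the rank-r PSD matrix to recover

# Over-parameterize with r0 = n (no prior knowledge of r) and a small init.
U = 1e-3 * rng.standard_normal((n, n))
lr = 1e-3
for _ in range(10000):
    grad = 4 * (U @ U.T - X_star) @ U   # gradient of ||U U^T - X*||_F^2
    U -= lr * grad

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
print(rel_err)                          # small: the factorization fits X*
```

The small initialization is what keeps the over-parameterized factor from overfitting the full-rank search space; this is the same algorithmic bias exploited in the next section.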
5.3 Rethinking Single Image Denoising: Main Ideas
So far, we have discussed the fundamental concepts of over-parameterization, regular-
ization, and low-rank matrix recovery, all of which are vital for understanding the next
two ideas, presented in recent works by You et al. [11] and Jing et al. [141]. Finally, we
will present an ensemble approach that combines concepts from these two papers and
argue why the combination of these two ideas might provide insight for solving current
DIP limitations and be a step towards robust image recovery.
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates
In [11], the authors discuss how the challenges associated with exact-parameterization
methods can be simply and effectively dealt with via over-parameterization and dis-
crepant learning rates. Their arguments are supported by the recent results in [126][142]
for low-rank matrix recovery.
The key to the success of this method is the notion of implicit bias of discrepant
learning rates. The idea is that the algorithmic low-rank and sparse regularizations
need to be balanced in order to discern the underlying rank and sparsity. In the absence
of a means for tuning a regularization parameter [11], the authors show that the desired
balance can be acquired by using different learning rates for different optimization
parameters. Below are the four subsections that summarize the main results and
algorithms discussed in this paper.
Double Over-Parameterization Formulation
In [11], the aim is to learn an unknown signal X∗ ∈ R^(n×n) from its grossly corrupted
linear measurements:
y = A(X∗) + s∗     (5.1)
where the operator A(·) : R^(n×n) → R^m, and s∗ ∈ R^m is a sparse corruption vector (this
formulation is similar to the one discussed in the Section on noise models). Equivalently,
it is the problem of recovering a rank-r (r ≪ n) positive semi-definite matrix X∗ from
its grossly corrupted linear measurements, as given in equation 5.1.
The work introduces a double over-parameterization approach for robust matrix
recovery, with the double parameterization X = UU^T and s = g ◦ g − h ◦ h:
min_{U ∈ R^(n×r0), {g,h} ⊆ R^m}  f(U, g, h) := (1/4) ||A(UU^T) + (g ◦ g − h ◦ h) − y||_2^2     (5.2)
where the dimensional parameter r0 ≥ r. Practically, the choice of r0 depends on how
much prior information we have about X∗: it can either be taken as an estimated upper
bound for r, or set to r0 = n with no prior knowledge.
Thus, the authors introduce a method that is based on over-parameterizing both
the low-rank matrix X∗ and the outliers s∗, thereby leveraging the implicit algorithmic
bias to find the correct solution (X∗, s∗).
Algorithmic Regularizations via Gradient Descent
In general, over-parameterization leads to under-determined problems which can have an
infinite number of solutions (analogous to linear algebra, where the number of parameters
exceeds the number of equations). Thus, not all solutions of the doubly over-parameterized
equation 5.2 will correspond to the desired (X∗, s∗). The paper empirically and theoreti-
cally shows that gradient descent iteration on equation 5.2 with properly selected
learning rates enforces an implicit bias on the solution path and thereby automatically
identifies the desired, regularized solution (X∗, s∗). Proof of the above ideas is beyond
the scope of this work.
Implicit Bias with Discrepant Learning Rates
Optimizing a linear multi-layer neural network via gradient descent leads to a low-rank
solution, and this phenomenon is known as implicit regularization. It has been exten-
sively studied in the context of matrix factorization [126][143][144], linear regression
[145][146], logistic regression [125], and linear convolutional neural networks [127]. It
is well known that optimization algorithms like gradient descent introduce implicit bi-
ases (without early stopping) [125] that play a crucial role in the generalization ability
of learned models, but how to control the implicit regularization of gradient descent is
still an open challenge. The authors also theoretically derive the value of the penalty (λ)
as the algorithm approaches convergence for the unconstrained Lagrangian formulation
of the rank-r matrix X∗. They further prove that the implicit regularization can be con-
trolled, without explicitly adding any regularization term to equation 5.2, by adapting
the ratio of the learning rates. This observation directly contradicts conventional
optimization theory [147], in which learning rates only affect the algorithm’s convergence
rate but not the quality of the solution.
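To make the discrepant-learning-rate idea concrete, here is a small synthetic sketch of our own (with A taken as the identity, hand-picked step sizes, and a symmetric sparse corruption; all of these are simplifying assumptions for illustration, not the exact setting or algorithm of [11]). Gradient descent on the doubly over-parameterized objective (5.2), with a larger step size on the sparse factors (g, h) than on U, recovers the low-rank matrix from small initializations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
U0 = rng.standard_normal((n, r))
X_star = U0 @ U0.T                               # rank-r PSD signal X*

# Symmetric sparse corruption s* (kept symmetric so the residual stays symmetric)
mask = rng.random((n, n)) < 0.08
S = np.where(mask, 5.0 * np.sign(rng.standard_normal((n, n))), 0.0)
S_star = np.triu(S) + np.triu(S, 1).T
Y = X_star + S_star                              # measurements with A = identity

U = 1e-3 * rng.standard_normal((n, n))           # r0 = n: no prior knowledge of r
g = 1e-3 * np.ones((n, n))
h = 1e-3 * np.ones((n, n))
lr_U, lr_s = 1e-3, 5e-3                          # discrepant learning rates

for _ in range(4000):
    R = U @ U.T + (g * g - h * h) - Y            # residual inside (5.2)
    U -= lr_U * (R + R.T) @ U / 2                # grad of (1/4)||R||_F^2 w.r.t. U
    g -= lr_s * R * g                            # ... w.r.t. g
    h += lr_s * R * h                            # ... w.r.t. h (grad is -R o h)

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
```

The ratio lr_s / lr_U here plays the role of the implicit regularization weight discussed above; it was picked by hand for this toy problem, whereas [11] characterizes its effect theoretically.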
Extension to Natural Images Denoising
Finally, combining the ideas from the above subsections to solve the image restoration
problem, the approach in [11] is inspired by [10]’s DIP method: they use the formulation
in equation 5.1 with X = φ(θ), where φ is a deep convolutional network and θ ∈ R^c
represents the network parameters.
Figure 5.1: Architecture used in [10], and also the base architecture for You et al. [11]:
an hourglass (also known as encoder-decoder) architecture, sometimes with skip
connections (shown in yellow). nu[i], nd[i], ns[i] correspond to the number of filters
at depth i for the upsampling, downsampling, and skip connections respectively. The
values ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al.
[10]
With respect to implementation details, the network φ(θ) is similar to the original
DIP work [10]. It has the same U-shaped architecture with skip connections, where each
layer contains a convolutional layer, a LeakyReLU layer, and a batch normalization layer.
The noise model for the images in You et al. [11] is salt-and-pepper noise.
Thus, the ideas presented in [11] are promising. Due to the algorithmic bias of
discrepant learning rates, the need to tune the network width or terminate early is
eliminated, because the method alleviates the problem of over-fitting for robust image
recovery. This advantage also enables the method to recover different image types with
varying corruption levels without the need to tune the network learning parameters;
that is, for different noise models you would not need to change the network width or
other learning parameters.
5.3.2 Implicit Rank-Minimizing Autoencoder
Autoencoders (AEs) are a popular category of methods for learning representations with-
out requiring labeled data; we discussed this in more detail in Section 2.4. An essential
component of autoencoder methods is the mechanism by which the information capacity
of the latent representation is minimized or limited. In [141], the rank of the covari-
ance matrix of the codes is implicitly minimized by relying on the fact that gradient
descent learning in multi-layer linear networks leads to minimum-rank solutions.
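The fact relied upon in [141] can be illustrated with a toy deep linear network (again a hedged sketch with sizes, step size, and stopping point of our own choosing; [141] inserts the linear layers inside an autoencoder, whereas here we use a bare two-layer linear factorization). Gradient descent from a small initialization fits the dominant low-rank structure of a noisy target first, so the learned product has low effective rank long before it fits the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y_clean = 10.0 * Q1 @ Q2.T                       # rank-3 signal, singular values = 10
Y = Y_clean + 0.2 * rng.standard_normal((n, n))  # full-rank noisy target

W1 = 1e-2 * rng.standard_normal((n, n))          # small init drives the low-rank bias
W2 = 1e-2 * rng.standard_normal((n, n))
lr = 1e-3
for _ in range(900):
    G = W2 @ W1 - Y                              # grad of 0.5 * ||W2 W1 - Y||_F^2
    W1, W2 = W1 - lr * W2.T @ G, W2 - lr * G @ W1.T

P = W2 @ W1
svals = np.linalg.svd(P, compute_uv=False)
eff_rank = int((svals > 0.05 * svals[0]).sum())  # effective rank of the product
```

Adding more linear factors sharpens this bias, which is the intuition behind inserting extra linear layers between encoder and decoder in the method proposed below.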
5.4 Proposed Methodology
Now that we understand the two main ideas presented in this chapter, we discuss an
ensemble method that is an amalgamation of the two. Exploiting the double over-
parameterization of the two low-dimensional structures in the image restoration ob-
jective, along with discrepant learning rates to regularize the optimization path, we can
forgo the need for early termination and parameter tuning, whereas adding additional
linear layers between the encoder and decoder ensures the minimum possible rank-
regularized solution, which ensures convergence.
With respect to network details, our method is broadly similar in structure to You et al.
[11], which, like DIP, uses a U-shaped architecture with skip connections; ours is a
U-shaped architecture with residual connections and some further modifications. We
change the constituents of the encoder-decoder blocks (shown in figure 5.1) from vanilla
ConvNet layers to ResNet blocks. Our method has multiple ResNet blocks, and each
block consists of three each of convolutional, batch normalization, and LeakyReLU
layers. This is in contrast to the original DIP, which did not have batch normalization
layers or LeakyReLU; You et al. [11] used ReLU activation instead. Also, we do not
over-parameterize our noise model as is done by You et al. [11]. Another change with
respect to DOP (we will sometimes refer to You et al.’s work as the ”Double Over-
Parameterized Prior” - DOP) is that we added three additional linear layers between
the encoder-decoder blocks, inspired by Jing et al. [141]. In the next chapter, we share
results with respect to different numbers of linear layers. In the default configuration,
we always use the l1 loss, as compared to DIP and DOP, which use the MSE loss.
Thus, towards finding an effective solution to our problem of single-image denoising,
we proposed a solution that ensures a minimum-rank regularized solution via double
over-parameterization of both the minimum-rank matrix signal and the sparse corruption
vector, leveraging implicit algorithmic bias. In this chapter, we laid the foundations of
this approach with theoretical arguments. In the next chapter, we detail the experimental
steps and results for our approach.
Chapter 6
Preliminary Experiments
In this chapter, we present results and analyze observations from some preliminary exper-
iments conducted during the course of this thesis work. The results stated below
are part of an ongoing effort towards solving some of the limitations of single-image
denoising (as discussed in chapters 4 and 5).
6.1 Dataset
In this thesis work, we have focused on image restoration, exclusively single-image de-
noising. For developing the algorithm, we use the popular set of images that is used
widely in almost all image denoising works - a set of 8 images: Lena, peppers, F16-GT,
Barbara, lake, Kodak, baboon, and snail. These are the standard benchmarking images,
and we have also tested our algorithm on a subset of this set - F16 and Lena.
6.2 System Configuration
We used Google Colab Pro with an NVIDIA Tesla P100 GPU, 16 GB of system RAM,
and 100 GB of Google cloud storage for development and experimentation. In our work,
we have used PyTorch 1.8.1.
6.3 Hyper-parameter tuning
We have tuned our models with all possible combinations of the values mentioned below
for the following parameters: learning rate, optimization function, kernel size, activa-
tion function, and input noise model. We varied the learning rates from 1e-5 to
0.1. The optimizers used are Adam [148] and SGD [149], and the activation functions
used are ReLU and Leaky-ReLU [150]. The input noise models are Gaussian and salt-
and-pepper, with corruption rates varying from 20% to 90%.
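As a reference for how such corrupted inputs can be generated (a generic sketch, not the thesis's actual data pipeline; the function name and defaults are our own):

```python
import numpy as np

def add_salt_and_pepper(img, rate, seed=None):
    """Corrupt a fraction `rate` of pixels in a float image scaled to [0, 1]:
    roughly half of the corrupted pixels become 1.0 (salt), half 0.0 (pepper)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    corrupt = rng.random(img.shape) < rate      # which pixels to corrupt
    salt = rng.random(img.shape) < 0.5          # salt vs. pepper, per pixel
    out[corrupt & salt] = 1.0
    out[corrupt & ~salt] = 0.0
    return out

noisy = add_salt_and_pepper(np.full((256, 256), 0.5), rate=0.6, seed=0)
```

At rate=0.6, roughly 60% of the pixels end up saturated, matching the corruption levels reported in the results below.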
6.4 Results and Observations
In the figures below, we show results for three different methods - DIP, You et al. [11],
and our proposed approach (outlined in the previous chapter). All results shared use
SGD as the optimizer and learning rates of 0.01, τ = 1, and 0.1. For DIP, we use a
variant of the original method where the loss function used is the l1 loss; DIP-l1 gives
far better results than MSE DIP. Similarly, for You et al. and our proposed approach,
we use the l1 loss. We report results using PSNR values, where for DIP-l1 we output
the last iterations averaged using an exponential sliding window (as reported in the
paper, the averaged output of the model gives excellent results in blind image denoising).
Below we show results for our architecture on the image Lena. The noise models used
are salt and pepper with various corruption levels (from 0.5 to 0.9) and Gaussian noise
with σ = 25.
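The two quantities used throughout this section can be computed as follows (a self-contained sketch; the peak value and the smoothing factor are illustrative assumptions, not the thesis's exact settings):

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB between an estimate and a reference image."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ema_update(avg, out, beta=0.99):
    """Exponential sliding-window average of iterates, as used to report DIP-l1 output."""
    return out if avg is None else beta * avg + (1.0 - beta) * out

# For instance, an estimate off by a constant 0.1 from its reference scores ~20 dB:
print(psnr(np.full((8, 8), 0.6), np.full((8, 8), 0.5)))  # ≈ 20.0
```

Averaging the last iterates with `ema_update` smooths out iteration-to-iteration fluctuations before the PSNR is reported.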
In Figure 6.1, we depict Lena, its 60% corrupted version with salt and pepper noise,
and the reconstructed image using our proposed denoising algorithm. Figures 6.4, 6.5,
and 6.6 compare the performance of all three methods for the same corruption type
and level.
We performed experiments at various corruption levels for all three methods and
observed that as the corruption level increases (salt and pepper noise), the best PSNR
values decrease and the reconstruction becomes noisier. This is depicted in Figure 6.9.
Our method and You et al. [11] perform better at higher corruption levels than DIP,
while our method performs mostly better than the rest at corruption levels lower than
70%. At higher corruption levels You et al. [11] (or the DOP method) is slightly better,
but on average our method is more stable and consistent in performance than the rest. The
results reported in this chapter are all based on the l1 loss; even for DIP we use its l1 loss
variant, whereas the original paper [10] uses the MSE loss. Another important observation is
that You et al. [11]’s method, if used with the Adam optimizer, needs early termination to
stop the dip in PSNR values. The same trend is seen in our approach. Figures 6.7-6.8
depict this observation for the You et al. [11] (DOP) approach and show that the network
starts learning noise in the absence of early stopping: the reconstruction PSNR worsens
as training progresses. Regarding another novelty of our method - the linear layers
added between the encoder-decoder blocks - there are important observations on the
effect of increasing the number of linear layers. Jing et al. prove that adding more
linear layers increases the regularization effect; this claim was consistent with our
observations, along with another important result. On increasing the linear layers from 3
to 6, as the regularization effect increases, the time to reach the best PSNR decreases (as
shown in Figure 6.10): not only does the network become comparatively more stable,
the number of epochs needed to reach the best PSNR also decreases. Averaged over 3
separate runs, the numbers of epochs for 3 and 6 linear layers were 19625 and 15375,
respectively. For 9 linear layers, the average over 3 different trials was 16375, which is
fewer than for 3 layers but slightly more than for 6 layers. With a larger number of
trials, we might see a clearer trend. (Note: Figures 6.9-6.10 report results on the
F16-GT image.)
Figure 6.1: From top left to bottom right: (a) The images in top row show ground truth
image Lena and (b) its noisy counterpart using 60% corruption level for salt and pepper
noise. The bottom row images show (c) the real image, same as (a), (d) the noisy image,
same as (b), and (e) the reconstructed image using our approach.
Figure 6.2: From top to bottom: (a) the PSNR plot corresponding to Figure 6.1, and
(b) the loss plot (L1 loss) for the reconstruction process using our approach.
Figure 6.3: From top left to bottom: (a) The images in top row show ground truth image
F16-GT and (b) its noisy counterpart using 80% corruption level for salt and pepper
noise. The bottom row image shows (c) the reconstructed image using our approach.
The best PSNR achieved is 21.8671 dB. As we can see, for higher corruption levels we
get poorer performance.
Figure 6.4: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the
DIP-l1 approach [10]. The best PSNR achieved is 28.548 dB.
Figure 6.5: From top left to bottom right: (a) The images in top row show ground
truth image F16-GT and (b) its noisy counterpart using 50% corruption level for salt
and pepper noise. The bottom row shows (c) the original image, same as (a), (d) the
noisy image, same as (b), and (e) the reconstructed image using our approach.
The best PSNR achieved is 29.2449 dB.
Figure 6.6: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the You
et al. [11] approach (width = 128). The best PSNR achieved is 28.9 dB.
Figure 6.7: From top left to bottom: (a) The images in top row show ground truth
image F16-GT and (b) its noisy counterpart using 50% corruption level for salt and
pepper noise. The bottom row image shows (c) the reconstructed image using the You
et al. [11] approach (width = 128) and the Adam optimiser (unless mentioned, the
optimizer is SGD). The reconstructed image becomes noisier as training continues
because the network starts learning noise in the absence of early termination.
Figure 6.8: (a) The PSNR plot corresponding to Figure 6.7. It clearly demonstrates
the need for early termination. The best PSNR achieved is 28.14 dB before the dip.
Figure 6.9: (a) PSNR plots for different corruption levels for each of the three methods
discussed in the last chapter. The line plots show the best PSNR levels across models
for different corruption rates. The plot supports our claim in the observations that with
increasing corruption rate, the best PSNR level reached by all three methods decreases.
Also, our method is the most consistent and stable in performance amongst the three.
Figure 6.10: The effect of the number of linear layers for our method as the number goes
from 3 to 9. For three independent trials, the average performance is depicted via the
dashed line; the number of epochs to reach the highest PSNR value decreases when we
increase the number of layers from 3 to 6, and from 6 to 9 layers we see a slight increase.
The corruption level for all three trials is 50%.
Chapter 7
Conclusion and Discussion
In this thesis work, we aimed to study the fundamental theory of image restoration
tasks, their causes, and artifacts. We started with the mathematical formulation of
basic image restoration tasks and dived into the conceptual theory of popular deep
learning methods such as CNNs, autoencoders, and GANs, which are building blocks of
many state-of-the-art denoising algorithms. We also reviewed different noise models, the
causal effect they have on restoration tasks, and the relevant perceptual quality measures
used to evaluate recovery performance, especially for image denoising. We attempted
to chronologically categorize the existing work in the image denoising field and analyze
the shortcomings of the classical spatial-domain methods that led to transform-domain
methods, followed by the current family of state-of-the-art NN-based methods built on
priors learned with deep neural networks. In chapter 4, we introduced the seminal work
by Ulyanov et al. [10] on deep image priors, which is the foundation of the current
learning-free untrained network methods. This work led to a shift in perspective from
conventional theory and motivated newer research ideas like [122] and [11]. We saw that
despite deep image priors’ competitive performance with non-local methods like BM3D,
there exist a number of non-trivial limitations that hinder a reliable adoption of this
technology in practical scenarios, for example, the need for early stopping. There was
also an incomplete understanding of the method’s generalization performance and its
quantification. Since 2017, we have seen growth in understanding of why CNNs and
overparameterized networks generalize so well, and of the hardness of training neural
networks. Since then, many have come up with interesting
ideas for improving the untrained-network paradigm and making it more robust to its
limitations. [11] proved that the algorithmic bias of discrepant learning rates in a doubly
over-parameterized network eliminates any need for tuning learnable parameters and
early stopping. We exploit this work and [141] to propose a new method for single-
image denoising, and argue why this methodology might work and be a step towards
building a generalized image denoising algorithm.
There are many limitations to this work. While we suggest a probable method-
ology exploiting the best of two ideas, as discussed in chapter 5, it is not without its
shortcomings. As we saw in [10], there is an explicit need for early stopping. This
problem does not exist for [11], but only if we use gradient descent as the optimizer;
for the Adam optimizer, the need for early stopping still exists, and this is also reflected
in our proposed method. Another major disadvantage is that [11] and [10] cited results
for impulse and Gaussian noise models, i.e., sparse corruptions (results can be extrapo-
lated to shot and speckle noise models), but these methods may not work for other
types of noise models (non-sparse corruptions) such as defocus blur, elastic noise, etc.
[151]. The DIP paper argues that its method can work for complex noise models in a
blind denoising process, but experimental analysis shows that the results are not com-
petitive compared to classical methods such as BM3D. Further, with more complex
noise models, SGD as an optimizer might not work, leaving us with a need for early
stopping yet again. For our proposed algorithm, we still need to explore its fallacies for
more noise models and corner cases, and we need to find a way to mitigate overfitting.
The results presented in the last chapter were part of a preliminary investigation and
did not demonstrate the generalization capacity of the algorithm.
Thus, we proposed a method that eliminates the need for learnable-parameter tun-
ing and early stopping, and gives better results compared to DIP or [11] for simple
noise models. But our method is still not robust to all 19 types of noise models in
[151]. Many recent works suggest interesting ideas like TV-regularized DIP [71] or
Bayesian DIP [122], and there is definite potential for combining these methods with
our approach of deep linear autoencoders with skip connections, algorithmic bias of
discrepant learning rates, and SGD. Thus, it is still an open challenge to develop a
generalized algorithm that handles multiple corruptions in a stable manner.
References
[1] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian.
Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE
Transactions on Image Processing, 16(8):2080–2095, 2007.
[2] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond
a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE
Transactions on Image Processing, PP, 08 2016.
[3] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and
Ronald M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and bench-
marks on weakly-supervised classification and localization of common thorax
diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Jul 2017.
[4] Fahad Shamshad, Muhammad Awais, Muhammad Asim, Zain ul Aabidin Lodhi,
Muhammad Umair, and Ali Ahmed. Leveraging deep stein’s unbiased risk esti-
mator for unsupervised x-ray denoising, 2018, 1811.12488.
[5] Wikipedia. Artificial neural network. https://en.wikipedia.org/wiki/
Artificial_neural_network#/media/File:Neuron3.png.
[6] Malcolm Sambridge. An introduction to Inverse Problems. http://web.gps.
caltech.edu/classes/ge193.old/lectures/Lecture1.pdf.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition, 2015, 1512.03385.
[8] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders, 2021, 2003.05991.
[9] Marc Lebrun. An analysis and implementation of the BM3D image denoising
method. Image Processing On Line, 2:175–213, 2012.
[10] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior.
International Journal of Computer Vision, 128(7):1867–1888, March 2020.
[11] Chong You, Zhihui Zhu, Qing Qu, and Yi Ma. Robust recovery via implicit bias
of discrepant learning rates for double over-parameterization, 2020. arXiv:2006.08857.
[12] Joanna J. Bryson. The past decade and future of AI's impact on society.
[13] Jack Copeland. The Cyc project.
[14] Rupali Ahuja Rajshree. A general review of image denoising techniques.
[15] Mukesh Motwani, Mukesh Gadiya, Rakhi Motwani, and Frederick Harris. Survey
of image denoising techniques. 2004.
[16] S M A Sharif, Rizwan Ali Naqvi, and Mithun Biswas. Learning medical image
denoising with deep dynamic residual attention network. Mathematics, 8(12),
2020.
[17] Dang Thanh, Surya Prasath, and Hieu Le Minh. A review on CT and X-ray image
denoising methods. Informatica, 43:151–159, 2019.
[18] Wikipedia. Deep learning. https://en.wikipedia.org/wiki/Deep_learning#
Deep_neural_networks.
[19] Awan-Ur-Rahman. What is artificial neural network and how it mimics the human
brain?
[20] MIT. Inverse problems. http://web.mit.edu/2.717/www/inverse.html.
[21] Encyclopedia of Mathematics. Ill-posed problems. https://encyclopediaofmath.
org/wiki/Ill-posed_problems.
[22] Wikipedia. Inverse problem. https://en.wikipedia.org/wiki/Inverse_
problem#:~:text=An%20inverse%20problem%20in%20science,measurements%
20of%20its%20gravity%20field.
[23] Tom Tirer and Raja Giryes. Image restoration by iterative denoising and backward
projections. IEEE Transactions on Image Processing, PP, 2017.
[24] Y. Bengio. Learning deep architectures for AI. Foundations, 2:1–55, 2009.
[25] Juergen Schmidhuber. Deep learning in neural networks: An overview. Neural
Networks, 61, 2014.
[26] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad
Alsaadi. A survey of deep neural network architectures and their applications.
Neurocomputing, 234, 2016.
[27] Kaiming He and Jian Sun. Convolutional neural networks at constrained time
cost, 2014. arXiv:1412.1710.
[28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway
networks, 2015. arXiv:1505.00387.
[29] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford
University Press, Inc., USA, 1995.
[30] W. Venables and B. Ripley. Modern Applied Statistics with S, fourth edition. 2002.
[31] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial
networks, 2014. arXiv:1406.2661.
[32] Wikipedia. Generative adversarial network. https://en.wikipedia.org/wiki/
Generative_adversarial_network.
[33] Jason Brownlee. A gentle introduction to generative adversarial networks
(GANs). https://machinelearningmastery.com/
what-are-generative-adversarial-networks-gans/.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford,
and Xi Chen. Improved techniques for training GANs, 2016. arXiv:1606.03498.
[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image
translation with conditional adversarial networks, 2018. arXiv:1611.07004.
Robustness in Deep Learning: Single Image Denoising using Untrained Networks

  • 1. Robustness in Deep Learning: Single Image Denoising using Untrained Networks A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Esha Singh IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science Ju Sun May, 2021
  • 2. © Esha Singh 2021 ALL RIGHTS RESERVED
  • 3. Acknowledgements I would first like to thank my advisor, Professor Ju Sun for providing me with an opportunity to be a part of his research lab and for his continuous support and guidance. This thesis work would not have been successful without his able advice, feedback and teachings. I would also like to thank Taihui Li for helping me with the experiments, his constant support, discussions, analysis and feedback, which helped me completion of my work and improving results. I would also like to thank the members of my thesis committee, Professor Hyun Soo Park and Professor Gilad Lerman. Finally, my deep and sincere gratitude to my family and friends for their uncondi- tional and unparalleled love and support. i
  • 4. Dedication To my mother and father, friends, and colleagues who have mentored and held me up along the way. ii
  • 5. Abstract Deep Learning has become one of the cornerstones of today’s AI advancement and research. Deep Learning models are used for achieving state-of-the-art results on a wide variety of tasks, including image restoration problems, specifically image denoising. Despite recent advances in applications of deep neural networks and the presence of a substantial amount of existing research work in the domain of image denoising, this task is still an open challenge. In this thesis work, we aim to summarize the study of image denoising research and its trend over the years, the fallacies, and the brilliance. We first visit the fundamental concepts of image restoration problems, their definition, and some common misconceptions. After that, we attempt to trace back where the study of image denoising began, attempt to categorize the work done till now into three main families with the main focus on the neural network family of methods, and discuss some popular ideas. Consequently, we also trace related concepts of over-parameterization, regularisation, low-rank minimization and discuss recent untrained networks approach for single image denoising, which is fundamental towards understanding why the current state-of-art methods are still not able to provide a generalized approach for stabilized image recovery from multiple perturbations. iii
  • 6. Contents Acknowledgements i Dedication ii Abstract iii List of Tables vii List of Figures viii 1 Introduction 1 1.1 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2 Image Restoration Problem Formulation . . . . . . . . . . . . . . 7 2.3 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 ResNets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5.1 Noise models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 iv
  • 7. 2.6 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.1 MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.2 PSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6.3 SSIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3 Image Denoising Algorithms: Review 15 3.1 Spatial domain methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Spatial domain filtering . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.2 Variational denoising methods . . . . . . . . . . . . . . . . . . . 16 3.2 Transform domain methods . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Data adaptive methods . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.2 Non Data adaptive methods . . . . . . . . . . . . . . . . . . . . . 18 3.2.3 Block-matching and 3D filtering: BM3D . . . . . . . . . . . . . . 19 3.3 Deep Neural Network methods . . . . . . . . . . . . . . . . . . . . . . . 20 4 Deep Image Prior 22 4.1 Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2 Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2.2 Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4 Limitations of Deep Image Priors . . . . . . . . . . . . . . . . . . . . . . 28 5 Rethinking Single Image Denoising 30 5.1 Over-parameterisation in deep learning . . . . . . . . . . . . . . . . . . . 30 5.1.1 Overparameterisation v/s over-fitting? . . . . . . . . . . . . . . . 31 5.1.2 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 Low-rank matrix recovery problem . . . . . . . . . . . . . . . . . . . . . 32 5.3 Rethinking Single Image denoising: Main Ideas . . . . . . . . . . . . . . 
33 5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates . 34 5.3.2 Implicit Rank-Minimizing Autoencoder . . . . . . . . . . . . . . 36 5.4 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 v
  • 8. 6 Preliminary Experiments 39 6.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.2 System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.3 Hyper-parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.4 Results and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 40 7 Conclusion and Discussion 52 References 54 Appendix A. Glossary and Acronyms 68 vi
  • 9. List of Tables A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 vii
  • 10. List of Figures 1.1 Performance of existing medical image denoising methods in removing image noise. (a) Noisy input, (b) Result obtained by BM3D [1], (c)Result obtained by DnCNN [2]. Source by: (https://www.kaggle.com/mateuszbuda/ lgg-mri-segmentation). . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Denoising results for Chest X-Ray Dataset [3] for Gaussian noise with standard deviation of 25. Left to Right (a) noisy image (b) denoisied image using an unsupervised cleaning. Source: Shamshad et al. [4]. . . . 4 2.1 ANN inspired by biological neural networks. The inputs are denoted by x1, x2...xn at the dendrites and outputs by y1, y2, ...yn at the axon terminal ends. Source: Wikipedia[5]. . . . . . . . . . . . . . . . . . . . . 6 2.2 Inverse problem. Source: caltech GE193[6] . . . . . . . . . . . . . . . . . 7 2.3 Residual learning: a building block. (Source: He et al. [7]) . . . . . . . . 8 2.4 An example of an autoencoder. The input image is encoded to a com- pressed representation and then decoded. (Source: Bank et al. [8]) . . . 11 3.1 Scheme of the BM3D algorithm. (credits: Marc Lebrun [9]) . . . . . . . . . . 19 4.1 Image space visualization for DIP. Assume the problem of reconstructing an image xgt from a degraded measurement x0. The image exemplified by denoising, the ground truth xgt has non-zero cost E(xgt, x0) > 0. Here, if run for long enough, fitting with DIP will acquire a solution with near zero cost quite distant from xgt. However, often the optimization path will pass close to xgt, and an early stopping (here at time t3) will recover good solution. Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . 25 viii
  • 11. 4.2 Figure depicting image restoration process using DIP. Starting from a random weight θ0, one must iteratively update them in order to minimize the data term eq. (4.3). At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a fixed tensor and the mapping f is a neural network with parameters θ. The image x is used to calculate the task-dependent loss E(x, x0). The loss gradient w.r.t. the weights θ is then calculated and used to update the parameters. Source: Ulyanov et al. [10] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.1 Architecture used in [10] and also the base architecture for You et al. [11]. The hourglass (also known as decoder-encoder architecture. It sometimes has skip connections represented in yellow. nu[i], nd[i], ns[i] correspond to the number of filters at depth i for the upsampling, downsampling, and skip-connections respectively. The values ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al. [10] . . . . . . . . . 36 6.1 From top left to bottom right: (a) The images in top row show ground truth image Lena and (b) its noisy counterpart using 60% corruption level for salt and pepper noise. The bottom row images show (c) Real image same as (a), (d) noisy image same as (b) and the, (e) reconstructed image using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.2 From top to bottom: (a) The image in top shows PSNR plot for cor- responding Figure 6.1, (b) loss plot (L1 loss) for reconstruction process using our approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 6.3 From top left to bottom: (a) The images in top row show ground truth image F16-GT and (b) its noisy counterpart using 80% corruption level for salt and pepper noise. The bottom row image shows (c) the recon- structed image using our approach. The best PSNR achieved is 21.8671. As we can see for higher corruptions we get poorer performance. . . . . 
6.4 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the DIP-l1 approach of [10]. The best PSNR achieved is 28.548 dB.
6.5 From top left to bottom right: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the original image, same as (a), (d) the noisy image, same as (b), and (e) the reconstructed image using our approach. The best PSNR achieved is 29.2449 dB.
6.6 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the approach of You et al. [11] (width = 128). The best PSNR achieved is 28.9 dB.
6.7 From top left to bottom: (a) the top row shows the ground truth image F16-GT and (b) its noisy counterpart at 50% corruption level for salt-and-pepper noise. The bottom row shows (c) the reconstructed image using the approach of You et al. [11] (width = 128) and the ADAM optimiser (unless mentioned, the optimizer is SGD). The reconstructed image becomes noisier as training continues because, in the absence of early termination, the network starts learning the noise.
6.8 (a) The PSNR plot corresponding to Figure 6.7. It clearly demonstrates the need for early termination. The best PSNR achieved is 28.14 dB before the dip.
6.9 (a) PSNR plots for different corruption levels for each of the three methods discussed in the last chapter. The line plots show the best PSNR levels across models for different corruption rates. The plot supports our observation that with increasing corruption rate the best PSNR level reached by all three methods decreases.
Also, our method is the most consistent and stable in performance among the three methods.
6.10 The plot depicts the effect of the number of linear layers for our method as the number goes from 3 to 9. For three independent trials, the average performance is depicted via the dashed line; the number of epochs to reach the highest PSNR value decreases when we increase the number of layers from 3 to 6, and from 6 to 9 layers we see a slight increase. The corruption level for all three trials is 50%.
Chapter 1

Introduction

Artificial Intelligence (AI) is the demonstration of intelligence by machines, in contrast to the natural intelligence exhibited by humans, which involves consciousness and emotionality. For two decades, AI has been at the helm of revolutionary changes in industrial and academic research and development. The past decade, and notably the past few years, has been transformative for AI, not so much in terms of what we can do with this technology (theoretical) as what we are doing with it (applied) [12]. The main ideology behind AI has been the faithful emulation of human intelligence and an attempt to endow machines with coherent reasoning. Can machines gain common sense? There are two fundamental paradigms for this problem. One is to provide a comprehensive set of facts and rules encoding human knowledge (an undertaking by the Cyc Project since 1984 [13]). The other is to facilitate the self-learning process of machines, similar to how humans develop common sense. The latter approach has shown great promise, with Deep Learning (DL) being the dominant tool for helping machines gain perception of the world around them. With the abundance of work that exists in this sphere, it is not difficult to experience the power as well as the various limitations of this technology. The lack of generality, data bias, fairness, and robustness to unforeseen situations are some of the well-known challenges in this field. The aim of this thesis is to focus on one such challenge: robust image recovery. It is a relevant issue that is still an open problem and is omnipresent in real-life situations. Self-driving vehicles, digital photography, medical image analysis, remote sensing, surveillance, and
digital entertainment are a few of the applications where, due to unprecedented susceptibilities, existing solutions might not perform as expected. Robustness against natural corruptions and robustness in medical problems are some of the non-trivial open challenges in the sphere of robustness in deep learning, and to tackle such big issues it is advantageous to break them into smaller sub-tasks. Thus, a small step towards handling those situations is to solve the classical yet active problem of robust image recovery under synthetic noise models. If one can solve this problem reliably, it can give us insight into how to work our way towards the more significant hurdles.

One of the rudimentary challenges in the field of image processing and computer vision is image denoising, where the underlying goal is to approximate the actual image by suppressing noise from a noise-contaminated version of the image. Image noise may be caused by several intrinsic (i.e., sensor) and extrinsic (i.e., environment) conditions which are often impossible to avoid in practical situations. Image denoising is a fundamental yet active problem and remains unsolved because noise removal introduces artifacts and unwanted effects such as blurring of the images. The focus of this thesis is to summarize the fundamental concepts behind image denoising tasks, existing work in the field, and qualitative analysis of state-of-the-art methods for this task, and finally to present a probable approach with supportive arguments and preliminary experiments undertaken during the course of this research work.

1.1 Application

Digital images play an essential role both in daily-life applications such as satellite television, medical imaging, and computer tomography, as well as in areas of research and technology such as geographical information systems and astronomy. So it is not difficult to gauge the importance of recovering precise images.
Image recovery is the first and crucial step before images can be analyzed or used further. Thus, image denoising plays a vital role in a wide range of applications such as image restoration, image registration, visual tracking, image segmentation, and image classification, where obtaining the original image content is crucial for strong performance. It is important to develop effective denoising techniques in order to compensate for data corruption, which is introduced when data is collected by imperfect instruments that are generally contaminated by
noise, issues with the data acquisition process [14], and interceding natural phenomena [15].

An important practical application of image denoising is in the medical sciences. Medical images obtained from MRI are the most common tool for diagnosis in medicine and are often influenced by random noise arising in the image acquisition process. Hence, noise removal is essential in medical imaging applications in order to enhance and recover fine-grained details that may be hidden in the data. Medical imaging modalities, including X-rays, Magnetic Resonance Imaging (MRI), Computer Tomography (CT), ultrasound, etc., are susceptible to noise for the reasons discussed in the last section. Hence, it is important to recover the original, high-quality, noiseless images. Image denoising in the field of medicine is referred to as medical image denoising (MID) and is the process of improving the perceptual quality of degraded noisy images captured with specialized medical image acquisition devices [16]. Figure 1.1 is an example of how existing MID methods exhibit deficiencies in large-scale noise removal from medical images and fail in numerous cases [16]. Another use case for medical imaging applications concerns X-ray images. X-ray images provide crucial support for diagnosis and decision-making in several diverse clinical applications. However, X-ray images may be corrupted by statistical noise, gravely deteriorating their quality and raising the difficulty of diagnosis [17][4]. Therefore, X-ray denoising is necessary for improving the quality of raw X-ray images and their relevant clinical information content and analysis.

Figure 1.1: Performance of existing medical image denoising methods in removing image noise. (a) Noisy input, (b) result obtained by BM3D [1], (c) result obtained by DnCNN [2]. Source: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation.
Figure 1.2: Denoising results for the Chest X-Ray Dataset [3] for Gaussian noise with standard deviation of 25. Left to right: (a) noisy image, (b) denoised image using unsupervised cleaning. Source: Shamshad et al. [4].

1.2 Thesis Overview

The rest of the thesis is organized as follows:

• Chapter 2 briefly presents the basic concepts and terminologies used in image restoration studies, which are used throughout the thesis.
• Chapter 3 presents a comprehensive survey of denoising algorithms developed up to 2017.
• Chapter 4 describes the Deep Image Prior (DIP) concept, its limitations, and related work developed on the DIP ideology.
• Chapter 5 presents an alternative perspective on image denoising with the help of two important ideas, which are discussed in detail. Finally, the proposed methodology is introduced.
• Chapter 6 details the experimental setup and analysis for the proposed single-image denoising methodology.
• Chapter 7 presents the conclusion and discusses some future work directions.
Chapter 2

Background

In this chapter, we briefly summarize the fundamental concepts and definitions pivotal to understanding the rest of the thesis, which we will revisit frequently in further chapters.

2.1 Deep Learning

Deep Learning is a sub-domain under the umbrella of machine learning concerned with algorithms that aim to imitate the structure and functionality of the human brain, called artificial neural networks (ANNs), combined with representation learning [18]. More specifically, a neural network is inspired by a neuron; in machine learning, it is an information processing technique that uses the same concept as biological neural networks but is not identical to it [19], the analogy being shown in Figure 2.1. There are multiple deep learning architectures, such as deep neural networks (DNNs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), that have been applied to various fields, including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bioinformatics, where they have produced remarkable results comparable to or surpassing human expert performance [18]. The "learning" in the terminology of "deep learning" can be either supervised, unsupervised, or semi-supervised, whereas the term "deep" refers to the number of layers through which the input is transformed.
Figure 2.1: ANN inspired by biological neural networks. The inputs are denoted by x1, x2...xn at the dendrites and the outputs by y1, y2...yn at the axon terminal ends. Source: Wikipedia [5].

2.2 Inverse Problems

An inverse problem is the procedure of calculating, from a set of observations, the causal factors that produced them: for example, calculating the density of the Earth from measurements of its gravity field [20]. If the information given by a measurement is incomplete (incorrect or improper), then a problem is ill-posed [21]. Thus, inverse problem theory tries to quantify when a problem is ill-posed and to what degree, and to extract maximum information under practical circumstances. Inverse problems occur in many applications, such as image denoising, image deblurring, inpainting, super-resolution, etc. [22]. Figure 2.2 depicts a forward problem with respect to an inverse problem.

2.2.1 Image Denoising

Image denoising refers to the removal of noise from a noisy image, so as to restore the true image. The aim is to recover meaningful information from noisy images and obtain high-quality images in the process of noise removal, which is an important open research problem. The primary reason is that, from a mathematical perspective, image denoising is an inverse problem and its solution is not unique.

Also, image restoration and image denoising are different terminologies, where image
Figure 2.2: Inverse problem. Source: Caltech GE193 [6].

denoising is a type of image restoration problem. There are several types of image restoration problems, for example super-resolution and inpainting, and image denoising is one of them. Therefore, algorithms that solve image restoration problems are also applicable to image denoising problems. The work in this thesis is centered around image denoising.

2.2.2 Image Restoration Problem Formulation

The problem of image restoration can be traditionally formulated as [23]

y = Hx + c    (2.1)

where x ∈ R^n represents the unknown original image, y ∈ R^m represents the observations, H is an m×n degradation matrix, and c ∈ R^m is a vector of i.i.d. (independent and identically distributed) Gaussian random variables with zero mean and standard deviation σc. Thus, Equation 2.1 can represent different image restoration problems: it represents an image denoising problem when H is the n×n identity matrix In, image inpainting when H is a selection of m rows of In, and image deblurring when H is a blurring operator [23].
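As a minimal NumPy sketch of eq. (2.1), the three special cases above can be instantiated on a toy flattened image (the variable names, mask, and blur kernel are illustrative assumptions, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 16                                   # pixels in a (flattened) toy image
x = rng.uniform(0.0, 1.0, size=n)        # unknown original image x in R^n
sigma_c = 0.05                           # noise standard deviation

# Denoising: H is the n x n identity matrix I_n (m = n).
H_denoise = np.eye(n)

# Inpainting: H is a selection of m rows of I_n (here, 6 observed pixels).
observed = np.array([0, 2, 3, 5, 8, 13])     # hypothetical observation mask
H_inpaint = np.eye(n)[observed]              # m x n with m = 6

# Deblurring: H is a blurring operator (here a circulant 3-tap average).
kernel = np.array([0.25, 0.5, 0.25])
H_blur = np.array([np.roll(np.pad(kernel, (0, n - 3)), i - 1) for i in range(n)])

def degrade(H, x, sigma, rng=rng):
    """y = Hx + c with c ~ N(0, sigma^2 I_m), as in eq. (2.1)."""
    return H @ x + sigma * rng.standard_normal(H.shape[0])

y_denoise = degrade(H_denoise, x, sigma_c)   # same size as x
y_inpaint = degrade(H_inpaint, x, sigma_c)   # only the 6 observed values
y_blur = degrade(H_blur, x, sigma_c)         # blurred and noisy
```

Recovering x from any of these y vectors is the corresponding inverse problem; only H changes between the three tasks.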
2.3 Deep Neural Networks

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers [24][25][18]. DNNs, which employ deep architectures, can represent functions of higher complexity as the number of units in a layer and the number of layers are increased [26]. Given enough labeled training data and suitable models, deep learning approaches can help humans establish mapping functions for operational convenience. In this work, which pivots around understanding the theoretical foundations of one such deep learning network, we will mention the use of CNNs and ResNets. The latter is detailed in the sections below.

2.3.1 ResNets

ResNet, short for Residual Network, is a specific type of neural network introduced in 2015 by He et al. [7]. They introduced a residual learning framework to ease the training of networks substantially deeper than those used prior to 2015, explicitly reformulating the layers as learning residual functions with respect to the layer inputs instead of learning unreferenced functions [7].

Figure 2.3: Residual learning: a building block. (Source: He et al. [7])

The popularity of ResNets stems from the fact that they solved a big open challenge. When deeper networks are able to start converging, a degradation problem is exposed: with increasing network depth, accuracy saturates and then drops rapidly [7]. Surprisingly, such degradation is not caused by overfitting. Adding more layers to an appropriately deep model leads to higher training error, as reported in
[27][28]. In 2015, He et al. [7] empirically showed that there is a maximum threshold for depth with the traditional CNN model. Hence, [7] solved the degradation problem by introducing a deep residual learning framework whose basic unit, the residual block, is depicted in Figure 2.3. Instead of assuming that each set of a few stacked layers directly fits a desired underlying mapping, they explicitly let these layers fit a residual mapping. Practically, this idea is realised by feed-forward neural networks with connections that skip one or two layers [7][29][30], called "shortcut connections". These shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Figure 2.3). One of the biggest advantages of this approach is that these identity shortcut connections add neither extra parameters nor computational complexity. Thus, ResNets are heavily used to solve a variety of modern-day tasks.

2.3.2 GANs

Goodfellow et al., 2014 [31] proposed a new generative model estimation procedure, the adversarial nets framework, in which a generative model is pitted against an adversary: a discriminative model that learns to ascertain whether a sample comes from the model distribution or the data distribution. This is analogous to a team of counterfeiters trying to produce fake items and police trying to detect the counterfeits. The competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles. Thus, two neural networks compete with each other (in the form of a zero-sum game, where one network's gain is another network's loss) [32], and this model architecture or framework is called a Generative Adversarial Network (GAN).
More formally, the GAN architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether examples are real, from the domain, or fake, produced by the generator [33]. The generator is used to generate new plausible examples from the problem domain; the discriminator classifies examples as real (from the domain) or fake (generated). Although originally proposed as a form of generative model for unsupervised learning, GANs have also proven effective for semi-supervised learning [34], fully supervised learning [35], and reinforcement learning [36]. A more standardized approach for the GAN
framework, called Deep Convolutional Generative Adversarial Networks (DCGAN), which led to more stable models, was later formalized by Alec Radford et al. [37] in 2015.

2.4 Autoencoder

An autoencoder is a special type of neural network (NN) designed mainly to encode the input into a compressed, meaningful representation and then decode it back such that the reconstructed input is as similar as possible to the original [8]. Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, one designs a neural network architecture that imposes a bottleneck in the network, forcing a compressed knowledge representation of the original input. If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (i.e., correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network's bottleneck [38]. Autoencoders were first introduced in the 1980s by Hinton and the PDP group [39] as an NN trained to reconstruct its input. Mathematically, their main task of learning an "informative" representation of the data that can be used for various purposes can be formally defined [8][40] as learning functions A : R^n → R^p and B : R^p → R^n that satisfy:

argmin_{A,B} E[Δ(x, B ∘ A(x))]    (2.2)

where E is the expectation over the distribution of x, and Δ is the reconstruction loss function, which measures the distance between the output of the decoder and the input. The loss function is usually set to be the l2-norm [40]. Usually, A and B are neural networks [41]. For the special case where A and B are linear operations, the model is called a linear autoencoder [42][40].
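The objective in eq. (2.2) can be sketched for the linear case with plain NumPy: A and B are matrices, Δ is the squared l2-norm, and both maps are fitted by gradient descent on toy data (all names, dimensions, and hyperparameters here are illustrative assumptions, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, p = 200, 8, 2                # samples, input dim, bottleneck dim (p < n)
# Toy data with structure: points near a 2-D subspace of R^8, plus small noise.
X = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))
X += 0.01 * rng.standard_normal((m, n))

W_enc = 0.1 * rng.standard_normal((n, p))   # A : R^n -> R^p
W_dec = 0.1 * rng.standard_normal((p, n))   # B : R^p -> R^n

def recon_loss(X, W_enc, W_dec):
    """Empirical version of E[||x - B(A(x))||^2] from eq. (2.2)."""
    X_hat = X @ W_enc @ W_dec
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

lr = 0.01
loss_start = recon_loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                   # encode
    X_hat = Z @ W_dec               # decode
    G = 2.0 * (X_hat - X) / m       # d(loss)/d(X_hat)
    W_dec -= lr * Z.T @ G           # gradient step on the decoder
    W_enc -= lr * X.T @ (G @ W_dec.T)   # gradient step on the encoder
loss_end = recon_loss(X, W_enc, W_dec)  # should be well below loss_start
```

Because A and B are linear and the loss is squared error, the learned bottleneck spans the same subspace PCA would find on this data.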
A linear autoencoder (i.e., one with no non-linear operations) attains the same latent representation as Principal Component Analysis (PCA) [43]. Therefore, an autoencoder is a generalization of PCA: instead of finding a low-dimensional hyperplane in which the data lie, it is able to learn a non-linear manifold [44]. Thus, while conceptually simple, autoencoders are quite popular and play an important role in machine
learning. Autoencoders can be trained gradually, layer by layer, or end-to-end. In the layer-by-layer case, the trained layers are "stacked" together, which leads to a deeper encoder. In [45], this is done with convolutional autoencoders, and in [46] with denoising autoencoders. We will revisit autoencoders in chapters 5 and 6.

Figure 2.4: An example of an autoencoder. The input image is encoded to a compressed representation and then decoded. (Source: Bank et al. [8])

2.5 Noise

Image noise is a random variation of color information or brightness in images and is generally an aspect of electronic noise [47]. It adds unwanted information to digital images and obscures the desired information. Noise produces undesirable effects such as artifacts, unrealistic edges, unseen lines, corners, and blurred objects. There are multiple sources of noise in images, arising from various stages such as image acquisition, transmission, and compression [48].

2.5.1 Noise models

There are different types of noise models; we mention four popular ones: Gaussian, salt-and-pepper, shot, and speckle noise. Different processing algorithms exist for different types of noise models. For any input image, the noisy image under additive noise is modeled as

g(x) = I(x) + v(x)    (2.3)
where I(x) is the original image without any noise, v(x) is the additive noise model, g(x) is the noisy input image, and x is the set of pixels in the input image.

1. Gaussian noise: Gaussian noise generally arises in the analog signal in the electronics of the camera. It can be modeled as additive noise acting on the input image I to produce a degraded image y:

y = I + ση,  η ∼ N(0, 1)    (2.4)

where σ is the standard deviation [49][50][51]. An example of a denoising algorithm for this type of noise is Gaussian filtering.

2. Salt-and-pepper noise: this impulse noise corresponds to random pixels which are either saturated or turned off. It can arise in equipment with electronic spikes, and we can model it as:

y = I with probability p,  b with probability 1 − p    (2.5)

where b ∼ Ber(0.5) is a Bernoulli variable with parameter 0.5 (corrupted pixels are set to the minimum or maximum value with equal probability). Algorithms used for image recovery from this type of noise include median filtering and mean filtering [52][47].

3. Shot noise: photon shot noise is the dominant noise in the brighter parts of an image from an image sensor and is typically caused by statistical quantum fluctuations, i.e., variation in the number of photons observed at a given exposure level [47]. The root-mean-square value of shot noise is proportional to the square root of the image intensity, and the noise at different pixels is independent. This noise model follows a Poisson distribution, which, except at very low intensity levels, approximates a Gaussian distribution.

4. Speckle noise: a granular noise that exists inherently in an image and corrupts its quality. This noise can be generated by multiplying random values with the pixel intensities of an image [48]. A fundamental challenge in optical and digital holography is the presence of speckle noise in the image reconstruction process.
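The additive models in eqs. (2.3)–(2.5) can be sketched as follows; the function names and the assumption of intensities in [0, 1] are illustrative choices, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

def add_gaussian_noise(I, sigma):
    """y = I + sigma * eta with eta ~ N(0, 1), as in eq. (2.4)."""
    return I + sigma * rng.standard_normal(I.shape)

def add_salt_pepper_noise(I, p):
    """Keep each pixel with probability p; otherwise replace it with
    0 (pepper) or 1 (salt) via a Bernoulli(0.5) draw, as in eq. (2.5)."""
    keep = rng.random(I.shape) < p
    b = (rng.random(I.shape) < 0.5).astype(I.dtype)   # b ~ Ber(0.5), values {0, 1}
    return np.where(keep, I, b)

I = rng.uniform(0.2, 0.8, size=(64, 64))   # toy image, intensities in [0, 1]
y_gauss = add_gaussian_noise(I, sigma=0.1)
y_sp = add_salt_pepper_noise(I, p=0.5)     # about half the pixels corrupted
```

Note that with p = 0.5 roughly half the pixels survive untouched, which is why median-type filters (which ignore extreme values in a neighborhood) work well on this noise.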
2.6 Evaluation metrics

The aim of a denoising algorithm is to recover the original image as faithfully as possible from its noise-corrupted version. To evaluate denoising algorithms, different image quality assessment measurements have been adopted to compare the denoised estimate against the ground-truth high-quality image. Below, three popular quantitative measurements are discussed, among which PSNR is the most commonly used metric.

2.6.1 MSE

The Mean Squared Error (MSE) of an estimator (of a process for estimating an unobserved quantity) measures the average squared difference between the actual and estimated values. MSE is equivalent to the expected value of the squared error loss and signifies the quality of an estimator. It is always non-negative, and values closer to zero are better. For a given noise-free m × n monochrome image I and its noisy approximation K, MSE is mathematically defined as [53]

MSE = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]²    (2.6)

2.6.2 PSNR

Peak signal-to-noise ratio (PSNR) is the ratio between the maximum possible power of a signal and the power of the contaminating noise that affects the fidelity of its representation [53]. PSNR is defined via the mean squared error (MSE). Given the ground-truth image I and denoised estimate K, based on MSE, the definition of PSNR is:

PSNR = 10 log10(MAX_I² / MSE)    (2.7)

In the above equation, MAX_I is the maximum possible pixel value of the image. This value is 255 when the pixels are represented using 8 bits per sample. While both MSE and PSNR are well accepted and heavily used in several applications, they do not correlate well with the visual perception of the human visual system, which is highly non-linear and complex [54][55][56]. Thus, they are not
a good fit to measure the perceptual similarity between two images. Yet, PSNR is still the most commonly used index to compare two images.

2.6.3 SSIM

Besides MSE and PSNR, perceptual quality measurements have also been proposed to evaluate denoising algorithms. One representative measurement is the structural similarity (SSIM) index [57]. SSIM is a procedure for estimating the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos, and is used for measuring the similarity between two images [58]. The SSIM index can be calculated on various windows of an image. The measure between two windows x and y of common size N × N is [58]:

SSIM(x, y) = [(2μxμy + c1)(2σxy + c2)] / [(μx² + μy² + c1)(σx² + σy² + c2)]    (2.8)

where μx, μy are the means of x and y, σx², σy² are the variances of x and y, σxy is the covariance, c1 = (k1 L)² and c2 = (k2 L)² are two variables to stabilize the division with a weak denominator, L is the dynamic range of the pixel values, and k1 = 0.01 and k2 = 0.03 by default [57][58]. The SSIM formula above is based on three comparison measurements between the samples x and y: luminance (l), contrast (c), and structure (s).

l(x, y) = (2μxμy + c1) / (μx² + μy² + c1)    (2.9)
c(x, y) = (2σxσy + c2) / (σx² + σy² + c2)    (2.10)
s(x, y) = (σxy + c3) / (σxσy + c3)    (2.11)

where c3 = c2/2. Thus, using the above three definitions, equation 2.8 can be rewritten as [58]:

SSIM(x, y) = l(x, y)^α · c(x, y)^β · s(x, y)^γ    (2.12)

and setting the weights α = β = γ = 1 recovers the form of equation 2.8.
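Eqs. (2.6)–(2.8) can be sketched directly in NumPy; the SSIM here is evaluated on a single window for simplicity (real implementations, as noted above, slide a window over the image and average the local scores), and all function names are illustrative:

```python
import numpy as np

def mse(I, K):
    """Eq. (2.6): mean squared error between images I and K."""
    return np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)

def psnr(I, K, max_i=255.0):
    """Eq. (2.7): PSNR in dB (infinite when the images are identical)."""
    m = mse(I, K)
    return np.inf if m == 0 else 10.0 * np.log10(max_i ** 2 / m)

def ssim_window(x, y, L=255.0, k1=0.01, k2=0.03):
    """Eq. (2.8) on one window of pixels x, y."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Worked example: an all-black vs. an all-white 8-bit image.
I = np.zeros((8, 8))
K = np.full((8, 8), 255.0)
# mse(I, K) = 255^2, so psnr(I, K) = 10 * log10(255^2 / 255^2) = 0 dB.
```

An identical pair gives SSIM = 1 and PSNR = ∞, which makes these metrics convenient sanity checks for a denoising pipeline.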
Chapter 3

Image Denoising Algorithms: Review

In this chapter we attempt to capture the research work and methods developed to date for the open challenge of image denoising. There exist several ways to classify existing image denoising algorithms. Three popular approaches are:

• Inspired by image processing concepts [59]: spatial domain, transform domain, and neural network (NN) based methods
• Based on popular families in image restoration methods [10]: learning-based methods and learning-free methods
• Based on how an image prior is exploited to generate a high-quality estimate with respect to an input image [57]: implicit and explicit methods

Selecting the most intuitive categorization of the three, we classify existing denoising algorithms into spatial domain, transform domain, and NN based methods. Furthermore, we discuss the prior work in chronological order; as the classical spatial and transform domain algorithms have been thoroughly reviewed in previous papers [15], [60], we focus more on recently proposed NN based algorithms.
3.1 Spatial domain methods

Spatial domain techniques are traditional denoising methods applied directly to images in the form of spatial filters for noise removal [61]. Spatial domain methods can be further sub-categorized into spatial domain filtering (SDF) and variational denoising methods [59].

3.1.1 Spatial domain filtering

Spatial domain filtering methods can also be grouped as implicit methods as per point (3) above, and they can be divided into two classes: linear and non-linear filtering. Linear filters tend to blur sharp edges, destroy lines and other fine image details, and perform poorly in the presence of signal-dependent noise [59]. For example, a mean filter (a linear filter) is optimal for Gaussian noise in the sense of mean squared error, but it tends to over-smooth images with high noise. The Wiener filter was introduced to combat this disadvantage, but it too can easily blur sharp edges. By contrast, with non-linear filters such as median filtering [62][63] and weighted median filtering [64], noise can be suppressed without explicit identification. For example, bilateral filtering [65] is widely used for image denoising, as it is a non-linear, edge-preserving, noise-reducing smoothing filter. SDF methods, in general, adopt priors of high-quality images implicitly, where the priors are ingrained into specific restoration operations. Such an implicit modeling strategy was used in most early image denoising algorithms, some of which are discussed above [65][66][67][57]. Based on assumptions about high-quality images, heuristic operations have been designed to generate estimates directly from the degraded images; filtering-based methods, for example, rely on the smoothness assumption.
3.1.2 Variational denoising methods

Besides implicitly embedding priors into restoration operations, variational denoising methods explicitly characterize image priors and subsequently use the Bayesian framework to produce high-quality reconstruction results. Given the degradation model p(y|x) and a specific prior model p(x), different estimators can be used to estimate the latent image
x. One popular approach is the maximum a posteriori (MAP) estimator, where

x̂ = arg max_x p(x|y) = arg max_x p(y|x)p(x)    (3.1)

(using Bayes' theorem), with which we seek, given the corrupted observation and prior, the most probable estimate of x. In the case of AWGN (Additive White Gaussian Noise), equation 3.1 can be reformulated as an objective function comprising a data fidelity term (least squares) and a regularizer (discussed in more detail in later chapters). Thus, for variational denoising methods, the key is to find a suitable image prior; some successful prior models include gradient priors, non-local self-similarity (NSS) priors, sparse priors, and low-rank priors [59].

Total variation (TV) regularization

TV regularization is based on the statistical fact that natural images are locally smooth and the pixel intensity varies gradually in most regions [59]. TV regularization uses a Laplacian distribution to model image gradients, resulting in an l1-norm penalty on the gradients of the estimated image. Mathematically, it can be defined as [68]:

R_TV(x) = ||∇x||_1

where ∇x is the gradient of x. Total variation is one of the most extensively used image priors: it promotes sparsity in image gradients, allows effective calculation of the optimal solution, and retains sharp edges. Although it has been shown to be beneficial in a number of applications [69][70][71] and is one of the most notable methods for image denoising, it has a few limitations. The three main disadvantages are that 1) textures tend to be over-smoothed, 2) flat areas are approximated by a piecewise-constant surface, resulting in a stair-casing effect, and 3) the resultant image suffers from losses of contrast [72][68][73][59].
Extensive studies have sought to improve the performance of the TV regularizer in image smoothing by adopting partial differential equations, while others have proposed wavelet filters for analysis sparsity [74][75]. Beck et al. [76] proposed a fast gradient-based method for constrained TV that also provides a generic framework covering other types of non-smooth regularizers. [77] proposed a statistical model that captures the heavy-tailed distribution of coefficients via a robust penalty function (the lp norm), and [78] introduced a normalized sparsity measure.
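To make the TV-regularized objective concrete, here is a minimal sketch (NumPy, restricted to a 1-D signal for readability) of plain gradient descent on the least-squares-plus-TV objective that arises from the MAP reformulation above. The smoothing constant eps, step size, and weight lam are illustrative choices, not values from the cited works.

```python
import numpy as np

def tv_denoise_1d(y, lam=1.0, step=0.004, iters=5000, eps=1e-4):
    """Gradient descent on  0.5*||x - y||^2 + lam * sum_i |x[i+1] - x[i]|,
    with |t| smoothed as sqrt(t^2 + eps) so the gradient exists everywhere."""
    x = y.copy()
    for _ in range(iters):
        d = np.diff(x)
        w = d / np.sqrt(d * d + eps)   # derivative of the smoothed |.|
        g = np.zeros_like(x)
        g[1:] += w                     # +phi'(x[k] - x[k-1])
        g[:-1] -= w                    # -phi'(x[k+1] - x[k])
        x -= step * ((x - y) + lam * g)
    return x

# Piecewise-constant signal plus noise: TV should recover the flat pieces.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.2 * rng.standard_normal(100)
denoised = tv_denoise_1d(noisy)
```

On this toy signal the output is nearly piecewise constant, and the mild shrinkage of the jump illustrates the loss of contrast listed among TV's disadvantages.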
3.2 Transform domain methods

In contrast to spatial-domain filtering methods, transform-domain (TD) filtering methods first transform the given noisy image to another domain and then apply a denoising procedure to the transformed image based on the differing characteristics of image and noise (larger coefficients represent the high-frequency part, i.e., the details and edges of the image, whereas smaller coefficients represent the noise). These methods rest on the key observation that image information and noise have different characteristics in the transform domain. Transform-domain filtering methods can be further subdivided according to the chosen basis transform functions, which may be data-adaptive or non-data-adaptive [79].

3.2.1 Data-adaptive methods

Examples of data-adaptive transforms are independent component analysis (ICA) [80][86] and PCA [81][82], both of which are applied as transform tools to the given noisy images. A main disadvantage of data-adaptive methods is their high computational cost, since they use sliding windows and need a sample of noise-free data, or at least two image frames of the same scene [59]. In some applications, however, it can be challenging to obtain noise-free training data.

3.2.2 Non-data-adaptive methods

Non-data-adaptive TD filtering methods can be further subdivided into two domains: the spatial-frequency domain and the wavelet domain. We will not discuss the spatial-frequency domain in this work, but we briefly mention the wavelet transform because it is one of the most researched transform techniques. The wavelet transform [83] decomposes the input data into a scale-space representation, and it has been shown that wavelets can successfully remove noise while preserving image characteristics, regardless of frequency content [84][85][86][87][59].
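The transform-domain principle — signal energy concentrates in a few large coefficients while noise spreads thinly over all of them — can be illustrated with a deliberately simple sketch: a 1-D FFT with hard thresholding. This illustrates the principle only, not any of the cited methods (which use wavelets or other bases and more careful shrinkage rules); keep_ratio is an illustrative knob.

```python
import numpy as np

def fft_threshold_denoise(y, keep_ratio=0.02):
    """Transform, keep only the largest coefficients (assumed to carry the
    signal), zero out the rest (assumed to be mostly noise), then invert."""
    Y = np.fft.fft(y)
    thresh = np.quantile(np.abs(Y), 1.0 - keep_ratio)
    Y[np.abs(Y) < thresh] = 0.0
    return np.real(np.fft.ifft(Y))

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 512, endpoint=False)
clean = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)  # 4 nonzero bins
noisy = clean + 0.3 * rng.standard_normal(512)
denoised = fft_threshold_denoise(noisy)
```

Because the two sinusoids occupy only four frequency bins, almost all of the discarded coefficients are pure noise, and the reconstruction error drops well below that of the noisy input.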
3.2.3 Block-matching and 3D filtering: BM3D

BM3D [1] is a non-local, adaptive, non-parametric image denoising strategy based on an enhanced sparse representation in the transform domain. The enhanced sparsity is attained by grouping similar 2D fragments of the image into 3D data arrays, called "groups". These 3D groups are processed by a special procedure called collaborative filtering, which comprises three consecutive steps: a 3D transform of the group, shrinkage of the transform spectrum, and an inverse 3D transform. This yields a 3D estimate of the group, consisting of an array of jointly filtered 2D fragments. Because of the similarity between the grouped blocks, the transform achieves a highly sparse representation of the true signal, so the noise can be well separated by shrinkage. Collaborative filtering thus exposes even the finest details shared by the grouped fragments while preserving the important unique features of each individual fragment [1][88].

Figure 3.1: Scheme of the BM3D algorithm. (credits: Marc Lebrun [9])

This is one of the most popular, powerful, and effective denoising methods and was state of the art until recently. After the original work of [1], many improved versions of BM3D were developed [89][90]. [90] proposed the block-matching and 4D filtering (BM4D) method. Dabov et al., 2007 [91] proposed an improvement of BM3D for color image denoising that filters in a highly sparse local 3D transform domain in each channel of a luminance-chrominance color space. Many follow-up works combined the sparse prior with the NSS prior [92]. [93] collected non-local similar patches and solved a group-sparsity problem to achieve better denoising results. [94][57] proposed a non-local centralized sparse representation model in which
the mean value of the representation coefficients is pre-estimated based on the patch groups.

3.3 Deep Neural Network methods

Deep learning techniques were first used in image processing in the 1980s [95] and were first applied to image denoising by Zhou et al. [96][97][98]. After that, a feed-forward network was used to reduce the high computational cost and trade off denoising efficiency against performance [99]. The feed-forward network could smooth a corrupted image with Kuwahara filters [100], which are similar to convolutions. Although these techniques were effective, the networks did not allow the addition of new plug-and-play units, which restricted their generalization ability and practical use. To overcome these limitations, convolutional neural networks (CNNs) [101] were proposed. They had a slow start after their introduction, owing to a number of then-existing issues: the vanishing-gradient problem, the high computational cost of activation functions such as sigmoid and tanh, the lack of appropriate hardware for efficient computation, and so on. After the inception of AlexNet in 2012, however, things changed, and deep architectures were widely applied to video, natural language processing, speech processing, and other fields. In recent years, several CNN-based denoising methods have been proposed [102][103][104][2][105][59], whose performance has greatly improved compared to that of [106]. Neural-network-based denoising methods can be divided into two categories: multilayer perceptron (MLP) models and deep learning methods [59]. We discuss both briefly.

A multilayer perceptron (MLP) is a class of feed-forward artificial neural network (ANN) [107]. Popular MLP-based image denoising models include the auto-encoders proposed by Vincent et al. [103] and [104]. Chen et al.
[102] introduced a feed-forward deep network called the trainable non-linear reaction diffusion (TNRD) model, which achieved a better denoising effect. In general, MLP-based methods are efficient because they involve fewer inference steps. Moreover, because optimization algorithms [108] can derive the discriminative architecture, these methods have better interpretability. On the other hand, interpretability can
increase the cost of performance [2]. [109] presented a patch-based denoising algorithm learned on a large dataset with an MLP. Its results on additive white Gaussian (AWG) noise were competitive, but the method had its limitations: generalization to other noise models was not competitive. Deep networks were first applied to image denoising tasks in 2015 [110][111]. The authors of [110] pretrained a stacked denoising auto-encoder and applied dropout to prevent co-adaptation between units; the combination of dropout and stacked auto-encoders improved performance and reduced time in the fine-tuning phase. To address multiple low-level tasks with one model, a denoising CNN (DnCNN) [2] consisting of convolutions, batch normalization (BN) [112], rectified linear units (ReLU) [113], and residual learning (RL) [7] was proposed to handle image denoising and other image restoration tasks. CNN-based denoising methods were thus a success, attributed to their large modeling capacity and to tremendous advances in network training and design. However, discriminative denoising methods at that time (2018) were limited in flexibility: the learned model was usually tailored to a specific noise level, and methods like [2] did not generalize well to noise models other than the AWGN they were trained on. To mitigate these limitations, [114] introduced a fast and flexible denoising CNN (FFDNet), which takes different noise levels together with the noisy image patch as input to the denoising network, improving denoising speed and enabling blind denoising. A generative adversarial network (GAN) CNN blind denoiser (GCBD) [115] addressed the problem of unpaired noisy images by first generating ground truth and then feeding it into the GAN to train the denoiser.
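The residual-learning idea behind DnCNN — predict the noise and subtract it from the input, rather than predict the clean image directly — can be sketched with a linear model standing in for the CNN. The toy patch distribution, ridge penalty, and dimensions below are illustrative assumptions, not details from [2].

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, dim, sigma = 2000, 16, 0.3

# Toy "training set": smooth 1-D patches (random sinusoids) plus Gaussian noise.
t = np.linspace(0, 1, dim)
freqs = rng.uniform(0.5, 2.0, size=(n_train, 1))
phases = rng.uniform(0, 2 * np.pi, size=(n_train, 1))
clean = np.sin(2 * np.pi * freqs * t + phases)
noise = sigma * rng.standard_normal((n_train, dim))
noisy = clean + noise

# Residual learning: fit a map W that predicts the NOISE from the noisy patch
# (ridge-regularized least squares stands in for the trained network).
lam = 1e-2
W = np.linalg.solve(noisy.T @ noisy + lam * np.eye(dim), noisy.T @ noise)

def denoise(y):
    return y - y @ W   # subtract the predicted residual, as in DnCNN

# Evaluate on fresh patches from the same distribution.
f_te = rng.uniform(0.5, 2.0, size=(200, 1))
p_te = rng.uniform(0, 2 * np.pi, size=(200, 1))
clean_te = np.sin(2 * np.pi * f_te * t + p_te)
noisy_te = clean_te + sigma * rng.standard_normal((200, dim))
out = denoise(noisy_te)
```

Even this linear residual predictor beats the noisy input on held-out patches; the CNN in DnCNN makes the residual predictor far more expressive but keeps this subtract-the-noise structure.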
For more complex corrupted images, a deep plug-and-play super-resolution (DPSR) method [114] was developed to estimate the blur kernel and noise and recover a high-resolution image. In this section, we discussed existing neural-network-based methods for image denoising, covering both CNN- and MLP-based methods, all of which are learning-based approaches. In the next chapter, we focus on learning-free approaches like "Deep Image Prior", whose intuition directly contradicts learning-based methods and which are the foundation of most current state-of-the-art image denoising methods.
Chapter 4

Deep Image Prior

In this chapter, we introduce the concept of Deep Image Prior (DIP) and discuss its model architecture and usage in detail. After that, we introduce a few other works built upon this concept.

4.1 Image Priors

Before discussing the seminal work of [10], it is crucial to understand the concept of image priors. Image priors are prior information about a set of images [110] that one can use in image processing (or computer vision) problems to enhance results, ease the choice of processing parameters, resolve indeterminacies, etc. These priors, or their approximations, can be converted into mathematical formulations and used as part of some central mechanism or procedure (or algorithm).

4.2 Deep Image Priors

Ulyanov et al. [10] show that the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. What makes this idea outstanding is that it directly contradicts the traditional understanding that the excellent denoising performance of deep convolutional networks is due to their ability to learn realistic image priors from large numbers of example images. With this novel concept, there is no need to train a network on a dataset, or even
perform any training at all. To show this, we apply untrained convolutional neural networks (ConvNets): instead of training a ConvNet on a large dataset of sample images, we fit a generator network to a single corrupted image. The network weights serve as a parameterization of the restored image; they are randomly initialized and fitted to a specific degraded image under a task-dependent observation model. In this way, the only information used to perform image reconstruction is contained in the single noisy input image and in the handcrafted structure of the network used for reconstruction [10]. The four important empirical results contributed by [10] are:

• Low-level statistics can be captured by an untrained network.
• The model parameterization presents a high impedance to image noise and hence can naturally be used to filter noise out of a given image. The DIP method can also work under a blindness assumption.
• The choice of deep generator ConvNet architecture does have an impact on results, as different architectures impose rather different priors.
• DIP is similar to BM3D, one of the most popular transform techniques, in that both exploit self-structure and similarity.

4.2.1 Method

First, it is important to see the mathematical formulation of deep image priors, on which some important later concepts pivot. A function with a one-dimensional input and a multidimensional output can be thought of as drawing a curve in space; such a function is called a parametric function (its input is called a parameter) [116]. A deep generator network is an example of a parametric function that maps a code vector z to an image x [31].
x = fθ(z)   (4.1)

If we interpret the neural network as the parameterization of the image x ∈ R3×H×W (channels, height, width) given in equation 4.1, then the code z is a fixed random tensor z ∈ RC0×H0×W0. The neural
network can then be viewed as mapping the parameters θ (the weights and biases of the filters in the network's layers) to the image x. To model the conditional image distribution p(x|x0), where x is a natural image and x0 its corrupted version, image restoration (denoising) can be viewed as an energy minimization problem [10][117]:

x∗ = arg min_x E(x; x0) + R(x)   (4.2)

where E(x; x0) is a task-dependent data term and R(x) is a regularizer that is not tied to a specific application and captures the generic regularity of natural images. Instead of using an explicit regularizer in equation 4.2, the work [10] shows that the implicit prior captured by the neural network parameterization performs better for all image restoration tasks. The formulation can thus be rewritten as [10]:

θ∗ = arg min_θ E(fθ(z); x0),   x∗ = fθ∗(z)   (4.3)

The local minimizer θ∗ can be obtained with an optimizer such as gradient descent, starting from a random initialization of the parameters θ (Figure 4.1). As the only information available to the restoration task is the noisy image x0, given equation 4.3 the denoised result is obtained as x∗ = fθ∗(z). Another important fact demonstrated by [10] is that the choice of network architecture has a major impact on how the solution space is searched by methods such as gradient descent. This is an important observation: even though almost any image can be fitted by the model, they show empirically that the choice of architecture affects performance differently across image restoration tasks, and that the network resists "bad" solutions and descends much more quickly towards natural-looking images. As a result, minimizing equation 4.3 either yields a good-looking local optimum or at least an optimization trajectory that passes near one, as shown in Figure 4.1 [10].
To understand this idea better, consider it mathematically: given a basic reconstruction task in which the target image is x0 and we want to find parameter values θ∗ that reproduce the original image, the E(x; x0) term in equation 4.3 can
Figure 4.1: Image space visualization for DIP. Assume the problem of reconstructing an image xgt from a degraded measurement x0, exemplified here by denoising; the ground truth xgt has non-zero cost E(xgt, x0) > 0. If run for long enough, fitting with DIP will reach a solution with near-zero cost quite distant from xgt. However, the optimization path often passes close to xgt, and early stopping (here at time t3) recovers a good solution. Source: Ulyanov et al. [10]

be modelled as the L2 distance that compares the generated image x with x0:

E(x; x0) = ||x − x0||²   (4.4)

⇒ min_θ ||fθ(z) − x0||²   (4.5)

fθ(z) is (typically) a deep CNN with a U-shaped architecture [118]. One reason such networks are preferred is that one can draw samples from a DIP by taking random values of the parameters θ and looking at the generated images fθ(z); equivalently, we can visualize the starting points of the optimization process (eq. 4.3) before even fitting the parameters to the noisy input image. [10] shows empirically that these samples exhibit spatial structures and self-similarities whose scale depends on the network depth. Adding skip connections therefore yields images containing structures of several characteristic scales, as is desirable for modeling natural images, and it is natural that such architectures are the most popular choice for generative ConvNets. This U-shaped architecture is an encoder-decoder ("hourglass") network with skip connections; Leaky-ReLU is used as the activation function and ADAM as the optimizer.
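Equations 4.3–4.5 can be played out end to end at toy scale. Below, a two-layer fully connected "generator" with a fixed random code z is fitted by plain gradient descent to a single noisy 1-D signal x0; the sizes, initialization, and learning rate are stand-ins chosen only so that the mechanics of θ∗ = arg min_θ ||fθ(z) − x0||² are visible (real DIP uses the convolutional hourglass network and ADAM described above).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 128, 32      # signal length, hidden width, code size

# The single corrupted observation x0 -- the only data the method ever sees.
t = np.linspace(0, 1, n)
x0 = np.sin(2 * np.pi * 2 * t) + 0.3 * rng.standard_normal(n)

# Fixed random code z and random weights theta = (W1, W2):
# the "image" is parameterized as x = f_theta(z) = W2 relu(W1 z)   (eq. 4.1).
z = rng.standard_normal(k)
W1 = rng.standard_normal((m, k)) * np.sqrt(2 / k)
W2 = rng.standard_normal((n, m)) * np.sqrt(2 / m)

lr, losses = 1e-3, []
for _ in range(500):
    h = W1 @ z
    a = np.maximum(h, 0.0)
    out = W2 @ a
    dout = out - x0                   # gradient of 0.5*||f(z) - x0||^2 w.r.t. out
    losses.append(0.5 * float(dout @ dout))
    dW2 = np.outer(dout, a)           # backprop through out = W2 a
    dh = (W2.T @ dout) * (h > 0)      # backprop through relu
    dW1 = np.outer(dh, z)
    W2 -= lr * dW2
    W1 -= lr * dW1
```

The loss drives f_theta(z) onto x0, noise included — which is exactly why real DIP must stop early, as Figure 4.1 illustrates.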
Figure 4.2: Image restoration with DIP. Starting from random weights θ0, the weights are iteratively updated to minimize the data term in eq. 4.3. At every iteration t the weights θ are mapped to an image x = fθ(z), where z is a fixed tensor and the mapping f is a neural network with parameters θ. The image x is used to compute the task-dependent loss E(x, x0); the gradient of the loss with respect to the weights θ is then computed and used to update the parameters. Source: Ulyanov et al. [10]

Figure 5.1 in the next chapter depicts the hourglass architecture. All works that improve on DIP share the architectural details described here.

4.2.2 Important Results

The DIP paper shows experimental results for various image restoration tasks, including single image denoising. The authors also claim that their model works under the blind denoising assumption, where the noise model is unknown, and can successfully recover images from complex corruptions. The paper shows results for the Gaussian noise model but not for impulse noise, shot noise, or other noise models. One important observation is that the DIP method performs similarly to non-local learning-free approaches like CBM3D [91] on the Gaussian noise model but outperforms them on non-Gaussian noise models. A hand-crafted prior method is one in which we embed hard constraints and teach what types of images are faces, natural scenes, etc., from synthesized data. As no part of the
neural network fθ is learned from a dataset beforehand, such a deep image prior is effectively hand-crafted, and it is empirically shown to outperform many standard non-learned priors such as TV [119] and BM3D [90], as well as a few learning-based approaches.

4.3 Related Work

The DIP method is akin to works that exploit the self-similarity of natural images and do not train on a hold-out set. In that regard, the approach is similar to BM3D (section 3.2.3) and the non-local means algorithm [120]; these methods avoid any training and rely on hand-crafted priors. The DIP work [10] demonstrated the remarkable phenomenon that CNNs can solve image restoration problems without any offline training or external data, and many algorithms have since been developed that improve upon this idea. [121] describes an improvement using "backprojection" (BP). Image restoration tasks can be formulated as minimization of a cost function composed of a fidelity term and a prior term; we saw this type of formulation in Section 4.2.1, and it can be generalized as follows:

min_x l(x, y) + β s(x)   (4.6)

where l is the fidelity term, s is the prior term, and β is a positive parameter that controls the level of regularization [121]. The backprojection fidelity term was first introduced in 2018 by [23] as an alternative to the widely used least-squares (LS) fidelity term l(x, y) = ½||y − Ax||² [121]. Empirically, for the priors we have discussed so far (e.g., TV, BM3D, and pre-trained CNNs), the BP fidelity term yields better recoveries than LS for badly conditioned A and requires fewer iterations of the optimization algorithms. In [121], the BP fidelity term is used to improve the performance of standard DIP (which uses the LS fidelity term as its loss function).
The paper, however, only examines performance on image deblurring; it remains to evaluate the method on the remaining image restoration tasks, such as image denoising and super-resolution. In another line of work, Cheng et al. [122] show that by conducting posterior inference using stochastic gradient Langevin dynamics, one
can avoid the need for early stopping, a major limitation of the current DIP approach, and improve results on image restoration tasks. They prove that DIP is asymptotically equivalent to a stationary Gaussian-process prior as the number of channels in each layer of the network goes to infinity, and they derive the corresponding kernel [122]. A Gaussian process is an infinite collection of random variables, any finite subset of which is jointly Gaussian distributed [122][123]. In another training-free approach, [124] presents a self-supervised learning method for single-image denoising: the network is trained with dropout on pairs of Bernoulli-sampled instances of the input image, and the result is estimated by averaging the predictions generated from multiple instances of the trained model with dropout. The authors show empirically that the proposed method not only significantly outperforms existing single-image non-learning methods but is also competitive with denoising networks trained on external datasets, although it still requires dropout and can be unstable without early stopping.

4.4 Limitations of Deep Image Priors

DIP is one of the most popular methods for reconstruction tasks and was state of the art until recently. Though the method has many merits, it also has limitations. One of the most significant unsolved issues is the need for early stopping: unless early stopping is employed, the PSNR (or SSIM) values drop after some number of iterations. [10] employs early stopping, and nearly all research after DIP uses some form of it. Another issue is the lack of a clear explanation of why the image prior emerges and why these priors fit the structure of natural images so well; there are no definitive answers to these questions that explain their effectiveness.
With respect to DIP's practical applications, there are two main problems: it is extremely slow, and it is unable to match or exceed the results of problem-specific methods [10]. For example, we saw in subsection 4.2.2 that its performance on the Gaussian noise model did not significantly outperform non-local state-of-the-art methods like CBM3D [91] or NLM [120]. In the work of Ulyanov et al. [10], the image corruption rate used for the denoising task is less than 0.5, so it remains
to be seen whether this method is efficient for higher corruption levels with respect to the trade-off between execution time and performance.
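The bookkeeping that early stopping forces on practitioners can be sketched generically: keep the best iterate seen so far and halt once `patience` iterations pass without improvement. In real DIP runs the ground-truth PSNR is unavailable, so the monitored score must be a proxy (e.g., the fitting loss on held-out pixels); the class below and the synthetic rise-then-fall score curve are illustrative only.

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    return 10 * np.log10(peak ** 2 / np.mean((x - ref) ** 2))

class EarlyStopper:
    """Track the best-scoring iterate; report a stop after `patience`
    consecutive non-improving iterations."""
    def __init__(self, patience=20):
        self.patience, self.best, self.bad, self.best_iter = patience, -np.inf, 0, None
    def update(self, score, iterate):
        if score > self.best:
            self.best, self.best_iter, self.bad = score, iterate.copy(), 0
        else:
            self.bad += 1
        return self.bad >= self.patience   # True -> stop now

# Synthetic score trajectory shaped like DIP's PSNR: rise, peak, decline.
scores = np.concatenate([np.linspace(10, 30, 100), np.linspace(30, 20, 200)])
stopper, stopped_at = EarlyStopper(patience=20), None
for i, s in enumerate(scores):
    if stopper.update(s, np.array([i])):
        stopped_at = i
        break
```

Here the monitor stops 20 iterations after the peak at iteration 99 and retains the peak iterate, so the later decline never contaminates the returned solution.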
Chapter 5

Rethinking Single Image Denoising

So far, we have discussed background concepts related to the image denoising task and a few of the earlier methods that attempt to tackle it, followed by some deep learning approaches, specifically deep image priors. In this chapter we introduce a few potential approaches for addressing the limitations of DIP mentioned in the last chapter and describe the experiments we conducted as part of this thesis work.1

5.1 Over-parameterization in deep learning

Although DIP has been an extremely popular method since 2017, it has its limitations. As we gain better insight into the generalizing abilities of DNNs, a new concept of learning over-parameterized models has emerged; it has been a crucial topic in machine learning since 2017 [125][11][126][127]. Over-parameterization occurs when the number of learnable parameters is much larger than the number of training samples (or, equivalently, when we fit a richer model than necessary). Deep artificial neural networks operate in this regime: they have far more trainable parameters than training examples. Nevertheless, some of these models exhibit remarkably small generalization error, i.e., the difference between "training error" and "test error". The traditional learning

1 Work done under project investigator Taihui Li and as part of the Sun research group.
theory suggests that when the number of parameters is large, some form of regularization is needed to ensure small generalization error [128]. But recent research has contradicted traditional learning theory, finding that over-parameterization empirically improves both optimization and generalization.

5.1.1 Over-parameterization vs. over-fitting

It is important to remark that over-parameterization and over-fitting are two different phenomena, and over-parameterization does not lead to over-fitting. When conventional learning theory says that over-parameterization leads to over-fitting, the parameters concerned belong to the hypothesis space from which the classifiers are constructed, whereas in deep neural networks such parameters belong to the classifier-construction part (the fully connected layers). Learning theory thus mostly concerns the training of a classifier (learner) from a feature space, but says little about the construction of the feature space itself. Though we can use the conventional theory to reason about generalization, we must be cautious when applying it to representation learning; this is demonstrated and debated convincingly in [129].

5.1.2 Regularization

Regularization is a collective group of strategies explicitly designed to reduce test error, so that an algorithm performs well not only on training data but also on unseen inputs. Many forms of regularization are available to the deep learning practitioner; in fact, developing more effective regularization strategies has been one of the major research efforts in the field. Two major types of regularization are relevant to our discussion: implicit and explicit regularization.
• Regularization introduced either as an explicit penalty term or by modifying optimization through, e.g., dropout, weight decay, or one-pass stochastic methods is referred to as explicit regularization. A lot of work has been done on understanding the effects of explicit regularization on training data and on the performance of deep learning models.
• Implicit regularization means that some form of regularization is introduced into a model implicitly. For example, Neyshabur et al., 2014 [130] argue that the low generalization error seen with over-parameterized models is caused by an implicit regularization introduced by the optimization of the network: the objectives for learning high-capacity (over-parameterized) models have many global minima that fit the training data perfectly. Implicit regularization was adopted in DIP as well.

5.2 Low-rank matrix recovery problem

In this section, we develop the intuition behind the low-rank matrix recovery problem and review prior work that attempts to solve it. This problem is important to study because the network fθ in DIP [10] (chapter 4) has a U-shaped architecture and can be viewed as a multi-layer, nonlinear extension of the low-rank matrix factorization X = UUT. DIP therefore inherits the drawbacks of the exact-parameterization approach to low-rank matrix recovery: it requires either a meticulous choice of network width or early stopping of the training process [11]. Low-rank matrices play an essential role in modeling and computational methods for machine learning. They lay the foundation for classical techniques such as principal component analysis [131][132][133] as well as for modern approaches to multi-task learning [134][135] and natural language processing. Specifically, they have broad applications in face recognition [136] (where saturation in brightness, self-shadowing, or specularity can be modeled as outliers), video surveillance [136][11] (where foreground objects are usually modeled as outliers), and beyond. However, the matrices we are ultimately interested in can be extremely large; although memory (and data acquisition) costs are getting cheaper, this will only encourage bigger matrix sizes.
This causes a number of issues, primarily that fully observing the matrix of interest can prove impossible; observations can also be corrupted with large errors. In such cases we are left with a highly incomplete set of observations, and unfortunately many of the most popular approaches to processing the data in low-rank matrix applications assume that a fully sampled data set is
available. These approaches are also generally not robust to missing or incomplete data. Thus, we face an inverse problem of retrieving the full matrix from incomplete observations. While such recovery is not possible in general, when the matrix has low rank it is possible to exploit this structure and execute the recovery in an astonishingly efficient manner [133]. Low-rank matrix recovery is therefore an essential step towards many of the actual applications discussed above. There are several methods for solving low-rank matrix recovery problems, of which the most commonly used in practice are low-rank approximation, nuclear norm minimization, iterative hard thresholding, and alternating projections. A long-established method for low-rank matrix recovery is nuclear norm minimization, which is provably accurate under certain incoherence conditions [136][137]. However, minimizing the nuclear norm involves expensive singular value decompositions (SVD) of large matrices [133] (when n is large), which forbids its application to problems of practical size. These issues have been mitigated by the recent development of matrix factorization methods [138][139], which rely on parameterizing the signal X ∈ Rn×n via the factorization X = UUT. This gives rise to a non-convex optimization problem with respect to U ∈ Rn×r, where r is the rank of X∗ [140].
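A stripped-down instance of the factorization approach makes the implicit bias visible. Here gradient descent runs on the factor U of X = UUT, with the measurement operator taken, for simplicity, to be full observation of X∗ (compressed or corrupted measurements are what make the real problem hard, so this is an illustrative assumption). Note the over-parameterized choice r0 = n together with small initialization: no explicit rank constraint appears anywhere, yet the recovered matrix ends up low-rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3

# Ground-truth rank-r PSD matrix X* = U* U*^T; the solver never sees r.
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T

# Over-parameterized factor (r0 = n columns); small init drives the implicit bias.
U = 0.01 * rng.standard_normal((n, n))
lr = 0.002
for _ in range(5000):
    R = U @ U.T - X_star           # residual under full observation
    U -= lr * 2.0 * (R @ U)        # gradient of 0.5 * ||U U^T - X*||_F^2

rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
eigs = np.linalg.eigvalsh(U @ U.T)  # ascending; only ~r of them end up significant
```

Directions outside the true signal subspace start at scale 0.01 and are never amplified, so after convergence only about r eigenvalues of UUT are non-negligible — the algorithmic low-rank regularization the text describes.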
Finally, we will present an ensemble approach that combines concepts from these two papers and argue why their combination might provide insight for solving current DIP limitations and be a step towards robust image recovery.
5.3.1 Image denoising via Implicit Bias of Discrepant Learning Rates

In [11], the authors show that the challenges associated with exact-parameterization methods can be dealt with simply and effectively via over-parameterization and discrepant learning rates, supported by recent results in [126][142] for low-rank matrix recovery. The key to the success of this method is the notion of the implicit bias of discrepant learning rates: the algorithmic low-rank and sparse regularizations need to be balanced in order to discern the underlying rank and sparsity. In the absence of a means for tuning a regularization parameter [11], the authors show that the desired balance can be acquired by using different learning rates for different optimization parameters. The following four subsections summarize the main results and algorithms discussed in the paper.

Double Over-Parameterization Formulation

In [11], the aim is to learn an unknown signal X∗ ∈ Rn×n from its grossly corrupted linear measurements:

y = A(X∗) + s∗   (5.1)

where A(·) : Rn×n → Rm is the measurement operator and s∗ ∈ Rm is a sparse corruption vector (this formulation is similar to the one discussed in the section on noise models). Equivalently, it is the problem of recovering a rank-r (r ≪ n) positive semi-definite matrix X∗ from the grossly corrupted linear measurements in equation 5.1. The work introduces a double over-parameterization approach for robust matrix recovery, with X = UUT and s = g ◦ g − h ◦ h:

min_{U∈Rn×r0, {g,h}⊆Rm} f(U, g, h) := ¼ ||A(UUT) + (g ◦ g − h ◦ h) − y||²   (5.2)

where the dimensional parameter r0 ≥ r. In practice, the choice of r0 depends on how much prior information we have about X∗: it can either be taken as an estimated upper bound on r or set to r0 = n with no prior knowledge.
Thus, the authors introduce a method based on over-parameterizing both the low-rank matrix X∗ and the outliers s∗, thereby leveraging implicit algorithmic bias to find the correct solution (X∗, s∗).
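The doubly over-parameterized objective of equation 5.2 is straightforward to express in code. The sketch below is illustrative only: it assumes the measurement operator A is simple flattening (pure denoising), and the problem sizes, corruption rate, and initialization scale are hypothetical choices, not the paper's settings.

```python
import torch

torch.manual_seed(0)
n, r, r0 = 16, 2, 16          # true rank r; over-parameterized width r0 (here r0 = n)
m = n * n

L = torch.randn(n, r)
X_star = L @ L.T              # rank-r positive semi-definite ground truth

def A(X):
    # Measurement operator: identity/flattening, an assumption for this sketch.
    return X.reshape(-1)

s_star = torch.zeros(m)
idx = torch.randperm(m)[: m // 10]       # ~10% sparse corruption
s_star[idx] = 5.0 * torch.randn(m // 10)
y = A(X_star) + s_star                   # corrupted measurements, as in Eq. (5.1)

# Double over-parameterization: X = U U^T and s = g∘g - h∘h, small random init.
U = (0.1 * torch.randn(n, r0)).requires_grad_()
g = (0.1 * torch.randn(m)).requires_grad_()
h = (0.1 * torch.randn(m)).requires_grad_()

def f(U, g, h):
    # Objective of Eq. (5.2): (1/4) || A(UU^T) + (g∘g - h∘h) - y ||_2^2
    residual = A(U @ U.T) + (g * g - h * h) - y
    return 0.25 * (residual ** 2).sum()

loss = f(U, g, h)
loss.backward()               # gradients flow to all three parameter blocks
```

Note that the sparse term g ◦ g − h ◦ h can represent any real vector (positive parts via g, negative parts via h), so the parameterization loses no generality.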
Algorithmic Regularization via Gradient Descent

In general, over-parameterization leads to under-determined problems that can have an infinite number of solutions (analogous to linear algebra, where the number of parameters exceeds the number of equations). Thus, not all solutions of the doubly over-parameterized equation 5.2 correspond to the desired (X∗, s∗). The paper shows, empirically and theoretically, that gradient descent iteration on equation 5.2 with properly selected learning rates enforces an implicit bias on the solution path and thereby automatically identifies the desired, regularized solution (X∗, s∗). Proofs of these results are beyond the scope of this work.

Implicit Bias with Discrepant Learning Rates

Optimizing a linear multi-layer neural network via gradient descent leads to a low-rank solution; this phenomenon is known as implicit regularization. It has been extensively studied in the context of matrix factorization [126][143][144], linear regression [145][146], logistic regression [125], and linear convolutional neural networks [127]. It is well known that optimization algorithms like gradient descent introduce implicit biases (without early stopping) [125] that play a crucial role in the generalization ability of learned models, but how to control the implicit regularization of gradient descent remains an open challenge. The authors also derive theoretically the value of the penalty λ as the algorithm approaches convergence for the unconstrained Lagrangian formulation of the rank-r matrix X∗. They further prove that the implicit regularization can be controlled, without explicitly adding any regularization term to equation 5.2, by adapting the ratio of the learning rates. This observation directly contradicts conventional optimization theory [147], which holds that learning rates affect only the convergence rate of an algorithm, not the quality of its solution.
Extension to Natural Image Denoising

Finally, combining the ideas from the above subsections to solve the image restoration problem, the approach in [11] is inspired by [10]'s DIP method: the formulation of equation 5.1 is used with X = φ(θ), where φ is a deep convolutional network and θ ∈ R^c represents the network parameters.
Figure 5.1: Architecture used in [10], which is also the base architecture for You et al. [11]. The hourglass (also known as encoder-decoder) architecture sometimes has skip connections, represented in yellow. nu[i], nd[i], ns[i] correspond to the number of filters at depth i for the upsampling, downsampling, and skip connections respectively; ku[i], kd[i], ks[i] correspond to the respective kernel sizes. Source: Ulyanov et al. [10]

With respect to implementation details, the network φ(θ) is similar to that of the original DIP work [10]. It has the same U-shaped architecture with skip connections, where each layer contains a convolutional layer, a LeakyReLU layer, and a batch normalization layer. The noise model for the images in You et al. is salt-and-pepper noise. The ideas presented in [11] are therefore promising: due to the algorithmic bias of discrepant learning rates, the need to tune the network width or terminate early is eliminated, because the method alleviates the problem of over-fitting in robust image recovery. This advantage also enables the method to recover different image types at varying corruption levels without tuning the network's learning parameters; that is, for different noise models one would not need to change the network width or other learning parameters.

5.3.2 Implicit Rank-Minimizing Autoencoder

Autoencoders (AE) are a popular category of methods for learning representations without requiring labeled data; we discussed them in more detail in Section 2.4. An essential component of autoencoder methods is the mechanism by which the information capacity
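The fitting loop shared by DIP and You et al. can be sketched compactly. This is a minimal stand-in, not the full hourglass of Figure 5.1: the network below is a hypothetical three-layer ConvNet with the conv / batch-norm / LeakyReLU constituents described above, the image is a random tensor, and the layer widths and iteration count are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = W = 32
y = torch.rand(1, 3, H, W)            # noisy observation (random stand-in image)
z = 0.1 * torch.randn(1, 32, H, W)    # fixed random input code; never updated

# Tiny stand-in for phi(theta): conv -> BN -> LeakyReLU blocks, sigmoid output.
phi = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

with torch.no_grad():
    init_loss = torch.abs(phi(z) - y).mean().item()

# Fit theta so that phi_theta(z) matches y; only the network weights move.
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = torch.abs(phi(z) - y).mean()   # l1 fidelity, as in our DIP-l1 variant
    loss.backward()
    opt.step()
```

The denoised estimate is phi(z) at a well-chosen iteration; as discussed below, without further regularization the loop eventually fits the noise in y, which is exactly the early-stopping problem.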
of the latent representation is minimized or limited. In [141], the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent in multi-layer linear networks leads to minimum-rank solutions.

5.4 Proposed Methodology

Now that we understand the two main ideas presented in this chapter, we discuss an ensemble method that is an amalgamation of them. By exploiting the double over-parameterization of the two low-dimensional structures in the image restoration objective, along with discrepant learning rates to regularize the optimization path, we can forgo the need for early termination and parameter tuning, while adding additional linear layers between the encoder and decoder yields the minimum-rank regularized solution and thereby ensures convergence.

With respect to network details, our method is not identical in structure to You et al. [11], which uses a U-shaped architecture with skip connections like DIP. Our method also has a U-shaped architecture with residual connections, but with some additional changes. We change the constituents of the encoder-decoder blocks (shown in figure 5.1) from vanilla ConvNet layers to ResNet blocks: our method uses multiple ResNet blocks, each consisting of three convolutional, batch normalization, and LeakyReLU layers. This contrasts with the original DIP, which did not have batch normalization layers or LeakyReLU, and with You et al. [11], who used ReLU activations instead. Also, unlike You et al. [11], we do not over-parameterize our noise model. Another change with respect to DOP (we will sometimes refer to You et al.'s work as the "Double Over-Parameterized Prior", or DOP) is that we add three additional linear layers between the encoder-decoder blocks, inspired by Jing et al. [141]. In the next chapter, we share results for different numbers of linear layers.
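The linear-bottleneck idea from Jing et al. [141] amounts to stacking extra linear maps, with no nonlinearity between them, at the encoder-decoder junction; gradient descent then implicitly biases their product towards low rank. A minimal sketch, where the 1×1 convolutions, channel count, and surrounding encoder/decoder are hypothetical stand-ins for our actual blocks:

```python
import torch
import torch.nn as nn

def linear_bottleneck(channels: int, k: int) -> nn.Sequential:
    # k purely linear layers (1x1 convs, no activation) between them;
    # their composition is a single linear map whose effective rank is
    # implicitly minimized by gradient descent.
    return nn.Sequential(*[nn.Conv2d(channels, channels, 1) for _ in range(k)])

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU())
bottleneck = linear_bottleneck(32, k=3)   # we experiment with k = 3, 6, 9
decoder = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1))

model = nn.Sequential(encoder, bottleneck, decoder)
out = model(torch.randn(1, 3, 16, 16))    # shape preserved end to end
```

Because the bottleneck is linear, it adds regularization without adding representational power, which is the point: capacity of the code is limited implicitly rather than by shrinking the architecture.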
In the default configuration we always use the l1 loss, in contrast to DIP and DOP, which use the MSE loss. Thus, towards an effective solution to our problem of single-image denoising, we have proposed a method that ensures a minimum-rank regularized solution via double over-parameterization of both the minimum-rank matrix signal and the sparse corruption vector, leveraging implicit algorithmic bias. In this chapter, we laid the foundations of this
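The choice of the l1 loss over MSE is motivated by robustness to sparse corruption: a squared loss is dominated by large outliers, while the l1 loss grows only linearly in their magnitude. A tiny worked example (values chosen for illustration):

```python
import torch

clean = torch.zeros(100)
noisy = clean.clone()
noisy[0] = 10.0                       # one salt-and-pepper-style outlier

mse = ((noisy - clean) ** 2).mean()   # 10^2 / 100 = 1.0: outlier dominates
l1 = (noisy - clean).abs().mean()     # 10 / 100 = 0.1: linear in the outlier
```

Under MSE, the single corrupted pixel contributes as much error as 100 pixels each off by 1; under l1 its influence is ten times smaller, so the fit is pulled less strongly towards reproducing the corruption.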
approach with theoretical arguments. In the next chapter, we detail the experimental steps and results for our approach.
Chapter 6

Preliminary Experiments

In this chapter, we present results and analyze observations from preliminary experiments conducted during the course of this thesis work. The results stated below are part of an ongoing effort towards solving some of the limitations of single-image denoising (as discussed in chapters 4 and 5).

6.1 Dataset

In this thesis work, we have focused on image restoration, specifically single-image denoising. For developing the algorithm we use the popular set of 8 images employed widely in almost all image denoising works: Lena, peppers, F16-GT, Barbara, Lake, Kodak Inc., baboon, and snail. These are standard benchmarking images, and we have also tested our algorithm on a subset of this set: F16 and Lena.

6.2 System Configuration

We used Google Colab Pro with an NVIDIA Tesla P100 GPU, 16 GB of system RAM, and 100 GB of Google cloud storage for development and experimentation. In our work, we used PyTorch 1.8.1.
6.3 Hyper-parameter Tuning

We tuned our models over all combinations of the following parameters: learning rate, optimizer, kernel size, activation function, and input noise model. We varied the learning rates from 1e-5 to 0.1. The optimizers used are Adam [148] and SGD [149], and the activation functions used are ReLU and Leaky-ReLU [150]. The input noise models are Gaussian and salt-and-pepper, with corruption rates varying from 20% to 90%.

6.4 Results and Observations

In the figures below, we show results for three different methods: DIP, You et al. [11], and our proposed approach (outlined in the previous chapter). All results shared use SGD as the optimizer with learning rates of 0.01, τ = 1, and 0.1. For DIP, we use a variant of the original method with the l1 loss; DIP-l1 gives far better results than MSE-based DIP. Similarly, for You et al. and our proposed approach, we use the l1 loss. We report results as PSNR values, where for DIP-l1 we output the last iterations averaged with an exponential sliding window (as reported in the paper, the averaged output of the model gives excellent results in blind image denoising). Below we show results of our architecture for the image Lena. The noise models used are salt-and-pepper at various corruption levels (from 0.5 to 0.9) and Gaussian noise with σ = 25. In Figure 6.1, we depict Lena, its 60%-corrupted version with salt-and-pepper noise, and the image reconstructed using our proposed denoising algorithm. Figures 6.4, 6.5, and 6.6 compare the performance of all three methods for the same corruption type and level. We performed experiments at various corruption levels for all three methods and found that as the corruption level increases (salt-and-pepper noise), the best PSNR values decrease and the reconstruction becomes noisier. This is depicted in Figure 6.9. Our method and You et al.
[11] perform better than DIP at higher corruption levels, while our method mostly performs best of all for corruption levels below 70%. At higher corruption levels, You et al. [11] (the DOP method) is slightly better, but on average our method is the most stable and consistent performer. The results reported in this chapter are all based on the l1 loss; even for DIP we use its l1-loss
variant, whereas the original paper [10] uses the MSE loss. Another important observation is that You et al. [11]'s method, if used with the Adam optimizer, needs early termination to avoid the dip in PSNR values; the same trend is seen in our approach. Figures 6.7-6.8 depict this observation for the You et al. [11] (DOP) approach and show that the network starts learning noise in the absence of early stopping: the reconstruction PSNR worsens as training progresses. Regarding another novelty of our method, the linear layers added between the encoder-decoder blocks, there are important observations on the effect of increasing their number. Jing et al. prove that adding more linear layers increases the regularization effect, and this claim was consistent with our observations, along with another important result: on increasing the number of linear layers from 3 to 6, as the regularization effect increases, the time to reach the best PSNR decreases (as shown in Figure 6.10). Not only does the network become comparatively more stable, the number of epochs needed to reach the best PSNR also decreases. Averaged over 3 separate runs, the number of epochs for 3 and 6 linear layers was 19625 and 15375 respectively. For 9 linear layers, the average over 3 trials was 16375, which is fewer than for 3 layers but slightly more than for 6. With more trials, we might see a clearer trend. (Note: Figures 6.9-6.10 report results on the F16-GT image.)
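For reproducibility of the setup above, the salt-and-pepper corruption and the PSNR metric can be sketched as follows. These are generic helper implementations under our assumptions (images in [0, 1], corruption split evenly between salt and pepper), not code extracted from our experiment scripts.

```python
import torch

def salt_and_pepper(img: torch.Tensor, rate: float) -> torch.Tensor:
    # Corrupt a fraction `rate` of pixels: half forced to 0 (pepper),
    # half forced to 1 (salt). Assumes img values lie in [0, 1].
    out = img.clone()
    mask = torch.rand_like(img)
    out[mask < rate / 2] = 0.0
    out[(mask >= rate / 2) & (mask < rate)] = 1.0
    return out

def psnr(x: torch.Tensor, ref: torch.Tensor, peak: float = 1.0) -> float:
    # Peak signal-to-noise ratio in dB against a reference image.
    mse = ((x - ref) ** 2).mean()
    return float(10 * torch.log10(peak ** 2 / mse))

torch.manual_seed(0)
img = torch.rand(3, 64, 64)
noisy = salt_and_pepper(img, rate=0.6)   # 60% corruption, as in Figure 6.1
```

PSNR of a reconstruction is then simply `psnr(reconstruction, img)`; higher is better, and the best value over the training run is what the plots in this chapter report.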
Figure 6.1: From top left to bottom right: the top row shows (a) the ground-truth image Lena and (b) its noisy counterpart at a 60% corruption level with salt-and-pepper noise. The bottom row shows (c) the real image, same as (a), (d) the noisy image, same as (b), and (e) the image reconstructed using our approach.
Figure 6.2: From top to bottom: (a) the PSNR plot corresponding to Figure 6.1, and (b) the loss plot (l1 loss) for the reconstruction process using our approach.
Figure 6.3: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at an 80% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using our approach. The best PSNR achieved is 21.8671 dB. As we can see, performance degrades at higher corruption levels.
Figure 6.4: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the DIP-l1 approach [10]. The best PSNR achieved is 28.548 dB.
Figure 6.5: From top left to bottom right: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the original image, same as (a), (d) the noisy image, same as (b), and (e) the image reconstructed using our approach. The best PSNR achieved is 29.2449 dB.
Figure 6.6: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the You et al. [11] approach (width = 128). The best PSNR achieved is 28.9 dB.
Figure 6.7: From top left to bottom: the top row shows (a) the ground-truth image F16-GT and (b) its noisy counterpart at a 50% corruption level with salt-and-pepper noise. The bottom row shows (c) the image reconstructed using the You et al. [11] approach (width = 128) with the Adam optimizer (unless mentioned, the optimizer is SGD). The reconstructed image becomes noisier as training continues, because the network starts learning noise in the absence of early termination.
Figure 6.8: The PSNR plot corresponding to Figure 6.7. It clearly demonstrates the need for early termination; the best PSNR achieved is 28.14 dB before the dip.
Figure 6.9: Best-PSNR plots at different corruption levels for each of the three methods discussed in the last chapter. The line plots show the best PSNR reached by each method at each corruption rate. The plot supports our observation that with increasing corruption rate, the best PSNR reached by all three methods decreases. Our method is also the most consistent and stable performer of the three.
Figure 6.10: The effect of the number of linear layers on our method as the number goes from 3 to 9. Over three independent trials, the average performance is depicted by the dashed line; the number of epochs needed to reach the highest PSNR decreases when the number of layers increases from 3 to 6, and increases slightly from 6 to 9. The corruption level in all three trials is 50%.
Chapter 7

Conclusion and Discussion

In this thesis work, we aimed to study the fundamental theory of image restoration tasks, their causes, and their artifacts. We started with the mathematical formulation of basic image restoration tasks and dived into the conceptual theory of popular deep learning methods such as CNNs, autoencoders, and GANs, which are the building blocks of many state-of-the-art denoising algorithms. We also reviewed different noise models, their causal effect on restoration tasks, and the relevant perceptual quality measures used to assess recovery performance, especially for image denoising. We chronologically categorized the existing work in the image denoising field, analyzing the shortcomings of classical spatial-domain methods that led to transform-domain methods, followed by the current family of state-of-the-art methods based on learned priors with deep neural networks. In chapter 4, we introduced the seminal work by Ulyanov et al. [10] on deep image priors, which is the foundation of current learning-free untrained-network methods. This work led to a shift in perspective from conventional theory and motivated newer research ideas such as [122] and [11]. We saw that despite deep image priors' performance being competitive with non-local methods like BM3D, a number of non-trivial limitations hinder reliable adoption of this technology in practical scenarios, for example the need for early stopping. There was also an incomplete understanding of the method's generalization performance and its quantification. Since 2017, we have seen growing understanding of why CNNs and over-parameterized networks generalize so well, and of the hardness of neural networks. Since then, many have come up with interesting
ideas for improving the untrained-network paradigm and making it more robust to its limitations. [11] proved that the algorithmic bias of discrepant learning rates in a doubly over-parameterized network eliminates any need for tuning learnable parameters or early stopping. We combined this work with [141] to propose a new method for single-image denoising, and argued why this methodology might work and be a step towards a generalized image denoising algorithm.

There are many limitations to this work. While we suggest a methodology exploiting the best of the two ideas discussed in chapter 5, it is not without shortcomings. As we saw in [10], there is an explicit need for early stopping. This problem does not exist for [11], but only if we use gradient descent as the optimizer; with the Adam optimizer the need for early stopping remains, and this is also reflected in our proposed method. Another major disadvantage is that [11] and [10] report results for impulse and Gaussian noise models, i.e., sparse corruptions (the results can be extrapolated to shot and speckle noise models), but these methods may not work for other, non-sparse corruption types such as defocus blur, elastic noise, etc. [151]. The DIP paper argues that its method can handle complex noise models in a blind denoising setting, but experimental analysis shows the results are not competitive with classical methods such as BM3D. Further, with more complex noise models, SGD as an optimizer might not work, leaving us with a need for early stopping yet again. For our proposed algorithm, we still need to explore its failure modes under more noise models and corner cases, and to find a way to mitigate overfitting. The results presented in the last chapter were part of a preliminary investigation and do not demonstrate the generalization capacity of the algorithm.
Thus, we proposed a method that eliminates the need for learnable-parameter tuning and early stopping, and gives better results than DIP or [11] for simple noise models. However, our method is still not robust to all 19 types of noise models in [151]. Many recent works suggest interesting ideas, such as TV-regularized DIP [71] or Bayesian DIP [122], and there is definite potential in combining these methods with our approach of deep linear autoencoders with skip connections, the algorithmic bias of discrepant learning rates, and SGD. It thus remains an open challenge to develop a generalized algorithm that handles multiple corruptions in a stable manner.
References

[1] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[2] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, PP, 08 2016.
[3] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[4] Fahad Shamshad, Muhammad Awais, Muhammad Asim, Zain ul Aabidin Lodhi, Muhammad Umair, and Ali Ahmed. Leveraging deep Stein's unbiased risk estimator for unsupervised X-ray denoising, 2018, 1811.12488.
[5] Wikipedia. Artificial neural network. https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Neuron3.png.
[6] Malcolm Sambridge. An introduction to inverse problems. http://web.gps.caltech.edu/classes/ge193.old/lectures/Lecture1.pdf.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015, 1512.03385.
[8] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders, 2021, 2003.05991.
[9] Marc Lebrun. An analysis and implementation of the BM3D image denoising method. Image Processing On Line, 2:175–213, 08 2012.
[10] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. International Journal of Computer Vision, 128(7):1867–1888, Mar 2020.
[11] Chong You, Zhihui Zhu, Qing Qu, and Yi Ma. Robust recovery via implicit bias of discrepant learning rates for double over-parameterization, 2020, 2006.08857.
[12] Joanna J. Bryson. The past decade and future of AI's impact on society.
[13] Jack Copeland. The Cyc project.
[14] Rupali Ahuja Rajshree. A general review of image denoising techniques.
[15] Mukesh Motwani, Mukesh Gadiya, Rakhi Motwani, and Frederick Harris. Survey of image denoising techniques. 01 2004.
[16] S M A Sharif, Rizwan Ali Naqvi, and Mithun Biswas. Learning medical image denoising with deep dynamic residual attention network. Mathematics, 8(12), 2020.
[17] Dang Thanh, Surya Prasath, and Hieu Le Minh. A review on CT and X-ray images denoising methods. Informatica, 43:151–159, 06 2019.
[18] Wikipedia. Deep learning. https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks.
[19] Awan-Ur-Rahman. What is artificial neural network and how it mimics the human brain?
[20] MIT. Inverse problems. http://web.mit.edu/2.717/www/inverse.html.
[21] Encyclopedia of Mathematics. Ill-posed problems. https://encyclopediaofmath.org/wiki/Ill-posed_problems.
[22] Wikipedia. Inverse problem. https://en.wikipedia.org/wiki/Inverse_problem#:~:text=An%20inverse%20problem%20in%20science,measurements%20of%20its%20gravity%20field.
[23] Tom Tirer and Raja Giryes. Image restoration by iterative denoising and backward projections. IEEE Transactions on Image Processing, PP, 10 2017.
[24] Y. Bengio. Learning deep architectures for AI. Foundations, 2:1–55, 01 2009.
[25] Juergen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61, 04 2014.
[26] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad Alsaadi. A survey of deep neural network architectures and their applications. Neurocomputing, 234, 12 2016.
[27] Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost, 2014, 1412.1710.
[28] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks, 2015, 1505.00387.
[29] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., USA, 1995.
[30] W. Venables and B. Ripley. Modern Applied Statistics with S, fourth edition. 2002.
[31] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014, 1406.2661.
[32] Wikipedia. Generative adversarial network. https://en.wikipedia.org/wiki/Generative_adversarial_network.
[33] Jason Brownlee. A gentle introduction to generative adversarial networks (GANs). https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/.
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs, 2016, 1606.03498.
[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018, 1611.07004.