UNIVERSIDAD DE LOS ANDES
DEPARTMENT OF BIOMEDICAL ENGINEERING
Generative Adversarial Networks for Robust Medical
Image Analysis
A THESIS PRESENTED FOR THE DEGREE OF MASTER OF SCIENCE
by
Maria Camila ESCOBAR PALOMEQUE
Under the supervision of
Dr. Pablo Andrés ARBELÁEZ
Members of the Qualifying Examination Committee
Dr. Marcela HERNÁNDEZ, Universidad de Los Andes
&
Dr. Mario Andrés VALDERRAMA, Universidad de Los Andes
December 8, 2020
Generative Adversarial Networks for Medical Image
Analysis
Maria Camila Escobar
M.Sc. Student
Bogotá, Colombia
mc.escobar11@uniandes.edu.co
Abstract—Deep Learning models have been widely used for medical imaging tasks such as segmentation. However, these models tend to perform poorly when applied to images that do not resemble the training dataset distribution. Thus, the robustness of medical segmentation models can be affected by external factors, such as the quality of the input image, or by synthetic modifications, such as adversarial attacks. In this work we present two novel approaches that use Generative Adversarial Networks to increase robustness in medical segmentation. First, we present UltraGAN, a method to improve the quality robustness of ultrasound segmentation. Second, we present MedRobGAN, a method to generate adversarial examples that can later be used to improve the adversarial robustness of various 3D segmentation tasks. We validate the effectiveness of our methods through extensive experiments and provide a comprehensive analysis of how Generative Adversarial Networks can improve medical segmentation tasks.
Index Terms—Deep Learning, Medical Segmentation, Generative Adversarial Networks, Robustness, Image Quality, Adversarial Attacks.
I. INTRODUCTION
The analysis and processing of medical images has been
one of the greatest technological advances in medical practice.
Nowadays there are different image acquisition techniques
such as magnetic resonance imaging (MRI), computed tomography (CT), X-rays, and ultrasonography, among others.
These techniques are used for various applications, from the
diagnosis of a specific pathology to the planning of surgeries
with a high level of complexity. With the increasing use of
Deep Learning (DL) methods in all types of computer vision
applications, there has been a surge in using these methods for
medical imaging [1]. The biomedical community in general
is accepting the use of these new techniques thanks to the
potential they have to interpret medical images and create
relevant representations in problems such as classification,
detection or segmentation.
Even though DL models are becoming increasingly accurate at the tasks they are trained for, they often fail when applied to images with slightly different characteristics. A model's ability to maintain good performance on images with diverse characteristics is known as robustness. Because medical datasets usually follow standardized acquisition protocols, they tend to include only images taken by a few expert physicians, most likely with the same acquisition device. DL medical models trained on such data tend to perform poorly when evaluated on real-life data from any physician or from a different brand of acquisition device. Thus, current DL medical models may not be robust to variations in the quality of the data.
Recently, a new type of robustness assessment, known as ad-
versarial attacks [46], has been studied by the computer vision
community. This assessment is based on adversarial examples,
which are almost imperceptible intensity perturbations of the original image, specifically designed to trick the model into failure cases. For the task
of medical semantic segmentation, recent studies [6] support
that adversarial examples greatly hurt state-of-the-art models.
These results highlight the importance of developing methods
that are robust to any type of perturbation.
Generative Adversarial Networks (GANs) [21] are a type of generative model that learns the statistical representation of the training data. GANs have successfully tackled image-
to-image translation problems [14], [22], [36], including but
not limited to: image colorization [18], super resolution [34],
multi-domain and multimodal mappings [32], and image en-
hancement [17]. In the medical field, several works have
introduced GANs into their approach for tasks that include
data augmentation [15] and image synthesis [7], [8], [35].
GANs are able to learn feature representations from training images. Thus, the training process of a GAN can be optimized to learn robust features from the images and to generate new examples that can help with robustness to adversarial attacks [9] or to classical perturbations in the images [25], [26], [28].
In this work we present two different approaches to increas-
ing robustness in medical segmentation by using GANs. First,
we present UltraGAN [2], a novel framework for ultrasound
image enhancement through adversarial training. Our method
receives as input a low-quality image and performs high-
quality enhancement without compromising the underlying
anatomical structures of the input. The use of the images
generated by UltraGAN improves the robustness to quality
in a state-of-the-art medical segmentation model. Second, we
create a Medical Robust GAN (MedRobGAN) that generates
3D adversarial examples by optimizing the change in appear-
ance of different anatomical structures. Using the volumes
generated by MedRobGAN as additional data for medical
segmentation models can improve the adversarial robustness
in some datasets.
II. RELATED WORK
A. Neural Networks
Artificial Neural Networks (ANNs) are computational systems partially inspired by biological neural networks [3]. Their goal is to receive inputs and find the nonlinear relationships between them in order to predict the corresponding outputs. ANNs have processing units, called neurons, that receive an input and multiply it by the neuron's weight. These neurons are connected in layers that sum up the outputs given by each neuron and pass the result through a non-linear activation function to obtain a final output. The number of neurons and layers in an ANN depends on the problem it is trying to solve. The learning process of an ANN consists of finding the right weights for each neuron so that the network transforms the input into the desired output. Eq. 1 describes the output of a neuron, where x_i is the activation of the previous neuron, w_i is the weight that the current neuron gives to each x_i, b is an additional bias added to the output of the current neuron, and f is a non-linear activation function. The final output of the current neuron is y.

y = f\left(\sum_i x_i w_i + b\right) \quad (1)
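As a concrete illustration, the following minimal NumPy sketch computes Eq. 1 for a single neuron; the ReLU non-linearity and the example values are illustrative choices, not part of the thesis implementation.

    import numpy as np

    def neuron_output(x, w, b):
        """Eq. 1: weighted sum of the previous activations plus a bias,
        passed through a non-linear activation f (here ReLU as an example)."""
        z = np.dot(x, w) + b           # sum_i x_i * w_i + b
        return np.maximum(0.0, z)      # f(z): ReLU, one common choice

    # Example: a neuron with three inputs
    x = np.array([0.5, -1.2, 3.0])     # activations x_i from the previous layer
    w = np.array([0.8, 0.1, -0.4])     # weights w_i of the current neuron
    y = neuron_output(x, w, b=0.2)     # final output y of the neuron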
For the task of medical image segmentation, the input of
the ANN is usually a 2D or 3D matrix with the information
of the image or volume and the output is a segmentation of a
region of interest from the input. The learning process of the network consists of comparing the output with the groundtruth information and modifying the individual weights and biases of the neurons in order to get closer to the desired output. This process is done iteratively, and the final result depends heavily on choosing an appropriate loss function for comparing the ANN's output with the groundtruth.
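A minimal PyTorch sketch of this iterative process is shown below; the toy network, the binary cross-entropy loss, and the dummy data are illustrative assumptions rather than the exact setup used in this work.

    import torch
    import torch.nn as nn

    # Toy stand-in for a segmentation network: 2D image in, per-pixel logit out.
    net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 3, padding=1))
    criterion = nn.BCEWithLogitsLoss()                   # compares output with groundtruth
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

    images = torch.randn(4, 1, 64, 64)                   # dummy input batch
    masks = torch.randint(0, 2, (4, 1, 64, 64)).float()  # dummy groundtruth masks

    for step in range(100):                              # iterative learning process
        optimizer.zero_grad()
        loss = criterion(net(images), masks)             # distance to the desired output
        loss.backward()                                  # gradients w.r.t. weights and biases
        optimizer.step()                                 # update weights to reduce the loss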
B. Adversarial attacks
The result given by an ANN can be extremely sensitive
to small modifications in the input image. These changes, known as adversarial perturbations [50], are imperceptible to the human eye but are able to fool top-performing systems by dropping their performance virtually to zero [39]. There is an
increasing amount of works regarding adversarial attacks and
defenses for diverse tasks [4], [5], [37], [42], [46], [47], [50],
[52].
Due to their direct impact on human health, medical image
systems are of special importance to research in adversarial
robustness. There are several works studying robustness in the
medical domain for tasks such as image classification [45]
and image segmentation [38], [41], [43], [44], [48]. Recently,
Daza et al. [6] developed a comprehensive framework for
adversarial robustness in the task of 3D medical segmentation.
Their approach includes a set of possible attacks that can be
done to reduce the performance of a 3D segmentation model
as well as a new model for general medical segmentation that
is robust to adversarial perturbations. For our Medical Robust
GAN we use the implementation of the attacks and the 3D
segmentation model from [6].
C. Generative Adversarial Networks
GANs consist of a generator, which is in charge of producing new images, and a discriminator, which is in charge of discerning between real data from the existing database and new data created by the generator. The learning process of a GAN is a min-max game between the generator and the discriminator: the generator tries to fool the discriminator by producing realistic-looking images, while the discriminator tries to identify the synthetic images created by the generator.
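The following PyTorch sketch illustrates one iteration of this min-max game for a generic generator G and discriminator D; the binary cross-entropy formulation and all names are illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
        """One min-max iteration: D learns to separate real from generated
        data, then G learns to fool D (a generic sketch)."""
        z = torch.randn(real.size(0), z_dim)
        fake = G(z)

        # Discriminator step: label real data 1 and generated data 0.
        opt_d.zero_grad()
        real_logits = D(real)
        fake_logits = D(fake.detach())          # detach: do not update G here
        d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                  + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
        d_loss.backward()
        opt_d.step()

        # Generator step: try to make D classify generated images as real.
        opt_g.zero_grad()
        gen_logits = D(fake)
        g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
        g_loss.backward()
        opt_g.step()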
For the task of ultrasound quality robustness, some au-
tomated methods have been developed for image enhance-
ment [19], [23], [24], [31]. Liao et al. [28] proposed a quality
transfer network to enhance ultrasound images. The algorithm
was tested in echo view classification, showing that quality
transfer improves the performance. Additionally, Lartaud et
al. [26] trained a convolutional network for data augmentation
by changing the quality of the images to create contrast
and non-contrast images for segmentation. The augmented
data improved their segmentation method. On the same note,
Jafari et al. [25] trained a model to transform quality between
ultrasound images. The approach introduced a segmentation
network in the training of the GAN to provide an anatomi-
cal constraint added by the segmentation task. Nevertheless, these methods were developed and evaluated on private data, which makes a direct comparison difficult.
Finally, for the task of medical segmentation adversarial
robustness, Chen et al. [16] combined a GAN with a Varia-
tional Autoencoder to generate images with deformations and
appearance changes that can be used to attack medical segmen-
tation models. The generated adversarial examples included
geometrical deformations as well as intensity variations. They
were able to attack a 2D medical segmentation network.
However, they did not use the generated images to improve
the robustness of their segmentation model. In contrast, our
Medical Robust GAN is able to generate 3D adversarial attacks
and then use the generated data to increase the robustness of
the segmentation models.
III. METHOD
In this section we explain our two approaches for gen-
erating images that can increase the robustness of medical
segmentation methods. First we explain UltraGAN, a method
for improving quality robustness, and then MedRobGAN, a
method for improving adversarial robustness.
A. UltraGAN
Our method named UltraGAN consists of a Generative
Adversarial Network designed to enhance the quality of ul-
trasound images without compromising underlying anatomical
information.
Fig. 1. Overview of our generation scheme for UltraGAN. We add a frequency consistency loss to preserve fine details and coarse structures. We concatenate
the segmentation map along with the input image for the discriminator to classify as real or enhanced. This particular case corresponds to an enhanced input.
1) Problem formulation: We have a set of low-quality ultrasounds \{l_i\}_{i=1}^{N} \in L with a data distribution l \sim p_{data}(l) and a set of high-quality ultrasounds \{h_i\}_{i=1}^{N} \in H with a data distribution h \sim p_{data}(h). Our main objective is to learn mapping functions that translate from the low-quality to the high-quality domain and vice versa. Thus, we have a generator for each domain translation, G_H : L \to H and G_L : H \to L. We also have two discriminators: D_H distinguishes between real high-quality images h_i and generated high-quality images G_H(l_i), and D_L distinguishes between real low-quality images l_i and generated low-quality images G_L(h_i).
We want to preserve the structural information from the
original image. Therefore, we include the segmentation of
the anatomical regions of interest for the high-quality (s_h) or for the low-quality (s_l) ultrasound as additional input for the
discriminators.
2) Model: Our generator (G) builds upon the CycleGAN architecture [36], the most commonly used framework for image-to-image translation. CycleGAN consists of
down-sampling layers, followed by residual blocks and up-
sampling layers. For the discriminator (D), we build upon
PatchGAN [22], [36] that breaks down the image into different
patches and then learns to predict if those patches are real or
generated. Our discriminator has two inputs: the ultrasound
image (whether real or generated) and the corresponding
segmentation of the anatomical regions of interest.
3) Loss functions: Finding the appropriate loss function
is a critical task for image generation because the problem
must have enough constraints to create images similar to the
desired domain. For UltraGAN, we use an identity loss and we
alter the traditional adversarial and cycle consistency losses to
create an anatomically coherent adversarial loss and frequency
cycle consistency losses.
Anatomically Coherent Adversarial Loss: The goal of
the adversarial loss is to make the generated images resemble
the distribution of the real dataset. Inspired by the idea of
conditional GANs [29] and pix2pix [22], we modify the
adversarial loss to include as input the segmentation of the
anatomical regions of interest. For the high-quality translation
networks GH and DH our anatomically coherent adversarial
loss is defined as:
L_{adv}(G_H, D_H) = \mathbb{E}_{h \sim p_{data}(h)}[\log D_H(h, s_h)] + \mathbb{E}_{l \sim p_{data}(l)}[\log(1 - D_H(G_H(l), s_l))] \quad (2)
This loss helps the networks to learn the underlying relation-
ship between the anatomical regions of interest and the struc-
tures in the generated image. Furthermore, the segmentation
is not necessary at test time, since we only use the generator.
In our training process we also consider the adversarial loss
for low-quality translation GL and DL.
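In code, the discriminator input of Eq. 2 is a channel-wise concatenation of the image and its segmentation map, as in Fig. 1. The sketch below assumes a PatchGAN-style discriminator returning logits and uses the binary cross-entropy form of the adversarial loss; both are assumptions about the implementation.

    import torch
    import torch.nn.functional as F

    def anatomically_coherent_adv_loss(D_H, G_H, h, s_h, l, s_l):
        """Sketch of Eq. 2: D_H sees the ultrasound image concatenated
        channel-wise with its segmentation map (as in Fig. 1)."""
        real_pair = torch.cat([h, s_h], dim=1)       # real high-quality image + segmentation
        fake_pair = torch.cat([G_H(l), s_l], dim=1)  # enhanced low-quality image + segmentation
        real_logits = D_H(real_pair)
        fake_logits = D_H(fake_pair)
        # BCE form of  log D_H(h, s_h) + log(1 - D_H(G_H(l), s_l))
        return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))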
Frequency Cycle Consistency: The cycle consistency
loss [36] is one of the most important losses in image-to-image
translation because it allows training without paired images.
However, the cycle consistency constraint is a pixel-wise L1-
norm between the original image (l) and the reconstruction
(GL(GH(l))), which enforces the output to have exactly the
same intensities. Yet, during the process of ultrasound quality
enhancement, it is more useful to think of the image in terms
of frequency rather than intensity [20]. As can be seen in Fig. 2, low frequencies contain the coarse information of an image, while high frequencies contain the fine details.
With this concept in mind, we create two types of frequency
consistency losses to improve quality enhancement.
Fig. 2. Frequency extraction of ultrasound image. The low frequency image
contains the coarse anatomical structures while the high frequency image
contains detailed information of the division between the organs.
During high-quality translation, we aim to preserve the
anatomical information present in the low frequencies of the
original image. To extract low frequencies, we pass the images
through a Gaussian pyramid [30] φ at K = 3 scales, then
compute the L1-norm between the structural information of
the original and the generated image (Eq. 3). Our generators
transfer image details of the high-quality domain in the form
of high frequencies. Therefore, we obtain those frequencies
through a Laplacian pyramid [30] γ at K = 3 scales and
calculate the L1-norm between the high frequencies of the
original image and the high frequencies of the reconstruction
(Eq. 4). The loss concept is better illustrated in Fig. 1.
L_{lf}(G_H) = \sum_{k=1}^{K} \|\phi_k(l) - \phi_k(G_H(l))\|_1 \quad (3)

L_{hf}(G_H, G_L) = \sum_{k=1}^{K} \|\gamma_k(l) - \gamma_k(G_L(G_H(l)))\|_1 \quad (4)
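The sketch below illustrates how these losses can be computed in PyTorch, using a 5x5 binomial kernel as a Gaussian approximation and average pooling for downsampling; the exact filter and pooling choices are assumptions, not the thesis implementation.

    import torch
    import torch.nn.functional as F

    def _blur(x):
        """Depthwise 5x5 binomial filter, a standard Gaussian-pyramid kernel."""
        k1d = torch.tensor([1., 4., 6., 4., 1.], dtype=x.dtype, device=x.device)
        k2d = torch.outer(k1d, k1d)
        k2d = (k2d / k2d.sum()).view(1, 1, 5, 5).repeat(x.size(1), 1, 1, 1)
        return F.conv2d(x, k2d, padding=2, groups=x.size(1))

    def pyramids(x, K=3):
        """Gaussian pyramid phi_k (low frequencies) and Laplacian
        pyramid gamma_k (high frequencies) of an image batch."""
        phi, gamma = [], []
        cur = x
        for _ in range(K):
            low = _blur(cur)
            phi.append(low)                      # coarse structures at this scale
            gamma.append(cur - low)              # fine details removed by the blur
            cur = F.avg_pool2d(low, 2)           # downsample for the next scale
        return phi, gamma

    def frequency_losses(l, enhanced, reconstruction, K=3):
        """Eq. 3: L1 between low frequencies of l and G_H(l).
        Eq. 4: L1 between high frequencies of l and G_L(G_H(l))."""
        phi_l, gamma_l = pyramids(l, K)
        phi_e, _ = pyramids(enhanced, K)
        _, gamma_r = pyramids(reconstruction, K)
        L_lf = sum(F.l1_loss(a, b) for a, b in zip(phi_l, phi_e))
        L_hf = sum(F.l1_loss(a, b) for a, b in zip(gamma_l, gamma_r))
        return L_lf, L_hf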
Identity Loss: The identity loss is particularly useful for the application of quality enhancement in real-life clinical scenarios because it ensures that the generator does not modify images from the same domain. In real-life applications we will not have a quality label for each ultrasound image, but we would still want to perform quality enhancement without damaging images that already have high quality. We achieve this by using an L1-norm between the original high-quality image and the same image after going through the quality enhancement process.

L_{idt} = \|h - G_H(h)\|_1 \quad (5)
Overall Loss: Our overall loss is defined as the weighted
sum of the losses in both pathways H → L and L → H, where
each λ represents the relative importance of each loss function
in the system. Even though we train two generators and two
discriminators, at inference time we only use the generator
GH since our goal is to improve the quality of all ultrasound
images.
L_{UltraGAN} = \lambda_{adv} L_{adv} + \lambda_{lf} L_{lf} + \lambda_{hf} L_{hf} + \lambda_{idt} L_{idt} \quad (6)
B. Medical Robust GAN
With our Medical Robust GAN (MedRobGAN) we aim to
generate adversarial examples that are particularly challenging
for the segmentation network. These hard examples can be
later used for finetuning the segmentation network with the
expectation of improving the overall robustness. Our method
is divided into two main parts: generation and adversarial
optimization.
1) Generation: In this stage we adapt SEAN [10], a state-of-the-art GAN for 2D image generation, to make it suitable for 3D volume generation. Because 3D volumes require more memory than 2D images, we reduce the number of parameters in SEAN to make it feasible to train. We keep the same training scheme and loss functions as the original SEAN model, with the exception of the perceptual loss, which we do not implement. The principal advantage of
using SEAN as our baseline is that it allows us to control the
style of different parts of the volume if there is a segmentation
mask to go along with it. Thus, for medical applications we
are able to modify only the part of the volume that includes
a certain organ or a tumor and we can copy the styles from
one volume to another. This style manipulation is achieved
through modulation parameters that modify the mean and the
variance of each pixel by using the segmentation information
after batch normalization [13]. The entire normalization layer
[10] can be seen in Eq. 7, where x is the activation of the previous convolutional layer, η is additional noise added to increase variability, γ and β are the modulation parameters, and µ_c and σ_c are the channel-wise mean and standard deviation used for batch normalization.

\gamma \cdot \frac{(x + \eta) - \mu_c}{\sigma_c} + \beta \quad (7)
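A minimal PyTorch sketch of this normalization layer is shown below. In SEAN, γ and β are predicted from the segmentation mask and per-region style codes; here they are passed in directly for brevity, and the noise scale is an illustrative assumption.

    import torch
    import torch.nn as nn

    class ModulatedNorm3d(nn.Module):
        """Sketch of the normalization of Eq. 7: batch-normalize the activation
        plus noise, then denormalize with modulation parameters gamma and beta."""
        def __init__(self, channels):
            super().__init__()
            # affine=False yields exactly (input - mu_c) / sigma_c per channel
            self.bn = nn.BatchNorm3d(channels, affine=False)

        def forward(self, x, gamma, beta, noise_scale=0.1):
            eta = noise_scale * torch.randn_like(x)  # eta: extra noise for variability
            normalized = self.bn(x + eta)            # ((x + eta) - mu_c) / sigma_c
            return gamma * normalized + beta         # style modulation from the segmentation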
2) Adversarial optimization: For the adversarial optimiza-
tion stage we use the generator of our Robust Medical GAN
with the already trained weights (U) and a pretrained seg-
mentation model (V). As we mentioned before, a critical step for style-based volume generation is the set of modulation parameters in the normalization block. Thus, we aim to find,
through adversarial optimization, the value of η, γ and β for
U that will generate a volume that is hard to segment by V.
The main advantage of optimizing modulation parameters is
that we can modify an object’s appearance while keeping the
original pose. For example, we can modify a liver by changing
the style to something less identifiable by V, but we retain the
spatial and geometrical information so that the generated organ
does not deviate from the original class. Fig. 3 illustrates the
adversarial optimization process.
Fig. 3. Overview of the adversarial optimization process in MedRobGAN. Both the original image and the groundtruth are inputs to the GAN.
To find the optimal η, γ and β we use Projected Gradient Descent (PGD) [46] with the steepest descent under the L∞-norm. The process is shown in Eq. 9: we start from a noise δ that is added to each of the three parameters, and we pass the original volume (j) through the generator U. The output of U is a new volume (z) with a different style given by γ + δ, β + δ and η + δ. Then, we calculate the loss function between the groundtruth segmentation (s) and the output of V(z). Since our goal is to generate the volume that confuses V the most, we move δ in the direction that maximizes the loss, which is equivalent to moving it in the direction of the sign of the gradient, scaled by a step size α. Additionally, to control the effect of δ we include the standard constraint ε = 8/255, meaning that δ cannot be higher than ε or lower than −ε. This optimization process is done iteratively for k steps to find the hardest example possible. The more steps we add to the adversarial optimization, the harder it is for V to segment z, but there is also a higher probability of generating a volume with an unrelated style that would no longer be useful for training. This problem happens when the adversarial optimization finds modulation parameters that result in a hard example but are visually unrealistic or very different from the original dataset. Thus, when we use such images for finetuning, they confuse the network rather than help with the adversarial robustness. We have to find a trade-off between how hard the example is and how real it looks.

z = U(j, s, \delta) \quad (8)

\delta = \delta + \alpha \cdot \mathrm{sign}\left(\nabla_{\delta} L(V(z), s)\right) \quad (9)
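The sketch below illustrates this optimization loop in PyTorch, assuming a generator interface U(j, s, delta) that applies the perturbation to its modulation parameters and a segmentation model V returning per-voxel logits; both interfaces, and all names, are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def adversarial_styles(U, V, j, s, param_shape, k=10, alpha=2/255, eps=8/255):
        """Sketch of Eqs. 8-9: find a perturbation delta of the modulation
        parameters that maximizes the segmentation loss of V on U's output.
        s holds integer class labels per voxel."""
        delta = torch.zeros(param_shape, requires_grad=True)
        for _ in range(k):
            z = U(j, s, delta)                       # Eq. 8: restyled volume
            loss = F.cross_entropy(V(z), s)          # loss between V(z) and groundtruth
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += alpha * grad.sign()         # Eq. 9: steepest ascent under L_inf
                delta.clamp_(-eps, eps)              # keep delta within [-eps, eps]
        return U(j, s, delta.detach())               # hardest example found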
IV. EXPERIMENTS
A. UltraGAN
1) Dataset: To validate UltraGAN, we use the publicly
available “Cardiac Acquisitions for Multi-structure Ultrasound
Segmentation” (CAMUS) dataset [27]. The CAMUS dataset
contains 2D Ultrasound images and multi-structure segmenta-
tions of 450 patients. Each of the ultrasound images in the CA-
MUS dataset comes with a quality-assessment given by expert
physicians. In addition, the CAMUS dataset includes pathological
patients that have different left ventricle ejection fractions,
making it a realistic problem in which not all the anatomical
structures are perfect. The task in the CAMUS dataset is
to segment the left ventricular endocardium (LVEndo), left
ventricular epicardium (LVEpi) and left atrium (LA) in two
chamber (2CH) and four chamber (4CH) views for End of
Diastole (ED) and End of Systole (ES).
2) Experimental Setup: We train UltraGAN with 80% of
the images and evaluate on the remaining 20%. In the dataset,
each image can be labeled as high-quality, medium-quality
or low-quality. However, for our experiments we consider medium-quality images in the same group as low-quality images, which allows us to be stricter during the ultrasound enhancement process. To ensure that every component of our
networks has a relevant contribution, we enhance low-quality
ultrasound images using three variants of our system for the
ablation experiments:
• Without anatomically coherent adversarial loss.
• Without frequency cycle consistency losses.
• Without anatomically coherent or frequency cycle consistency losses (CycleGAN).

TABLE I
CHARACTERISTICS OF THE CAMUS DATASET

Characteristic           | Number of patients
High-quality             | 198
Medium/Low-quality       | 252
EF within standard range | 141
EF lower than 45%        | 222
EF higher than 55%       | 87
Nevertheless, the evaluation of image quality in an unpaired setup is a subjective process, and performing perceptual studies would require expert physicians to spend their time analyzing which ultrasound images were correctly enhanced. Instead, we make the assumption that, as in real life, it is easier to identify anatomical structures in high-quality images than in low-quality images. Therefore, we use multi-structure segmentation as a downstream quantitative metric.
We train a U-Net model [33] for segmentation and evaluate using 10-fold cross-validation splits of the CAMUS dataset as
done in [27]. Then, we use UltraGAN to enhance the quality
of all the training images and train the same U-Net with the
original images as well as the enhanced augmentation. We
compare the segmentation results by using the Dice score
between the groundtruth (s) and the predicted segmentation
(p) for each anatomical structure.
\mathrm{Dice} = \frac{2\,|s \cap p|}{|s| + |p|} \times 100 \quad (10)
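A minimal NumPy sketch of Eq. 10 for binary masks follows; the convention for the empty-mask case is an assumption.

    import numpy as np

    def dice_score(s, p):
        """Eq. 10: Dice overlap (x100) between groundtruth mask s and prediction p."""
        s, p = s.astype(bool), p.astype(bool)
        denom = s.sum() + p.sum()
        if denom == 0:
            return 100.0           # convention: both empty counts as perfect agreement
        return 200.0 * np.logical_and(s, p).sum() / denom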
B. Medical Robust GAN
1) Dataset: Since our goal is to generate adversarial exam-
ples that can increase the robustness of a medical segmentation
model, we evaluate our framework on the most challenging di-
agnostic problems. This includes seven tasks from the Medical
Segmentation Decathlon (MSD) [49], the standard benchmark for medical image segmentation, and the Kidney
and Kidney Tumor Segmentation dataset (KiTS) [11]. Each
dataset includes 3D volumes (MRI or CT) and a segmentation
with the desired organs and lesions.
TABLE II
CHARACTERISTICS OF THE 8 CHOSEN DATASETS TO ASSESS ADVERSARIAL ROBUSTNESS.

Dataset        | Modality | Target                     | Training volumes | Validation volumes
Heart          | MRI      | Left atrium                | 16               | 4
Liver          | CT       | Liver and tumor            | 94               | 37
Hippocampus    | MRI      | Hippocampus head and body  | 208              | 52
Pancreas       | CT       | Pancreas and tumor         | 225              | 57
Hepatic Vessel | CT       | Hepatic vessels and tumor  | 212              | 91
Spleen         | CT       | Spleen                     | 33               | 8
Colon          | CT       | Colon cancer primaries     | 88               | 38
KiTS           | CT       | Kidney and tumor           | 168              | 42
Fig. 4. Qualitative comparison of the low-quality and enhanced images using UltraGAN. Our method is able to enhance ultrasound images, improving the
interpretability of the heart structures regardless of the view.
2) Experimental Setup: The adversarial examples that our
Medical Robust GAN generates are used as additional data for
finetuning ROG [6], a state-of-the-art medical segmentation
network. The motivation for doing this finetuning is that it
should help the segmentation network to learn more discrim-
inative features and therefore be more robust to adversarial
perturbations. Afterwards, we assess the robustness of ROG
with and without the additional finetuning stage. For better benchmarking, we also compare our method with the Free
Adversarial Training (FreeAT) defense for ROG as shown in
[6]. For the adversarial attack we use AutoPGD-CE [6] that
operates by maximizing the Cross Entropy loss [12] averaged
across all spatial locations. The concept behind this attack
is the same as shown in section III-B2 and we perform
experiments for 5, 10 and 20 iterations. We also evaluate
the performance of the standard and robust methods under
no attack (0 iterations) to obtain a complete comparison. The
metric for evaluating the performance of each method is the
Dice Score.
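This evaluation protocol can be summarized by the following sketch, where attack is a hypothetical stand-in for the AutoPGD-CE implementation of [6], the data loader yields volumes with integer label maps, and all names are illustrative.

    import torch

    def foreground_dice(pred, seg):
        """Binary foreground Dice (x100) for brevity; the paper reports per-structure Dice."""
        p, s = (pred > 0), (seg > 0)
        denom = p.sum().item() + s.sum().item()
        return 100.0 if denom == 0 else 200.0 * (p & s).sum().item() / denom

    def evaluate_robustness(model, loader, attack, budgets=(0, 5, 10, 20)):
        """Mean Dice of the model under an attack run at each iteration budget."""
        results = {}
        for k in budgets:
            scores = []
            for volume, seg in loader:
                x = volume if k == 0 else attack(model, volume, seg, iterations=k)
                with torch.no_grad():
                    pred = model(x).argmax(dim=1)     # predicted label map
                scores.append(foreground_dice(pred, seg))
            results[k] = sum(scores) / len(scores)    # mean Dice at this attack strength
        return results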
Having 8 different datasets allows us to explore the potential
of our Medical Robust GAN under different circumstances.
For each dataset we train a GAN, perform adversarial opti-
mization to generate hard examples and finetune ROG. We
keep the original train and validation division and use the
pretrained weights from [6].
V. RESULTS
A. UltraGAN
1) Image Enhancement: UltraGAN provides an image quality enhancement that is noticeable even to untrained eyes.
Fig. 4 shows the comparison between low-quality images and
the enhanced images we generate. In the enhanced images,
the heart’s chambers are easier to recognize because they have
sharper boundaries. These results are consistent for both 2CH
and 4CH views.
Furthermore, in Fig. 5 we demonstrate that UltraGAN
generates better high-quality images than the traditional Cy-
cleGAN. For our ablation experiment in Fig. 6 we see the
effect that removing one of our components has on quality
enhancement. The images enhanced without the anatomically
coherent adversarial loss maintain finer details, yet the system
tends to hallucinate high frequencies in the left part of the
image. Conversely, if we do not use the frequency cycle
consistency losses, the structure is preserved but the heart regions are not well defined. Overall, with UltraGAN we
are able to create an image quality enhancement that takes into
account frequency and structural information.
Fig. 5. Qualitative comparison between CycleGAN results and UltraGAN.
The images generated by CycleGAN are perceptually similar to the original
low-quality images. In contrast, images enhanced by UltraGAN show a clear
difference between anatomical structures.
Fig. 6. Ablation examples of UltraGAN. We show the original poor-quality image and the results obtained with each variant of the system: without frequency consistency, without anatomical coherence, without both, and the full enhanced training result.
2) Multi-structure segmentation: In Fig. 7 we show that
some of the segmentations obtained by using the standard data
have artifacts, while training with UltraGAN-enhanced images
improves the resulting segmentation. Also, Table III shows the
Dice Scores for this experiment. Here we confirm that for each
of the structures present in the ultrasound image, augmenting
the training data with UltraGAN improves the segmentation
results. This improvement is also consistent across all of the
image qualities, suggesting that the baseline with enhanced training data correctly preserves the anatomical structures present in the ultrasound images.
TABLE III
SEGMENTATION RESULTS FOR THE 10-FOLD CROSS-VALIDATION SET COMPARING STANDARD TRAINING VS. TRAINING WITH ULTRAGAN.

           |        High (%)       |       Medium (%)      |        Low (%)
Method     | LVEndo  LVEpi  LA     | LVEndo  LVEpi  LA     | LVEndo  LVEpi  LA
Baseline   | 93.07   86.61  88.99  | 92.02   85.32  88.13  | 90.76   83.10  87.52
Our method | 93.78   87.38  89.48  | 92.66   86.20  88.38  | 91.55   83.75  87.84
Groundtruth Baseline Training Enhanced Training Groundtruth Baseline Training Enhanced Training
Fig. 7. Qualitative results for heart segmentation in the CAMUS dataset by using our enhanced images as data augmentation in the training stage. We present
two different test examples showing the groundtruth (columns 1 and 4), the baseline results (columns 2 and 5) and the improved segmentation (columns 3
and 6).
TABLE IV
SEGMENTATION RESULTS FOR 10-FOLD CROSS-VALIDATION COMPARING THE STATE-OF-THE-ART VS. OUR QUALITY ENHANCED TRAINING.

              |                |        ED (%)         |        ES (%)
Image quality | Method         | LVEndo     LVEpi      | LVEndo     LVEpi
High + Medium | Ours           | 94.40±0.7  86.54±1.2  | 92.04±1.1  87.05±1.4
              | Leclerc et al. | 93.90±4.3  95.40±2.3  | 91.60±6.1  94.50±3.9
Low           | Ours           | 93.00±1.1  83.57±1.9  | 90.10±1.3  83.93±2.7
              | Leclerc et al. | 92.10±3.7  94.70±2.3  | 89.80±5.7  93.67±3.2
We separately evaluate the segmentation of our enhanced images on a subset of the CAMUS dataset consisting of patients at pathological risk, with a left ventricle ejection fraction lower than 45%. We find that,
for pathological cases, the average Dice score (89.5%) is as
good as for healthy patients (89.7%).
Table IV shows the comparison between the state-of-the-art method on the CAMUS dataset and our quality-enhanced method for the High+Medium and Low qualities in the 10-fold cross-validation sets. We do not include the comparison for left atrium segmentation since the authors do not report their performance on that class. [27] uses a modified U-Net network that has more parameters than the U-Net we use. Here we demonstrate that by enhancing the quality of the training images we are able to improve the robustness of the segmentation model towards ultrasounds of different quality. Also, even with a simpler network with fewer parameters, our robust model is able to outperform state-of-the-art approaches in left ventricular endocardium segmentation and to obtain competitive results in left ventricular epicardium segmentation, demonstrating that the inclusion of quality-enhanced images during training can benefit a model's generalization.
B. Medical Robust GAN
Fig. 8 shows the performance under adversarial attack of
different variants of the medical segmentation model for each
task of the 8 datasets. Each graph includes the performance
of the normal segmentation model (blue line) and three
lines corresponding to the performance of the same model
finetuned with different adversarial examples generated by
MedRobGAN. The three versions of adversarial examples
were generated through the optimization explained in section
III-B2 by setting the amount of iterations k to 5 (orange line),
10 (green line) and 20 (red line). The x axis of the graphs in
Fig. 8 denotes the number of iterations done in the adversarial
attacks. Analogous to the optimization stage, a higher number of attack iterations represents a stronger attack on the network.
Thus, the performance of the normal segmentation network
drops when the attack iterations increase. The goal of using
adversarial examples during the finetuning is to increase the
adversarial robustness of the model which can be understood
in the graph as retaining a high Dice score throughout the
attack iterations, i.e. the dashed lines should be higher than
the solid blue line.
We observe that the effectiveness of finetuning with ad-
versarial examples varies greatly depending on the dataset.
For hippocampus, pancreas, hepatic vessel and colon there is
no advantage in using our adversarial examples. However, for
liver, spleen, KiTS and heart there is a significant increase in robustness when finetuning with adversarial examples. We also find that no single number of optimization iterations consistently works best: for example, for the liver dataset the best-performing model was finetuned with 10 iterations, while the best-performing model for the spleen dataset was finetuned with 5 iterations.
[Fig. 8 plots: Dice Score vs. attack iterations (0, 5, 10, 20) for Spleen, Pancreas Organ, Pancreas Lesion, Liver Organ, Liver Lesion, KiTS Organ, KiTS Lesion, Heart, Colon, Hippocampus Body, Hippocampus Head, Hepatic Vessel Organ, and Hepatic Vessel Lesion; each panel compares Normal, Adv5, Adv10, and Adv20.]
Fig. 8. Adversarial robustness for each task of the 8 datasets. We compare the pretrained ROG [6] that only has clean images (blue line) with the finetuned
ROG with 5 (orange line), 10 (green line) and 20 (red line) iterations. We show the results across different attack iterations of APGD-CE.
[Fig. 9 panels: groundtruth, followed by ROG and ROG + MedRobGAN predictions on the clean image and on the adversarial image.]
Fig. 9. Qualitative results of adversarial robustness for Spleen (top row) and heart (bottom row). Using MedRobGAN for data augmentation significantly
increases the accuracy of the prediction in adversarial images.
We report qualitative results in Fig. 9; the top row corresponds to spleen and the bottom row to heart (left atrium). Overall, we observe that the segmentation generated by ROG on the adversarial image is completely inaccurate. However, using MedRobGAN as data augmentation significantly improves the segmentation of the attacked image. Additionally, on the clean image, both ROG and ROG+MedRobGAN obtain comparable results.
We also compare the segmentation model trained with a standard adversarial defense, known as Free Adversarial Training (+FreeAT) [6], against our best adversarial examples (+MedRobGAN). Fig. 10 shows the results of this comparison for our best-performing datasets. We find that the adversarial examples created by our MedRobGAN improve the robustness of the model for liver segmentation to a greater extent than FreeAT. However, for the remaining three datasets, FreeAT gives a better performance than our MedRobGAN. These results suggest that there is still room for improvement in finding adequate training hyperparameters for the GAN and the ideal strength and number of iterations for the adversarial optimization. Nevertheless, this is the first work to use a GAN framework to increase the robustness of generic medical segmentation, and our results so far show that an improvement is possible.
[Fig. 10 plots: Dice Score vs. attack iterations (0, 5, 10, 20) for Spleen, Liver Organ, Liver Lesion, KiTS Organ, KiTS Lesion, and Heart; each panel compares Normal, +MedRobGAN, and +FreeAT.]
Fig. 10. Comparison of adversarial robustness for our best-performing datasets. We compare the pretrained ROG model that only has clean images with two different defense versions of ROG: finetuned with the adversarial examples generated by MedRobGAN (red line) and the FreeAT presented in [6].
VI. CONCLUSION
In this work we present two novel methods that use GANs
to tackle the robustness of medical segmentation models:
• First, we present UltraGAN, a method designed for
quality enhancement of ultrasound images. We achieve
enhancement of 2D echocardiography images without
compromising the anatomical structures. By using multi-
structure segmentation as a downstream task we demon-
strate that augmenting the training data with enhanced
images improves the robustness. We expect UltraGAN to
be useful in other ultrasound problems to push forward
automated ultrasound analysis.
• Second, we present MedRobGAN, a method that inte-
grates adversarial optimization into the GAN framework
to create adversarial examples that are semantically hard
for a medical segmentation network. We evaluate our
framework on 8 diverse datasets for medical segmentation.
Using the adversarial examples for training achieves
competitive results in adversarial robustness for half
of the datasets. These results are promising and motivate further exploration of the potential of GAN augmentation for adversarial robustness.
ACKNOWLEDGMENTS
This thesis is part of an ongoing project with Angela Castillo; I am deeply thankful for her immense contributions. I would also like to thank Professor Pablo Arbeláez for his
guidance during this project and throughout my development
as a researcher. Finally, I thank the Biomedical Computer
Vision group for their advice and support.
REFERENCES
[1] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis, vol. 42, pp. 60-88. Elsevier (2017)
[2] Escobar, M., Castillo, A., Romero, A., Arbeláez, P.: UltraGAN: Ul-
trasound Enhancement Through Adversarial Generation. In International
Workshop on Simulation and Synthesis in Medical Imaging (pp. 120-
130). Springer, Cham. (2020)
[3] LeCun et al., ”Backpropagation Applied to Handwritten Zip Code
Recognition,” Neural Computation, 1, pp. 541–551, 1989.
[4] Pérez, J. C., Alfarra, M., Jeanneret, G., Bibi, A., Thabet, A., Ghanem, B.,
Arbeláez, P.: Gabor Layers Enhance Network Robustness. In European
Conference on Computer Vision (pp. 450-466). Springer, Cham.(2020)
[5] Alfarra, M., Pérez, J. C., Bibi, A., Thabet, A., Arbeláez, P., Ghanem,
B.: ClustTR: Clustering Training for Robustness. arXiv preprint
arXiv:2006.07682.(2020)
[6] Daza, L., Pérez, J.C., Gómez, C., Arbeláez, P.: Towards Robust General Medical Image Segmentation. Submitted to CVPR 2021.
[7] Abdi, A.H., Jafari, M.H., Fels, S., Tsang, T., Abolmaesumi, P.: A
study into echocardiography view conversion. In: Workshop of Medical
Imaging Meets NeurIPS (2019)
[8] Abdi, A.H., Tsang, T., Abolmaesumi, P.: Gan-enhanced conditional
echocardiogram generation. In: Workshop of Medical Imaging Meets
NeurIPS (2019)
[9] Shetty, R. , Fritz, M. , Schiele, B.: Towards automated testing and
robustification by semantic adversarial data generation. In: European
Conference on Computer Vision (ECCV) (2020)
[10] Zhu, P., Abdal, R., Qin, Y., Wonka, P.: SEAN: Image Synthesis
with Semantic Region-Adaptive Normalization. : Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(pp. 5104-5113) (2020).
[11] Heller, N., Isensee, F., Maier-Hein, K. H., Hou, X., Xie, C., Li, F., ...
Yao, G.: The state of the art in kidney and kidney tumor segmentation
in contrast-enhanced CT imaging: Results of the KiTS19 Challenge.
Medical Image Analysis, 67, 101821 (2019)
[12] Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training
deep neural networks with noisy labels. Advances in neural information
processing systems, 31, 8778-8788 (2018)
[13] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep
network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167 (2015)
[14] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on (2018)
[15] Abhishek, K., Hamarneh, G.: Mask2lesion: Mask-constrained adversar-
ial skin lesion image synthesis. In: Burgos, N., Gooya, A., Svoboda, D.
(eds.) Simulation and Synthesis in Medical Imaging. pp. 71–80. Springer
International Publishing, Cham (2019)
[16] Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D.:
Intelligent image synthesis to attack a segmentation cnn using adversarial
learning. In: Burgos, N., Gooya, A., Svoboda, D. (eds.) Simulation
and Synthesis in Medical Imaging. pp. 90–99. Springer International
Publishing, Cham (2019)
[17] Chen, Y.S., Wang, Y.C., Kao, M.H., Chuang, Y.Y.: Deep photo enhancer:
Unpaired learning for image enhancement from photographs with gans.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 6306–6314 (2018)
[18] Deshpande, A., Lu, J., Yeh, M.C., Jin Chong, M., Forsyth, D.: Learning
diverse image colorization. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 6837–6845 (2017)
[19] Duarte-Salazar, C.A., Castro-Ospina, A.E., Becerra, M.A., Delgado-
Trejos, E.: Speckle noise reduction in ultrasound images for improving
the metrological evaluation of biomedical applications: An overview.
IEEE Access 8, 15983–15999 (2020)
[20] Fritsche, M., Gu, S., Timofte, R.: Frequency separation for real-world
super-resolution. ICCV Workshop (2019)
[21] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets.
In: Advances in neural information processing systems. pp. 2672–2680
(2014)
[22] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation
with conditional adversarial networks. In: Computer Vision and Pattern
Recognition (CVPR), 2017 IEEE Conference on (2017)
[23] Jafari, M.H., Girgis, H., Abdi, A.H., Liao, Z., Pesteie, M., Rohling,
R., Gin, K., Tsang, T., Abolmaesumi, P.: Semi-supervised learning for
cardiac left ventricle segmentation using conditional deep generative
models as prior. In: 2019 IEEE 16th International Symposium on
Biomedical Imaging (ISBI 2019). pp. 649–652. IEEE (2019)
[24] Jafari, M.H., Girgis, H., Van Woudenberg, N., Moulson, N., Luong, C.,
Fung, A., Balthazaar, S., Jue, J., Tsang, M., Nair, P., et al.: Cardiac point-
of-care to cart-based ultrasound translation using constrained cyclegan.
International Journal of Computer Assisted Radiology and Surgery pp.
1–10 (2020)
[25] Jafari, M.H., Liao, Z., Girgis, H., Pesteie, M., Rohling, R., Gin, K.,
Tsang, T., Abolmaesumi, P.: Echocardiography segmentation by quality
translation using anatomically constrained cyclegan. In: Medical Image
Computing and Computer Assisted Intervention – MICCAI 2019. pp.
655–663. Springer International Publishing, Cham (2019)
[26] Lartaud, P.J., Rouchaud, A., Rouet, J.M., Nempont, O., Boussel, L.:
Spectral ct based training dataset generation and augmentation for
conventional ct vascular segmentation. In: International Conference on
Medical Image Computing and Computer-Assisted Intervention. pp.
768–775. Springer (2019)
[27] Leclerc, S., Smistad, E., Pedrosa, J., Østvik, A., Cervenansky, F.,
Espinosa, F., Espeland, T., Berg, E.A.R., Jodoin, P.M., Grenier, T., et al.:
Deep learning for segmentation using an open large-scale dataset in 2d
echocardiography. IEEE transactions on medical imaging 38(9), 2198–
2210 (2019)
[28] Liao, Z., Jafari, M.H., Girgis, H., Gin, K., Rohling, R., Abolmaesumi,
P., Tsang, T.: Echocardiography view classification using quality transfer
star generative adversarial networks. In: International Conference on
Medical Image Computing and Computer-Assisted Intervention. pp.
687–695. Springer (2019)
[29] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv
preprint arXiv:1411.1784 (2014)
[30] Oliva, A., Torralba, A., Schyns, P.G.: Hybrid images. ACM Transactions
on Graphics (TOG) 25(3), 527–532 (2006)
[31] Ortiz, S.H.C., Chiu, T., Fox, M.D.: Ultrasound image enhancement: A
review. Biomedical Signal Processing and Control 7(5), 419–428 (2012)
[32] Romero, A., Arbeláez, P., Van Gool, L., Timofte, R.: Smit: Stochastic
multi-label image-to-image translation. In Proceedings of the IEEE
International Conference on Computer Vision Workshops (2019)
[33] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks
for biomedical image segmentation. In: International Conference on
Medical image computing and computer-assisted intervention. pp. 234–
241. Springer (2015)
[34] Wang, X., Chan, K.C., Yu, K., Dong, C., Loy, C.C.: Edvr: Video
restoration with enhanced deformable convolutional networks. In: The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops (June 2019)
[35] Yang, H., Sun, J., Carass, A., Zhao, C., Lee, J., Xu, Z., Prince, J.:
Unpaired brain mr-to-ct synthesis using a structure-constrained cyclegan.
In: Deep Learning in Medical Image Analysis and Multimodal Learning
for Clinical Decision Support. pp. 174–182. Springer International
Publishing, Cham (2018)
[36] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image
translation using cycle-consistent adversarial networkss. In: Computer
Vision (ICCV), 2017 IEEE International Conference on (2017)
[37] Anurag Arnab, Ondrej Miksik, and Philip HS Torr. On the robustness
of semantic segmentation models to adversarial attacks. In CVPR, 2018.
[38] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel,
Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and
Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint
arXiv:1902.06705, 2019.
[39] Nicholas Carlini and David Wagner. Towards evaluating the robustness
of neural networks. In 2017 IEEE Symposium on Security and Privacy
(SP), 2017.
[40] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial
robustness with an ensemble of diverse parameter-free attacks. In
International Conference on Machine Learning (ICML), 2020.
[41] Yinpeng Dong, Qi-An Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao
Xiao, and Jun Zhu. Benchmarking adversarial robustness on image
classification. In CVPR, 2020.
[42] Jan Hendrik Metzen, Mummadi Chaithanya Kumar, Thomas Brox, and
Volker Fischer. Universal adversarial perturbations against semantic
image segmentation. In ICCV, 2017.
[43] Yingwei Li, Zhuotun Zhu, Yuyin Zhou, Yingda Xia, Wei Shen, Elliot K
Fishman, and Alan L Yuille. Volumetric medical image segmentation: A
3d deep coarse-to-fine framework and its adversarial examples. In Deep
Learning and Convolutional Neural Networks for Medical Imaging and
Clinical Informatics. 2019.
[44] Qi Liu, Han Jiang, Tao Liu, Zihao Liu, Sicheng Li, Wujie Wen, and Yiyu
Shi. Defending deep learning-based biomedical image segmentation
from adversarial attacks: A low-cost frequency refinement approach. In
Medical Image Computing and Computer-Assisted Intervention (MIC-
CAI), 2020.
[45] Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James
Bailey, and Feng Lu. Understanding adversarial attacks on deep learning
based medical image analysis systems. Pattern Recognition, 2020.
[46] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris
Tsipras, and Adrian Vladu. Towards deep learning models resistant
to adversarial attacks. In ICLR, 2018.
[47] Chaithanya Kumar Mummadi, Thomas Brox, and Jan Hendrik Metzen.
Defending against universal perturbations with shared adversarial train-
ing. In ICCV, 2019.
[48] Utku Ozbulak, Arnout Van Messem, and Wesley De Neve. Impact of
adversarial examples on deep learning models for biomedical image
segmentation. In Medical Image Computing and Computer-Assisted
Intervention (MICCAI), 2019.
[49] Amber L. Simpson, Michela Antonelli, Spyridon Bakas, Michel Bilello,
Keyvan Farahani, Bram van Ginneken, Annette Kopp-Schneider, Ben-
nett A. Landman, Geert J. S. Litjens, Bjoern H. Menze, Olaf Ron-
neberger, Ronald M. Summers, Patrick Bilic, Patrick Ferdinand Christ,
Richard K. G. Do, Marc Gollub, Jennifer Golia-Pernicka, Stephan
Heckers, William R. Jarnagin, Maureen McHugo, Sandy Napel, Eugene
Vorontsov, Lena Maier-Hein, and M. Jorge Cardoso. A large annotated
medical image dataset for the development and evaluation of segmen-
tation algorithms. CoRR, abs/1902.09063, 2019.
[50] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna,
Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties
of neural networks. In ICLR, 2014.
[51] Cihang Xie, Mingxing Tan, Boqing Gong, Alan L. Yuille, and Quoc V.
Le. Smooth adversarial training. CoRR, abs/2006.14536, 2020.
[52] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie,
and Alan Yuille. Adversarial examples for semantic segmentation and
object detection. In ICCV, 2017.