StarGAN is a method for multi-domain image-to-image translation using a single model. Its discriminator is trained with an adversarial loss that uses a gradient penalty. The generator is trained to translate images to different domains based on a target label, to reconstruct the original image, and to minimize its classification and adversarial losses. StarGAN can be trained on multiple datasets by using mask vectors to ignore unknown domain labels. It achieves high-quality image translation across different facial attributes and expressions.
2. Intro
▪ StarGAN is a method for image-to-image translation across multiple domains using only a
single model, i.e. multi-domain image-to-image translation.
▪ Existing models are both inefficient and ineffective in such multi-domain image translation
tasks: in order to learn all mappings among k domains, k(k−1) generators have to
be trained, one per ordered (source, target) pair.
▪ CelebA: The CelebFaces Attributes (CelebA) dataset
▪ RaFD: The Radboud Faces Database (RaFD)
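The k(k−1) scaling argument above can be sketched numerically (a toy calculation, not from the paper's code):

```python
# Cross-domain translation without a shared model needs one generator
# per ordered (source, target) domain pair: k * (k - 1) in total.
def pairwise_generators(k: int) -> int:
    return k * (k - 1)

# With e.g. 7 domains a pairwise approach needs 42 generators,
# while StarGAN uses a single one.
print(pairwise_generators(7))  # → 42
```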
4. 1. Discriminator structure and training
1. The discriminator uses the Wasserstein GAN objective with gradient penalty
for its adversarial loss:
ℒ_adv = 𝔼_x[D_src(x)] − 𝔼_{x,c}[D_src(G(x, c))] − λ_gp 𝔼_x̂[(‖∇_x̂ D_src(x̂)‖₂ − 1)²]
(Where λ_gp = 10 is a hyperparameter for the gradient penalty, and x̂ is sampled
uniformly along straight lines between pairs of real and generated images.)
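The gradient-penalty term can be illustrated with a deliberately tiny 1-D example (a sketch using a hypothetical linear critic, not the paper's network): for D(x) = w·x the gradient with respect to x is just w, so the penalty pushes the critic's gradient norm toward 1.

```python
LAMBDA_GP = 10.0  # λ_gp from the slide

def gradient_penalty(w: float) -> float:
    # For the toy critic D(x) = w * x, ‖∇_x D(x)‖ = |w|,
    # so the WGAN-GP term is λ_gp * (|w| - 1)^2.
    grad_norm = abs(w)
    return LAMBDA_GP * (grad_norm - 1.0) ** 2

print(gradient_penalty(1.0))  # → 0.0 (unit gradient norm: no penalty)
print(gradient_penalty(3.0))  # → 40.0 (penalised toward 1-Lipschitz)
```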
2. An auxiliary classifier on top of the discriminator classifies images into
domain labels. (c′: original label, x: input image)
ℒ_cls^r = 𝔼_{x,c′}[−log D_cls(c′|x)] : classification loss for real images
ℒ_cls^f = 𝔼_{x,c}[−log D_cls(c|G(x, c))] : classification loss for fake images
ℒ_rec = 𝔼_{x,c,c′}[‖x − G(G(x, c), c′)‖₁] : reconstruction loss (L1 norm)
ℒ_D = −ℒ_adv + λ_cls ℒ_cls^r : discriminator loss
(Where λ_cls = 1 is a hyperparameter weighting the classification loss
against the adversarial loss.)
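The classification term is a standard cross-entropy, i.e. the negative log-probability that the auxiliary classifier assigns to the true label; a minimal sketch with made-up probabilities:

```python
import math

def cls_loss(probs: list, true_label: int) -> float:
    # -log D_cls(c'|x): penalises low probability on the true domain label.
    return -math.log(probs[true_label])

# Hypothetical classifier output over 3 domain labels.
probs = [0.7, 0.2, 0.1]
print(round(cls_loss(probs, 0), 4))  # confident and correct → small loss
print(round(cls_loss(probs, 2), 4))  # wrong label favoured → large loss
```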
5. 2. Generator structure and training
▪ The input image is concatenated with the target domain
label (after it has been spatially replicated to match the image's spatial dimensions).
▪ The generated fake image is concatenated with the original
domain label to generate a reconstructed image of the original
domain.
▪ A cycle consistency loss is calculated from the input image
and the reconstructed image.
ℒ_rec = 𝔼_{x,c,c′}[‖x − G(G(x, c), c′)‖₁]
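The cycle-consistency term is a plain mean absolute (L1) difference; a toy sketch on flattened "images" with made-up pixel values:

```python
def l1_loss(x: list, x_rec: list) -> float:
    # Mean absolute difference between input x and reconstruction G(G(x, c), c').
    return sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)

x = [0.0, 0.5, 1.0]      # toy "input image"
x_rec = [0.1, 0.4, 1.0]  # toy "reconstruction"
print(l1_loss(x, x_rec)) # ≈ 0.0667
```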
▪ The resulting generator loss is
ℒ_G = ℒ_adv + λ_cls ℒ_cls^f + λ_rec ℒ_rec
(Where λ_cls = 1 and λ_rec = 10.)
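Combining the terms with the stated weights (the loss values below are hypothetical scalars, not real network outputs):

```python
LAMBDA_CLS = 1.0   # λ_cls
LAMBDA_REC = 10.0  # λ_rec

def generator_loss(l_adv: float, l_cls_f: float, l_rec: float) -> float:
    # ℒ_G = ℒ_adv + λ_cls * ℒ_cls^f + λ_rec * ℒ_rec
    return l_adv + LAMBDA_CLS * l_cls_f + LAMBDA_REC * l_rec

# With λ_rec = 10, even a small reconstruction error
# contributes strongly to the total.
print(generator_loss(0.5, 0.3, 0.07))  # ≈ 1.5
```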
6. Results of StarGAN-SNG
The model is trained on the CelebA and RaFD datasets individually. Afterwards, images from
the CelebA dataset are used to transfer the features learned during training.
7. Comparison with other GANs
StarGAN's superior image quality is perhaps because it can use all
images from all available domains for its
training, instead of only the images of the
source and target domains.
The regularization effect of StarGAN through a
multi-task learning framework allows it to learn
reliable features universally applicable to
multiple domains of images with different
facial attribute values, rather than training a
model to perform a fixed translation, which is
prone to overfitting.
9. Training with Multiple Datasets: Mask Vectors
▪ When using multiple datasets, e.g. CelebA and RaFD, the label information is only partially
known for each dataset: CelebA has no expression labels and RaFD has no attribute labels.
▪ This is problematic because the complete information on the label vector c’ is required when
reconstructing the input image x from the translated image G(x, c).
▪ StarGAN uses a mask vector m, an n-dimensional one-hot vector, that allows it to ignore
unspecified labels and focus on the explicitly known label provided by a particular dataset.
▪ The resulting label vector is c̃ = [c₁, ..., cₙ, m], where unknown labels are assigned zero values.
▪ The discriminator minimizes only the classification error associated with the known labels.
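A minimal sketch of building the unified label vector [c₁, ..., cₙ, m]. The label dimensions and layout here are assumptions for illustration; the paper uses selected CelebA attributes and RaFD expressions:

```python
N_CELEBA = 5  # assumed number of CelebA attribute labels
N_RAFD = 8    # assumed number of RaFD expression labels

def unified_label(celeba=None, rafd=None):
    # Unknown labels are zeroed; the one-hot mask m marks which
    # dataset's labels are actually known for this sample.
    c1 = celeba if celeba is not None else [0] * N_CELEBA
    c2 = rafd if rafd is not None else [0] * N_RAFD
    m = [1, 0] if celeba is not None else [0, 1]
    return c1 + c2 + m

# A CelebA sample: attributes known, expression slots zeroed, mask = [1, 0].
print(unified_label(celeba=[1, 0, 0, 1, 1]))
```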
10. Results of StarGAN-JNT
StarGAN-JNT exhibits emotional
expressions with high visual quality, while
StarGAN-SNG generates reasonable but
blurry images with gray backgrounds.
This is probably because StarGAN-JNT
can leverage both the CelebA and RaFD
datasets to improve shared low-level tasks
such as facial keypoint detection and
segmentation, whereas StarGAN-SNG does
not learn to translate CelebA images during
training.
11. Network Architecture:
The Generator
IN: Instance Normalization
Similar to batch normalization but
normalization is done for each sample,
not the entire batch.
In convolutional networks, every
channel of every sample is
normalized as a unit, so n·c
normalizations are performed per batch
(n: number of samples in the batch, c: number
of channels in the layer input).
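The per-sample, per-channel normalization can be sketched on a toy (n, c, length) nested list (a pure-Python illustration, not the paper's implementation):

```python
def instance_norm(batch, eps=1e-5):
    # Each (sample, channel) slice is normalized with its OWN mean and
    # variance -> n * c independent normalizations per batch.
    out = []
    for sample in batch:                  # n samples
        norm_sample = []
        for channel in sample:            # c channels per sample
            mean = sum(channel) / len(channel)
            var = sum((v - mean) ** 2 for v in channel) / len(channel)
            norm_sample.append([(v - mean) / (var + eps) ** 0.5 for v in channel])
        out.append(norm_sample)
    return out

batch = [[[1.0, 3.0], [10.0, 20.0]]]  # n=1 sample, c=2 channels
normed = instance_norm(batch)
print(normed[0][0])  # each channel now has ~zero mean, ~unit variance
```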
12. Network Architecture:
The Discriminator
Rather complicated hyperparameter
tuning and learning rate scheduling
are used.
The Discriminator is a PatchGAN network, which classifies whether local image patches
are real or fake.
In all Leaky ReLU activations, the negative slope is 0.01.
All models are trained using an Adam optimizer with 𝛽1 = 0.5, 𝛽2 = 0.999.
Batch size is set to 16 for all experiments, and all images are randomly flipped
horizontally with probability 0.5.
On CelebA, the learning rate is 1e-4 for the first 10 epochs and linearly decayed to 0
over the next 10 epochs.
On RaFD, the learning rate is 1e-4 for 100 epochs and linearly decayed to 0 over the
next 100 epochs. This is because there is less data in RaFD than in CelebA.
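The schedule described above can be written as a small function (a sketch of the described schedule, not the authors' code; the RaFD case corresponds to keep=decay=100):

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-4,
                keep: int = 10, decay: int = 10) -> float:
    # Constant base_lr for the first `keep` epochs, then linear decay
    # to 0 over the next `decay` epochs (CelebA defaults).
    if epoch < keep:
        return base_lr
    return base_lr * max(0.0, 1.0 - (epoch - keep) / decay)

print(lr_at_epoch(0))   # → 0.0001
print(lr_at_epoch(15))  # → 5e-05 (halfway through the decay phase)
print(lr_at_epoch(20))  # → 0.0
```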