FAST VIDEO ARTISTIC TRANSFER VIA MOTION COMPENSATION
Techniques for converting natural video scenes into drawing-style videos are frequently used to produce animated movies. In the past, the conversion was performed manually, which demanded a lot of time and incurred a high production cost. Recently, with the advancement of computer vision techniques and the development of new deep learning algorithms, 'drawing' can be performed automatically. Nevertheless, current 'drawing' algorithms are computationally expensive and require long processing times. In this letter, we present a simple but effective 'drawing' algorithm that is capable of reducing the processing time.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitations of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a key technology for realizing applications such as HCI and AR. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited because of the similar appearance of the fingers, occlusions, and the complexity arising from the wide range of finger motions. To overcome the limitations of existing methods, this paper changes both the input and output representations they use. Unlike most previous methods, which take a 2D depth image as input and directly regress the 3D coordinates of the hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. To this end, an encoder-decoder style 3D CNN is used, and thanks to the changed input and output representations, the proposed model achieves the highest performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
https://imatge.upc.edu/web/publications/importance-time-visual-attention-models
Bachelor thesis by Marta Cool, advised by Kevin McGuinness (Dublin City University) and Xavier Giro-i-Nieto (Universitat Politecnica de Catalunya).
Predicting visual attention is a very active field in the computer vision community. Visual attention is a mechanism of the visual system that can select relevant areas within a scene. Models for saliency prediction are intended to automatically predict which regions are likely to be attended by a human observer. Traditionally, ground truth saliency maps are built using only the spatial position of the fixation points, these fixation points being the locations where an observer fixates the gaze when viewing a scene. In this work we explore encoding the temporal information as well, and assess it in the application of predicting saliency maps with deep neural networks. It has been observed that the later fixations in a scanpath are usually selected randomly during visualization, especially in those images with few regions of interest. Therefore, computer vision models have difficulties learning to predict them. In this work, we explore a temporal weighting over the saliency maps to better cope with this random behaviour. The newly proposed saliency representation assigns different weights depending on the position in the sequence of gaze fixations, giving more importance to early timesteps than later ones. We used these maps to train MLNet, a state-of-the-art model for predicting saliency maps. MLNet predictions were evaluated and compared to the results obtained when the model was trained using traditional saliency maps. Finally, we show how the temporally weighted saliency maps brought some improvement when used to weight the visual features in an image retrieval task.
Intel, Intelligent Systems Lab:
Stable View Synthesis Whitepaper
We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images.
The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view.
The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection.
Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes.
A systematic image compression in the combination of linear vector quantisati...
Intel Intelligent Systems Labs:
Enhancing Photorealism Enhancement
Abstract:
We present an approach to enhancing the realism of synthetic images. The images are enhanced by a convolutional network that leverages intermediate representations produced by conventional rendering pipelines. The network is trained via a novel adversarial objective, which provides strong supervision at multiple perceptual levels. We analyze scene layout distributions in commonly used datasets and find that they differ in important ways. We hypothesize that this is one of the causes of strong artifacts that can be observed in the results of many prior methods. To address this we propose a new strategy for sampling image patches during training. We also introduce multiple architectural improvements in the deep network modules used for photorealism enhancement. We confirm the benefits of our contributions in controlled experiments and report substantial gains in stability and realism in comparison to recent image-to-image translation methods and a variety of other baselines.
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
Existing medical imaging techniques such as fMRI, positron emission tomography (PET), dynamic 3D ultrasound, and dynamic computerized tomography yield large four-dimensional data sets. 4D medical data sets are series of volumetric images captured over time; they are large in size and demand a great deal of resources for storage and transmission. In this paper, we present a method wherein a 3D image is taken and the Discrete Wavelet Transform (DWT) and Dual-Tree Complex Wavelet Transform (DTCWT) are applied to it separately, splitting the image into sub-bands. Encoding and decoding are done using 3D-SPIHT at different bits per pixel (bpp). The reconstructed image is synthesized using the inverse DWT. The quality of the compressed image is evaluated using measures such as Mean Square Error (MSE) and Peak Signal-to-Noise Ratio (PSNR).
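As a rough illustration of the analysis/synthesis round trip described above, the sketch below performs a 2D DWT with PyWavelets, applies a crude uniform quantization in place of the SPIHT coder, reconstructs with the inverse transform, and reports PSNR; the wavelet, decomposition level, quantization step, and synthetic test image are assumptions for illustration, not the paper's settings.

# Minimal sketch (not the paper's pipeline): DWT analysis/synthesis round trip with a PSNR check.
import numpy as np
import pywt

def psnr(original, reconstructed, peak=255.0):
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def dwt_roundtrip(image, wavelet="db4", level=2, q_step=8.0):
    # Analysis: split the image into sub-bands.
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    # Crude stand-in for the SPIHT coder: uniform quantization of every sub-band.
    quantized = [np.round(coeffs[0] / q_step) * q_step]
    for detail in coeffs[1:]:
        quantized.append(tuple(np.round(band / q_step) * q_step for band in detail))
    # Synthesis: inverse DWT from the quantized sub-bands.
    return pywt.waverec2(quantized, wavelet)

if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(256, 256)).astype(np.float64)  # stand-in image
    rec = dwt_roundtrip(img)[:256, :256]                                 # crop possible padding
    print(f"PSNR after round trip: {psnr(img, rec):.2f} dB")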
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight Portraits for Background Replacement
Abstract:
Given a portrait and an arbitrary high dynamic range lighting environment, our framework uses machine learning to composite the subject into a new scene, while accurately modeling their appearance in the target illumination condition. We estimate a high quality alpha matte, foreground element, albedo map, and surface normals, and we propose a novel, per-pixel lighting representation within a deep learning framework.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Most existing co-segmentation methods are complex and require pre-grouping of images, fine-tuning of a few parameters, initial segmentation masks, and so on. These limitations become serious concerns for their application to large-scale datasets. In this paper, a Group Saliency Propagation (GSP) model is proposed in which a single group saliency map is developed and propagated to segment the entire group. In addition, it is shown how a pool of these group saliency maps can help in quickly segmenting new input images. Experiments demonstrate that the proposed method achieves competitive performance on several benchmark co-segmentation datasets, including ImageNet, with the added advantage of a speed-up.
Compression of video files is currently in high demand. Color video compression has become a significant technology for reducing memory usage and transmission time. Video compression using the fractal technique is based on the self-similarity concept, comparing range blocks and domain blocks; however, its computational complexity is very high. In this paper we present a hybrid video compression technique to compress Audio/Video Interleaved files and overcome the problem of computational complexity. We implemented the Discrete Wavelet Transform and a hybrid fractal HV partition technique using Particle Swarm Optimization (called mapping of PSO) for video compression. The analysis demonstrates that the hybrid technique gives a very good speed-up for video compression and achieves a good Peak Signal-to-Noise Ratio.
COMPRESSION ALGORITHM SELECTION FOR MULTISPECTRAL MASTCAM IMAGES
The two mast cameras (Mastcam) onboard the Mars rover Curiosity are multispectral imagers with nine bands in each camera. Currently, the images are compressed losslessly using JPEG, which can achieve only two to three times compression. We present a two-step approach to compressing multispectral Mastcam images. First, we propose to apply principal component analysis (PCA) to compress the nine bands into three or six bands. This step optimally compresses the 9-band images through the spectral correlation between the bands. Second, several well-known image compression codecs, such as JPEG, JPEG-2000 (J2K), X264, and X265, are applied to compress the 3-band or 6-band images coming out of PCA. The performance of the different algorithms was assessed using four well-known performance metrics. Extensive experiments using actual Mastcam images have been performed to demonstrate the proposed framework. We observed that perceptually lossless compression can be achieved at a 10:1 compression ratio. In particular, the performance gain of the combination of PCA and X265 over JPEG is at least 5 dB in terms of peak signal-to-noise ratio (PSNR) at a 10:1 compression ratio.
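A minimal sketch of the first step only (the spectral PCA) follows, using scikit-learn; the band count, the number of retained components, and the synthetic data are assumptions for illustration, not the paper's pipeline.

# Project a 9-band image cube onto 3 principal components and measure the spectral reconstruction PSNR.
import numpy as np
from sklearn.decomposition import PCA

def spectral_pca_roundtrip(cube, n_components=3):
    h, w, bands = cube.shape
    pixels = cube.reshape(-1, bands)             # one row per pixel, one column per band
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(pixels)          # (h*w, n_components)
    restored = pca.inverse_transform(reduced)    # back to (h*w, bands)
    return reduced.reshape(h, w, n_components), restored.reshape(h, w, bands)

def psnr(a, b, peak=255.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

if __name__ == "__main__":
    cube = np.random.rand(64, 64, 9) * 255.0     # stand-in for a 9-band Mastcam image
    reduced, restored = spectral_pca_roundtrip(cube)
    print(reduced.shape, f"spectral PSNR: {psnr(cube, restored):.2f} dB")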
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES
Human action recognition is still a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition. It extracts stable spatio-temporal features in the form of a pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features capture the motion of individuals well, and their consistency and accuracy remain high on a challenging dataset. The experimental evaluation is conducted on a benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Abstract. We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm in which we sample and render virtual camera trajectories, including cyclic ones, allowing our model to learn stable view generation from a collection of single views. At test time, despite never seeing a video during training, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality.
Decomposing image generation into layout prediction and conditional synthesis
In this presentation you can learn how to decompose image generation into layout prediction and conditional synthesis. The material is presented in a convenient way; I hope you find it easy to follow. Thank you.
Style transfer aims to combine the content of one image with the artistic style of another. It was discovered that lower levels of convolutional networks capture style information, while higher levels capture content information. The original style transfer formulation used a weighted combination of VGG-16 layer activations to achieve this goal. Later, this was accomplished in real time by using a feed-forward network to learn the optimal combination of style and content features from the respective images. The first aim of our project was to introduce a framework for capturing the style from several images at once. We propose a method that extends the original real-time style transfer formulation by combining the features of several style images. This method successfully captures color information from the separate style images. The other aim of our project was to improve the temporal style continuity from frame to frame. Accordingly, we have experimented with the temporal stability of the output images and discussed the various available techniques that could be employed as alternatives.
Recognition and tracking of moving objects using a moving camera in complex scenes
In this paper, we propose a method for effectively tracking moving objects in videos captured using a moving camera in complex scenes. The video sequences may contain highly dynamic backgrounds and illumination changes. Four main steps are involved in the proposed method. First, the video is stabilized using an affine transformation. Second, intelligent selection of frames is performed in order to extract only those frames that have a considerable change in content; this step reduces complexity and computational time. Third, the moving object is tracked using a Kalman filter and a Gaussian mixture model. Finally, object recognition using bag-of-features is performed in order to recognize the moving objects.
Image Captioning Generator using Deep Machine Learning
Technology's scope has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become some of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used to assist visually impaired persons, in automobiles for self-identification, and in various applications that require quick and easy verification. In this model, a Convolutional Neural Network (CNN) is used to describe the image, and a Long Short-Term Memory (LSTM) network is used to compose meaningful sentences. The Flickr8k and Flickr30k datasets were used for training. Sreejith S P | Vijayakumar A, "Image Captioning Generator using Deep Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
Performance analysis on color image mosaicing techniques on FPGA
Today, surveillance systems and other monitoring systems consider capturing image sequences and combining them into a single frame. The captured images can be combined to obtain a mosaiced image of the image sequence. However, the captured images may have quality issues such as brightness, alignment (correlation), resolution, and manual image registration issues. An existing technique like cross-correlation can offer good image mosaicing but faces brightness issues. Thus, this paper introduces two different methods for mosaicing: (a) Sliding Window Module (SWM) based Color Image Mosaicing (CIM) and (b) Discrete Cosine Transform (DCT) based CIM on a Field Programmable Gate Array (FPGA). The SWM-based CIM performs corner detection on two images and carries out automatic image registration, while the DCT-based CIM handles both the local and global alignment of images using a phase correlation approach. Finally, the performance of these two methods is analyzed by comparing parameters such as PSNR, MSE, device utilization, and execution time. From the analysis it is concluded that the DCT-based CIM offers better results than the SWM-based CIM.
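As a generic illustration of the phase correlation idea used for alignment in the DCT-based CIM, the NumPy sketch below estimates the translational offset between two overlapping images; it is not the paper's FPGA implementation, and the test data and expected offset are assumptions.

# Phase correlation: estimate the (row, col) translation between two images via the FFT.
import numpy as np

def phase_correlation(img_a, img_b):
    """Estimate the (row, col) translation of img_b relative to img_a."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    cross_power = np.conj(fa) * fb
    cross_power /= np.abs(cross_power) + 1e-12           # keep only the phase difference
    correlation = np.abs(np.fft.ifft2(cross_power))
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks beyond the midpoint correspond to negative (wrap-around) shifts.
    return tuple(int(p) if p <= s // 2 else int(p - s) for p, s in zip(peak, correlation.shape))

if __name__ == "__main__":
    base = np.random.rand(128, 128)
    moved = np.roll(base, shift=(5, -9), axis=(0, 1))     # displaced copy with a known offset
    print(phase_correlation(base, moved))                 # expected: (5, -9)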
BIG DATA-DRIVEN FAST REDUCING THE VISUAL BLOCK ARTIFACTS OF DCT COMPRESSED IM...
Urban surveillance systems generate huge amounts of video and image data and place high pressure on recording disks. Video research is therefore a key part of big data research, and since videos are composed of images, the degree and efficiency of image compression are of great importance. Although the DCT-based JPEG standard is widely used, it has notable problems; for instance, image encoding deficiencies such as block artifacts frequently have to be removed. In this paper, we propose a new, simple but effective method to quickly reduce the visual block artifacts of DCT-compressed images for urban surveillance systems. The simulation results demonstrate that our proposed method achieves better quality than widely used filters while consuming far fewer CPU resources.
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Video-to-Video Translation using Cycle-Consistent Adversarial Networks
Alessandro Calmanovici
Supervisor: Zhaopeng Cui
Computer Vision and Geometry Group
Institute for Visual Computing
ETH Zurich
November 27, 2019
Abstract
Image-to-image translation is the task of translating an image to a different style or domain given paired or unpaired image examples at training time. Video-to-video translation, however, is a harder task. Translating a video means not only learning the structural features and appearance of objects and different scenes, but also producing realistic transitions and temporally consistent passages between consecutive frames. In this report we explore new ideas and approaches to video-to-video translation using existing image-to-image translation networks, in particular CycleGANs. We investigate how a new loss term of the network, which takes into account the flow information between two consecutive frames, can improve the performance on the task. We focus on a specific style transfer, namely translating a video from day to night and vice versa. We compare our results to a baseline obtained by transferring day to night with a standard CycleGAN applied to each frame of our dataset, and propose further possible optimizations of the model.
1 Introduction

Recently the image-to-image translation topic has experienced a big growth, and most of the approaches are based on deep neural networks. Gatys et al. [2] perform image style transfer using convolutional neural networks, which manage to separate image content from style and obtain explicit representations of semantic information. In particular, they show how to produce new images of high perceptual quality that combine the content of an arbitrary photograph with the appearance of numerous well-known artworks.

A similar attempt is [3], where the authors propose a new algorithm for color transfer between images that have perceptually similar semantic structure, optimizing a linear model with both local and global constraints. Their method also exploits neural representations, namely deep features extracted from a CNN encoding.

Another original idea is presented in [10]. The authors propose an approach (CycleGAN) for learning to translate an image from a source domain X to a target domain Y in the absence of paired training examples. Their results show good performance on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Again, they use convolutional and deep neural networks to extract features from the two domains.

Video-to-video translation, however, is a harder problem. The direct application of image-based approaches to videos may lead to many inconsistencies, one of the major issues being the lack of explicit information about temporal constraints between images during the training process. What's more, most image-based methods require paired data, and matching frames between videos is still an open problem.

In [6], Ruder et al. presented an approach that transfers the style from one image (for example, a painting) to a whole video sequence. Processing each frame of the video independently leads to flickering and false discontinuities, since the solution of the style transfer task is not stable. To tackle this problem, they introduce a temporal constraint that penalizes deviations between two frames using the optical flow of the original video. Besides that, they also initialize the optimization for frame i + 1 with the stylized frame i. Very recently, Wang et al. [8] proposed a novel video-to-video synthesis approach under the generative adversarial learning framework which requires paired input training data. They achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses.
This project instead aims to present an approach to learn how to translate a video from a source domain X to a target domain Y, without the constraint of having precise paired X and Y inputs. The first steps are attempts that consist in simply applying image-based approaches directly to videos, using post-processing techniques and showing the poor results obtained. We then propose a well-performing solution for the video-to-video translation task which can avoid frame matching between videos by using cycle-consistent adversarial networks. We show how to improve an image-based CycleGAN translation from domain A to domain B by proposing a new loss term included in the net loss function. The new term makes use of flow information between consecutive frames of a video to ensure better temporal consistency of the produced outputs.
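As a rough illustration of the kind of flow-based term just described, the sketch below warps the previously translated frame with the optical flow of the source video and penalizes its difference from the current translated frame. The function names, the L1 penalty, and the weight are assumptions for illustration; they are not the exact loss used in this report.

# Hedged sketch of a flow-based temporal consistency term (PyTorch).
import torch
import torch.nn.functional as F

def warp_with_flow(prev_frame, flow):
    """Backward-warp prev_frame (N,C,H,W) with a dense flow field (N,2,H,W) given in pixels."""
    n, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_frame.device)   # (2,H,W), x then y
    coords = base.unsqueeze(0) + flow                                   # displaced sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N,H,W,2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

def temporal_consistency_loss(curr_translated, prev_translated, flow, weight=10.0):
    # Penalize deviation between the current output and the flow-warped previous output.
    warped_prev = warp_with_flow(prev_translated, flow)
    return weight * torch.mean(torch.abs(curr_translated - warped_prev))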
2 CycleGAN

In this section I describe the CycleGAN architecture in more detail and the general idea behind it. Most of what I write here is based on the original paper [10]. The supporting structure of a CycleGAN, as the name itself suggests, is a GAN, a Generative Adversarial Network. GANs have achieved impressive results in image generation [1, 5], image editing [9], and representation learning [4, 5, 7]. The main feature of GANs is the concept of "adversarial loss", which guarantees, theoretically, that the generated images are indistinguishable from the real ones. The structure of these nets is based on a generator network and a discriminator network, which compete against each other: the generator tries to trick the discriminator by creating images more and more similar to the real ones, while the discriminator learns over time how to distinguish the real images from the generator's fake ones. CycleGANs exploit the adversarial loss and implement another key idea, cycle consistency. The idea of using transitivity as a way to regularize structured data has a long history; in visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [44]. In this case the authors introduce a loop in the GAN architecture to ensure that, starting from a fake generated image, it is possible to retrieve the original image which was the input of the net. The following figure shows the high-level flow of the CycleGAN.
We are given one set of images in domain X and a different set in domain Y. Training a mapping G: X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y "does not guarantee that the individual inputs and outputs x and y are paired up in a meaningful way: there are infinitely many mappings G that will induce the same distribution over y". The model learns two mapping functions, G: X → Y and F: Y → X, with their associated discriminators DY and DX. As described before, the discriminators aim to distinguish a real image from a generated one. The other two images, (b) and (c), represent the idea of the cycle-consistency loss. When the input x is fed to the generator G, it produces an image ŷ which is indistinguishable for DY from an original image belonging to domain Y. In the same way, the generator F learns to transform an image y into a fake x̂. The cycle loss ensures that from the two fake images it is still possible to go back to the original inputs. The generator G is therefore also applied to x̂ to produce another y which should be as close as possible to the original y, and the same applies, vice versa, to generator F. In formulas: x → G(x) → F(G(x)) ≈ x and y → F(y) → G(F(y)) ≈ y.
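To make the two ingredients concrete, here is a minimal PyTorch-style sketch of the adversarial and cycle-consistency terms in the form commonly used for CycleGAN; G, F, and D_Y are assumed user-defined networks, and the least-squares adversarial form and the weight lambda_cyc = 10 are common defaults rather than values taken from this report.

# Minimal sketch of the CycleGAN adversarial and cycle-consistency losses.
import torch
import torch.nn as nn

l1 = nn.L1Loss()
mse = nn.MSELoss()

def cycle_consistency_loss(G, F, real_x, real_y, lambda_cyc=10.0):
    # x -> G(x) -> F(G(x)) should come back to x, and symmetrically for y.
    rec_x = F(G(real_x))
    rec_y = G(F(real_y))
    return lambda_cyc * (l1(rec_x, real_x) + l1(rec_y, real_y))

def generator_adversarial_loss(G, D_Y, real_x):
    # The generator tries to make D_Y label its fake outputs as real (target = 1).
    fake_y = G(real_x)
    pred = D_Y(fake_y)
    return mse(pred, torch.ones_like(pred))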
3 Proposed Methods
We tried several approaches to improve the quality of video translation
using a CycleGAN. We started with basic postprocessing methods applied
to CycleGAN outputs, which didn’t lead to good results. A good video
translation quality was instead obtained by directly modifying the structure
of the CycleGAN itself. Here we describe our experiments and show the
results for each of them.
The goal of the task is to translate a video from daylight to night as
well as possible according to human evaluation. All the experiments below
therefore share this purpose. The CycleGAN architecture used is the one
published in [11] (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix).
The dataset used for the experiments is composed of two videos, both recorded
manually with a phone and 59 seconds long: the scene is
exactly the same for both, but the first one is recorded in daylight and the
second one at night. In Figure 1 we show some sample images of the dataset.
The images are resized to 300x300 before being processed by the net, and
the outputs are 300x300 as well. The reason is that CycleGAN is quite
memory-intensive as four networks (two generators and two discriminators)
need to be loaded on a single GPU, so larger images do not fit in memory.
Figure 1: Dataset samples: the first row shows daylight images and the second
row shows night images.
3.1 Baseline method
Our baseline is the following. We extract all the frames from both day
and night videos, then we sample one random frame every 10 frames for
both of them, which results in 196 images for daylight and 196 images for
night. The sampled images are the training data for the CycleGAN. We
train the net on this data for 200 epochs. We get two new generators A
and B which have learned how to transform a day image into a night image
and vice versa. We then apply the generator A to all the frames extracted
from the daylight video and create a new night video out of them, which
is compared to the original night video. In Figure 2 we show 3 original images
and the corresponding translated ones, while in Figure 3 we show 4 consecutive
frames translated by the CycleGAN.
Figure 2: Baseline samples: the first row shows ground truth images and the
second row shows CycleGAN outputs.
Figure 3: Consecutive CycleGAN samples considered as baseline.
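For completeness, below is a small sketch of the frame-extraction and sampling step of the baseline, using OpenCV. The 300x300 resize and the stride of 10 follow the description above; the file names, the function name and the output format are hypothetical choices.

import os
import random
import cv2

def sample_training_frames(video_path, out_dir, stride=10, size=(300, 300)):
    # Extract all frames from the video and keep one random frame out of
    # every `stride` consecutive frames, resized to `size`.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    window, kept = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        window.append(frame)
        if len(window) == stride:
            chosen = random.choice(window)
            cv2.imwrite(os.path.join(out_dir, "%05d.png" % kept),
                        cv2.resize(chosen, size))
            kept += 1
            window = []
    cap.release()
    return kept

# Build the two unpaired training sets for the CycleGAN, e.g.:
# sample_training_frames("day.mp4", "trainA")
# sample_training_frames("night.mp4", "trainB")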
3.2 Method with postprocessing
Now we show the performance of a postprocessing method based on optical
flow, i.e. the pattern of apparent motion of objects, surfaces, and edges in
a visual scene caused by the relative motion between an observer and the
scene. Our idea is the following. For each image I of the translated
CycleGAN output, we compute the flow between that image and the other images
in the window [I−n, I+n], where n determines the length of the window w
(w = 2n + 1). Then we replace each pixel of I with the average of all the
corresponding pixels inside the window w, using the flow information.
The flow information between two images provides the correspondence
of the pixels between the two images. In a short video segment, these
corresponding pixels should have a similar color. So for each pixel in a
transferred image, we can use these correspondences to find the corresponding
pixels in the other transferred images. Then we can just compute the mean RGB value of all these
pixels to replace the original RGB value.
For example, suppose the size of image1 is [H, W]; then the flow file
contains a matrix F of size [H, W, 2]. F(:, :, 0) encodes the
displacement along the X axis, and F(:, :, 1) encodes the displacement along
the Y axis. So for a pixel P1 located at [x1, y1] in image1, its corresponding
pixel P2 in image2 will be at [x1 + F(x1, y1, 0), y1 + F(x1, y1, 1)]. We locate
the corresponding pixels of P1 in all the images inside the window w and then
we average them. The results with n = 4, and therefore w = 9, are shown in Figure 4.
Figure 4: The first row shows CycleGAN baseline outputs and the second row
shows the same images after postprocessing.
We tried different window sizes, but the results were always blurry and
temporally inconsistent, so we decided to focus on other ways to improve the
outputs of the network.
3.3 Flow-guided CycleGAN
We decided to investigate another approach, which would ensure a bet-
ter temporal consistency directly during the training process. To do so,
we used the optical flow information by embedding it inside the CycleGAN.
Normally the loss function of the CycleGAN is updated every time a new
image is processed. Instead, we process two images and update the loss
function at the end, in order to include inside the loss a part which de-
rives from the temporal relationship between two consecutive frames. In-
deed, we added a flow-estimation loss to the CycleGAN. To do so,
we changed the input from a single frame to two consecutive frames. For
example, we consider as input [D1, D2] and [N1, N2]. D1 and D2 are
two consecutive frames at daytime, and N1 and N2 are two consecutive
frames at nighttime. Then our loss function can be defined as

L(G, F) = CycleGANLoss(G, F) + E[ |Flow(G(F(N2)), N1) − Flow(N2, N1)| ],

where the additional term is a flow loss which makes sure that the generated
frames maintain flows similar to those of the original frames. We create a
triangle of flow losses to improve the temporal consistency: given the
original frames N1 and N2 and the generated frames N1' and N2', we include in
the loss function three distances, |Flow(N1', N2) − Flow(N1, N2)|,
|Flow(N1, N2') − Flow(N1, N2)|, and |Flow(N1', N2') − Flow(N1, N2)|. The
results are shown in Figure 5.
Figure 5: The first row shows CycleGAN baseline outputs and the second row
shows the same images obtained after adding the flow loss.
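Below is a minimal sketch of how such a flow-consistency term can be added on top of the standard CycleGAN objective, assuming a frozen or differentiable flow estimator flow_net that maps a pair of frames to a flow field; the function names and the equal weighting of the three distances are illustrative assumptions.

import torch

def flow_triangle_loss(flow_net, n1, n2, n1_fake, n2_fake):
    # n1, n2:           two consecutive real frames (e.g. night domain)
    # n1_fake, n2_fake: the corresponding generated frames
    # The flows involving generated frames should match the flow between
    # the two real frames, which enforces temporal consistency.
    ref = flow_net(n1, n2)
    loss = (flow_net(n1_fake, n2) - ref).abs().mean()
    loss = loss + (flow_net(n1, n2_fake) - ref).abs().mean()
    loss = loss + (flow_net(n1_fake, n2_fake) - ref).abs().mean()
    return loss

# Total objective (sketch): CycleGANLoss(G, F) plus the flow terms computed
# for both the day pair [D1, D2] and the night pair [N1, N2].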
In Figure 5 we highlight parts of the transferred images that have a better
level of detail than the baseline ones, but the main advantage of our new
method is that it makes the reconstructed video more stable. The improved
temporal consistency is hard to appreciate from still images, but the final
video noticeably outperforms the baseline according to human judgment.
Finally we wanted to analyze the generalization of the net on unseen
data. We tried different training datasets and techniques, like fine-tuning
or early stopping, to achieve the same results on different scenes and envi-
ronments which are not part of the training data. Our experiments showed
very clearly that the net doesn’t perform well when training and test data
have different structures or represent different scenes. Even when the
training data is composed of videos from only two different locations, the transferred
style learned by the net is somehow in between the two original ones and
performs poorly on the test data.
For this reason, we captured data from different locations which look
very similar in terms of street appearance, building design, trees and so on.
All of the training videos are taken from a point A to a point B, both for
daylight and night. Then we captured additional data from point B to a point C,
which we used as test data. We trained the net with 400 random frames
from each training video for 100 epochs. Training data is shown in Figures
6 and 7. The results are shown in Figures 8 and 9.
Figure 6: The first row shows daylight images from the training set in the
first location, the second row shows night images from the training set in
the first location.
Figure 7: The first row shows daylight images from the training set in the
second location, the second row shows night images from the training set in
the second location.
Figure 8: The first row shows images from the test set in the first location,
the second row shows the corresponding transferred images produced by the
CycleGAN with the flow loss term.
Figure 9: The first row shows images from the test set in the second location,
the second row shows the corresponding transferred images produced by the
CycleGAN with the flow loss term.
The daylight images are new to the net and the corresponding night images are
obtained by feeding them into the day-to-night generator. Our conclusions are
the following: we found that a big challenge in this task is that the CycleGAN
tends to overfit to particular features of the training set and is not able
to generalize well when the test data differs from the training data.
Moreover, when the training involves more than one video and the videos do
not have a similar structure, it fails to learn how to transform a test video
into its correct counterpart. In the latter case, it seems that the net
overfits to certain features from the different domains and mixes them all
together during the test phase.
4 Conclusions
In this report we demonstrated how it is possible to achieve better perfor-
mance using a CycleGAN on the video-to-video translation task. There is
still a lot which can be improved. First of all, we didn’t tune any hyperpa-
rameter and we kept the original net architecture. There are other possible
ideas which can be applied using the flow information: it can be computed
initially and used as input to the net, for example. Another approach could
be feeding two or more images as a single tensor and assuming that the net
will learn some internal representation of the temporal constraints between
the images. Other video-to-video translation papers have been published, and
it may also be interesting to apply concepts borrowed from them to a
CycleGAN, for example from [8].
References
[1] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative
image models using a laplacian pyramid of adversarial networks. In
Advances in neural information processing systems, pages 1486–1494,
2015.
[2] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style
transfer using convolutional neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
2414–2423, 2016.
[3] Mingming He, Jing Liao, Lu Yuan, and Pedro V Sander. Neural color
transfer between images. arXiv preprint arXiv:1710.00756, 2017.
[4] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh,
Pablo Sprechmann, and Yann LeCun. Disentangling factors of varia-
tion in deep representation using adversarial training. In Advances in
Neural Information Processing Systems, pages 5040–5048, 2016.
[5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised repre-
sentation learning with deep convolutional generative adversarial net-
works. arXiv preprint arXiv:1511.06434, 2015.
[6] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style
transfer for videos. In German Conference on Pattern Recognition,
pages 26–36. Springer, 2016.
[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, and Xi Chen. Improved techniques for training gans. In
Advances in Neural Information Processing Systems, pages 2234–2242,
2016.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew
Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv
preprint arXiv:1808.06601, 2018.
[9] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros.
Generative visual manipulation on the natural image manifold. In
European Conference on Computer Vision, pages 597–613. Springer,
2016.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Un-
paired image-to-image translation using cycle-consistent adversarial
networks. arXiv preprint arXiv:1703.10593, 2017.
[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Un-
paired image-to-image translation using cycle-consistent adversarial
networks. In IEEE International Conference on Computer Vision
(ICCV), 2017.