Video-to-Video Translation using
Cycle-Consistent Adversarial Networks
Alessandro Calmanovici
Supervisor: Zhaopeng Cui
Computer Vision and Geometry Group
Institute for Visual Computing
ETH Zurich
November 27, 2019
Image-to-image translation is the task of translating an image to
a different style or domain given paired or unpaired image examples
at training time. Video-to-video translation, however, is a harder task.
Translating a video requires not only learning the structural features
and appearance of objects and scenes, but also producing realistic,
temporally consistent transitions between consecutive frames. In this
report we explore new ideas and approaches to video-to-video translation
using existing image-to-image translation networks, in particular
CycleGANs. We investigate how a new loss term that takes into account
the flow information between two consecutive frames can improve
performance on the task. We focus on a specific style transfer:
translating a video from day to night and vice versa. We compare our
results to a baseline obtained by applying a standard CycleGAN
day-to-night translation to each frame of our dataset, and propose
further possible optimizations of the model.
1 Introduction
2 CycleGAN
3 Proposed Methods
3.1 Baseline method
3.2 Method with postprocessing
3.3 Flow-guided CycleGAN
4 Conclusions
List of Figures
1 Dataset samples
2 Baseline samples
3 Consecutive baseline samples
4 Postprocessing samples
5 CycleGAN with flow loss samples
6 First location training samples
7 Second location training samples
8 First location test samples
9 Second location test samples
1 Introduction
Recently, image-to-image translation has grown rapidly as a research
topic, and most approaches are based on deep neural networks. Gatys
et al. [2] perform image style transfer using convolutional neural networks,
which manage to separate image content from style and obtain explicit
representations of semantic information. In particular, they show how to
produce new images of high perceptual quality that combine the content
of an arbitrary photograph with the appearance of numerous well-known
artworks.
A similar attempt is [3], where the authors propose a new algorithm
for color transfer between images that have perceptually similar semantic
structure, optimizing a linear model with both local and global constraints.
Their method also exploits neural representations, which are deep features
extracted from a CNN encoding.
Another original idea is presented in [10]. The authors propose an approach
(CycleGAN) for learning to translate an image from a source domain X
to a target domain Y in the absence of paired training examples. Their
results show good performance on several tasks where paired training data
does not exist, including collection style transfer, object transfiguration,
season transfer, photo enhancement, etc. Again, they use convolutional
neural networks and deep neural networks to extract features from the two
domains.
Video-to-video translation, however, is a harder problem. The direct
application of image-based approaches to videos may lead to many
inconsistencies, one of the major issues being the lack of explicit
information about temporal constraints between images during the training
process. What’s more, most image-based methods require paired data,
and matching frames between videos is still an open problem.
In [6], Ruder et al. present an approach that transfers the style from
one image (for example, a painting) to a whole video sequence. Processing
each frame of the video independently leads to flickering and false
discontinuities, since the solution of the style transfer task is not stable. To
tackle this problem, they introduce a temporal constraint that penalizes
deviations between two frames using the optical flow from the original
video. Besides that, they also initialize the optimization for frame i + 1
with the stylized frame i. Very recently, Wang et al. [8] proposed a novel
video-to-video synthesis approach under the generative adversarial learning
framework which requires paired input training data. They achieve
high-resolution, photorealistic, temporally coherent video results on a diverse
set of input formats including segmentation masks, sketches, and poses.
This project instead aims to present an approach to learn how to translate
a video from a source domain X to a target domain Y, without the
constraint of having precisely paired X and Y inputs. The first steps are
attempts that simply apply image-based approaches directly to videos,
followed by post-processing techniques, and we show the poor results
obtained. Then we propose a well-performing solution for the video-to-video
translation task which avoids frame matching between videos by using
cycle-consistent adversarial networks. We show how to improve an
image-based CycleGAN translation from domain A to domain B by proposing a
new term in the network's loss function. The new term makes use of
flow information between consecutive frames of a video to ensure better
temporal consistency of the produced outputs.
2 CycleGAN
In this section I describe in more detail the CycleGAN architecture and
the general idea behind it. Most of what I write here is based on the
original paper [10]. The supporting structure of a CycleGAN, as the name
itself suggests, is a GAN, a Generative Adversarial Network. GANs have
achieved impressive results in image generation [1,5], image editing [9], and
representation learning [4,5,7]. The main feature of GANs is the concept of
"adversarial loss", which in theory pushes the generated images
to be indistinguishable from the real ones. These nets consist of
a generator network and a discriminator network, which compete
against each other: the generator tries to trick the discriminator by
creating images more and more similar to the real ones, while the discriminator
learns over time to distinguish the real images from the generator's fake
ones. CycleGANs exploit the adversarial loss and implement another key
idea, cycle consistency. The idea of using transitivity as a way to
regularize structured data has a long history. In visual tracking, enforcing
simple forward-backward consistency has been a standard trick for decades
[44]. In this case the authors introduce a loop in the GAN architecture to
ensure that, starting from a generated image, it is possible to retrieve
the original image that was the input of the net. The following image shows
the high level flow of the CycleGAN:
We are given one set of images in domain X and a different set in domain
Y. Training a mapping G : X → Y such that the output ŷ = G(x), x ∈
X, is indistinguishable from images y ∈ Y “does not guarantee that the
individual inputs and outputs x and y are paired up in a meaningful way:
there are infinitely many mappings G that will induce the same distribution
over y”. The model learns two mapping functions G : X → Y and F : Y → X
with their associated discriminators D_Y and D_X. As described before, the
discriminators aim to distinguish a real image from a generated one. The
other two images, (b) and (c), represent the idea of the cycle-consistency
loss. When the input x is fed to the generator G, it produces an image
ŷ which D_Y cannot distinguish from an original image belonging to
domain Y. In the same way, the generator F learns to transform an image
y into a fake x̂. The cycle loss ensures that from the two fake images it is
still possible to go back to the original inputs. The generator G is
therefore also applied to x̂ to produce another image in Y, which should be
as close as possible to the original y. The same applies, vice versa, to
generator F applied to ŷ. In formula:
x → G(x) → F(G(x)) ≈ x and y → F(y) → G(F(y)) ≈ y
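For reference, the two cycles above correspond to the cycle-consistency loss of the original CycleGAN paper [10], which is added to the two adversarial losses. The notation below follows that paper; the weight λ on the cycle term is a hyperparameter whose value is not stated in this report.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{cyc}}(G, F) &= \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
  + \lambda\, \mathcal{L}_{\mathrm{cyc}}(G, F)
\end{aligned}
```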
3 Proposed Methods
We tried several approaches to improve the quality of video translation
using a CycleGAN. We started with basic postprocessing methods applied
to CycleGAN outputs, which didn’t lead to good results. A good video
translation quality was instead obtained by directly modifying the structure
of the CycleGAN itself. Here we describe our experiments and show the
results for each of them.
The goal of the task is to translate a video from daylight to night as
well as possible according to human evaluation. All the experiments below
therefore have this purpose. The CycleGAN architecture used is the one
published in [11]. The dataset used for the experiments is composed of two
videos, both recorded manually with a phone and 59 seconds long: the scene is
exactly the same for both, the first one recorded in daylight and the
second one at night. In Figure 1 we show some sample images of the dataset.
The images are resized to 300x300 before being processed by the net, and
the outputs are 300x300 as well. The reason is that CycleGAN is quite
memory-intensive, as four networks (two generators and two discriminators)
need to be loaded on one GPU, so a large image cannot be entirely loaded.
Figure 1: Dataset samples: first row shows daylight images and second row
shows night images
3.1 Baseline method
Our baseline is the following. We extract all the frames from both day
and night videos, then we sample one random frame every 10 frames for
both of them, which results in 196 images for daylight and 196 images for
night. The sampled images are the training data for the CycleGAN. We
train the net on this data for 200 epochs. We get two new generators A
and B which have learned how to transform a day image into a night image
and vice versa. We then apply generator A to all the frames extracted
from the daylight video and create a new night video out of them, which
is compared to the original night video. In Figure 2 we show three original
images and the corresponding translated ones, while in Figure 3 we show four
consecutive frames translated by the CycleGAN.
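The sampling step of the baseline could be sketched as follows. This is only an illustration under assumptions: frames are identified by their index, the function name `sample_one_per_block` and the fixed seed are hypothetical, and the total frame count is not given in the report (a hypothetical count of 1960 frames would yield the 196 samples mentioned above).

```python
import random


def sample_one_per_block(num_frames, block_size=10, seed=0):
    """Pick one random frame index from each consecutive block of frames."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    picked = []
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)  # last block may be shorter
        picked.append(rng.randrange(start, end))
    return picked


# Hypothetical frame count, chosen so that 196 indices are produced.
indices = sample_one_per_block(1960)
print(len(indices))
```

The same function would be applied separately to the day and night videos, since the two training sets do not need to be paired.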
Figure 2: Baseline samples: first row shows ground truth images and second
row shows CycleGAN outputs
Figure 3: Consecutive CycleGAN samples considered as baseline.
3.2 Method with postprocessing
Now we show the performance of a postprocessing method based on optical
flow, which is the pattern of apparent motion of objects, surfaces, and
edges in a visual scene caused by the relative motion between an observer
and a scene. Our idea is the following. For each image I of the translated
CycleGAN output, we compute the flow between I and each of the other
images in the range [I-n, I+n], where n determines the length w = 2n + 1 of
the window. Then we replace each pixel of I with the average of all the
corresponding pixels inside the window w, using the flow information.
The flow information between two images provides the correspondence
of the pixels between the two images. In a short video, these corresponding
pixels should have similar colors. So for each pixel in a transferred image, we
can use this correspondence to find the corresponding pixels in the other
transferred images. Then we can just compute the mean RGB value of all these
pixels to replace the original RGB value.
For example, suppose the size of image1 is [H, W]; then the flow file will
contain a matrix F of size [H, W, 2]. F(:, :, 0) encodes the
displacement along the X axis, and F(:, :, 1) encodes the displacement along
the Y axis. So for a pixel P1 located at [x1, y1] in image1, its corresponding
pixel P2 in image2 will be at [x1+F(x1, y1, 0), y1+F(x1, y1, 1)]. We locate the
corresponding pixels of P1 in all the images inside the window w and then
we average them. The results with n = 4, and so w = 9, are shown in Figure 4.
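The averaging described above could be sketched in NumPy as follows. This is a minimal illustration, not the code used for the experiments: the helper names are hypothetical, the flows are assumed to be precomputed by some optical flow estimator, nearest-pixel lookup is an assumption, and arrays are indexed as [row, column], so the flow at pixel (x, y) is read from `flow[y, x]`.

```python
import numpy as np


def warp_nearest(image, flow):
    """Sample `image` at the positions that `flow` points to (nearest pixel).

    image: [H, W, 3] array; flow: [H, W, 2] array with flow[..., 0] the
    X (column) displacement and flow[..., 1] the Y (row) displacement.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Round to the nearest pixel and clamp to the image border.
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[y2, x2]


def window_average(frames, flows_to_neighbours, i, n):
    """Average frame i with its flow-aligned neighbours in [i-n, i+n].

    flows_to_neighbours[j] is assumed to hold the flow from frame i to frame j.
    """
    acc = frames[i].astype(np.float64)
    count = 1
    for j in range(max(0, i - n), min(len(frames), i + n + 1)):
        if j == i:
            continue
        acc += warp_nearest(frames[j], flows_to_neighbours[j])
        count += 1
    return acc / count
```

Averaging colors over several frames in this way acts as a temporal low-pass filter, which is consistent with the blur reported for this method.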
Figure 4: First row shows CycleGAN baseline outputs and second row shows
the same images after postprocessing.
We tried different window sizes but the results were always blurry
and not consistent, so we decided to focus on other ways to improve the
outputs of the network.
3.3 Flow-guided CycleGAN
We decided to investigate another approach, which would ensure better
temporal consistency directly during the training process. To do so,
we used the optical flow information by embedding it inside the CycleGAN.
Normally the loss function of the CycleGAN is evaluated every time a new
image is processed. Instead, we process two images and evaluate the loss
function at the end, in order to include in the loss a part which derives
from the temporal relationship between two consecutive frames. To add
this flow estimation loss to the CycleGAN, we changed the input from a
single frame to two consecutive frames. For
example, we consider as input $[D_1, D_2]$ and $[N_1, N_2]$. $D_1$ and $D_2$ are
two consecutive frames at daytime, and $N_1$ and $N_2$ are two consecutive
frames at nighttime. Then our loss function can be defined as
$L(G, F) = \mathrm{CycleGANLoss}(G, F) + \mathbb{E}\left[\,|\mathrm{Flow}(G(F(N_2)), N_1) - \mathrm{Flow}(N_2, N_1)|\,\right]$,
where the last term is the flow loss, which makes sure that the generated
frames maintain flows similar to those of the original frames. We create a
triangle of flow losses to improve the temporal consistency. Given the
original frames $N_1$ and $N_2$ and the generated frames $N_1'$ and $N_2'$,
we include in the loss function three distances:
$|\mathrm{Flow}(N_1', N_2) - \mathrm{Flow}(N_1, N_2)|$,
$|\mathrm{Flow}(N_1, N_2') - \mathrm{Flow}(N_1, N_2)|$, and
$|\mathrm{Flow}(N_1', N_2') - \mathrm{Flow}(N_1, N_2)|$. The results are shown
in Figure 5.
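The triangle of flow distances could be sketched as below. This is an illustration under assumptions, not the training code: `flow_fn` is a placeholder for any optical flow estimator returning an [H, W, 2] array, the mean absolute difference is one possible reading of the distance, and equal weighting of the three terms is assumed. In actual training the flow estimator would have to be differentiable so that gradients reach the generators; NumPy is used here only to show the structure of the term.

```python
import numpy as np


def flow_triangle_loss(flow_fn, n1, n2, n1_gen, n2_gen):
    """Sum of three mean absolute flow differences against Flow(n1, n2).

    flow_fn(a, b) is a hypothetical callable standing in for an optical
    flow estimator; it is not a real library function.
    """
    target = flow_fn(n1, n2)  # flow between the two original frames
    pairs = [(n1_gen, n2), (n1, n2_gen), (n1_gen, n2_gen)]
    return sum(np.mean(np.abs(flow_fn(a, b) - target)) for a, b in pairs)
```

When the generated frames equal the original ones, every pair yields the target flow and the term vanishes, so the loss only penalizes changes in apparent motion introduced by the generators.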
Figure 5: First row shows CycleGAN baseline outputs and second row shows
the same images computed after adding the flow loss.
We highlighted different parts of the transferred images which have a
better level of detail than the baseline ones, but the main advantage of
our new method is that it makes the reconstructed video more stable. Still
images are not well suited to showing the improved temporal consistency of
the video, but the final result noticeably outperforms the baseline
according to human judgment.
Finally, we wanted to analyze how well the net generalizes to unseen
data. We tried different training datasets and techniques, like fine-tuning
or early stopping, to achieve the same results on different scenes and
environments which are not part of the training data. Our experiments showed
very clearly that the net doesn’t perform well when training and test data
have different structures or represent different scenes. Even when the
training data is composed of videos from only two different locations, the
transferred style learned by the net is somewhere in between the two original
ones and performs poorly on the test data.
For this reason, we captured data from different locations which look
very similar in terms of street appearance, building design, trees and so on.
All of the training videos are taken from a point A to a point B, both for
daylight and night. Then we captured additional data from point B to C,
which we used for test data. We trained the net with 400 random frames
from each training video for 100 epochs. Training data is shown in Figures
6 and 7. The results are shown in Figures 8 and 9.
Figure 6: First row shows daylight images from the training set in the first
location, second row shows night images from the training set in the first
location
Figure 7: First row shows daylight images from the training set in the
second location, second row shows night images from the training set in the
second location
Figure 8: First row shows images from the test set in the first location,
second row shows the corresponding transferred images using CycleGAN
with a flow loss term
Figure 9: First row shows images from the test set in the second location,
second row shows the corresponding transferred images using CycleGAN
with a flow loss term
The daylight images are new to the net and the corresponding night
images are obtained by feeding them into the day-to-night generator. Our
conclusions are the following: we found that a big challenge in this task
is that the CycleGAN tends to overfit on particular features of the
training set and is not able to generalize well when the test data differ from
the training data. What’s more, when the training involves more than one
video and the videos don’t have similar structure, it fails to learn how to
transform a test video into its correct counterpart. In the latter case, it
seems that the net overfits on certain features from the different domains
and mixes them all together during the test phase.
4 Conclusions
In this report we demonstrated how it is possible to achieve better
performance using a CycleGAN on the video-to-video translation task. There is
still a lot that can be improved. First of all, we didn’t tune any
hyperparameter and we kept the original net architecture. There are other
possible ideas which can be applied using the flow information: it can be
computed initially and used as input to the net, for example. Another approach
could be feeding two or more images as a single tensor and assuming that the
net will learn some internal representation of the temporal constraints between
the images. Other video-to-video translation papers have been published
and it may also be interesting to apply concepts taken from
them to a CycleGAN, for example from [8].
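The idea of feeding two images as a single tensor could look like the sketch below, which stacks consecutive frames along the channel axis. The helper name `stack_frames` is hypothetical and this variant was not evaluated in the report; the first layer of the generator would also have to accept six input channels instead of three.

```python
import numpy as np


def stack_frames(frames):
    """Concatenate [H, W, 3] frames along the channel axis into one tensor."""
    return np.concatenate(frames, axis=-1)


# Two placeholder 300x300 RGB frames standing in for consecutive video frames.
d1 = np.zeros((300, 300, 3), dtype=np.float32)
d2 = np.ones((300, 300, 3), dtype=np.float32)
pair = stack_frames([d1, d2])
print(pair.shape)  # (300, 300, 6)
```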
References
[1] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative
image models using a laplacian pyramid of adversarial networks. In
Advances in Neural Information Processing Systems, pages 1486–1494,
2015.
[2] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style
transfer using convolutional neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
2414–2423, 2016.
[3] Mingming He, Jing Liao, Lu Yuan, and Pedro V Sander. Neural color
transfer between images. arXiv preprint arXiv:1710.00756, 2017.
[4] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh,
Pablo Sprechmann, and Yann LeCun. Disentangling factors of varia-
tion in deep representation using adversarial training. In Advances in
Neural Information Processing Systems, pages 5040–5048, 2016.
[5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised repre-
sentation learning with deep convolutional generative adversarial net-
works. arXiv preprint arXiv:1511.06434, 2015.
[6] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style
transfer for videos. In German Conference on Pattern Recognition,
pages 26–36. Springer, 2016.
[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, and Xi Chen. Improved techniques for training gans. In
Advances in Neural Information Processing Systems, pages 2234–2242,
2016.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew
Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv
preprint arXiv:1808.06601, 2018.
[9] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros.
Generative visual manipulation on the natural image manifold. In
European Conference on Computer Vision, pages 597–613. Springer,
2016.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Un-
paired image-to-image translation using cycle-consistent adversarial
networks. arXiv preprint, 2017.
[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros.
Unpaired image-to-image translation using cycle-consistent adversarial
networks. In Computer Vision (ICCV), 2017 IEEE International
Conference on, 2017.

IRJET- Transformation of Realistic Images and Videos into Cartoon Images and ...
Cartoonization of images using machine Learning
Cartoonization of images using machine LearningCartoonization of images using machine Learning
Cartoonization of images using machine Learning
Unpaired Image Translations Using GANs: A Review
Unpaired Image Translations Using GANs: A ReviewUnpaired Image Translations Using GANs: A Review
Unpaired Image Translations Using GANs: A Review
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesis
Multiple Style-Transfer in Real-Time
Multiple Style-Transfer in Real-TimeMultiple Style-Transfer in Real-Time
Multiple Style-Transfer in Real-Time
IRJET - Applications of Image and Video Deduplication: A Survey
IRJET -  	  Applications of Image and Video Deduplication: A SurveyIRJET -  	  Applications of Image and Video Deduplication: A Survey
IRJET - Applications of Image and Video Deduplication: A Survey
Implementing Neural Style Transfer
Implementing Neural Style Transfer Implementing Neural Style Transfer
Implementing Neural Style Transfer
Recognition and tracking moving objects using moving camera in complex scenes
Recognition and tracking moving objects using moving camera in complex scenesRecognition and tracking moving objects using moving camera in complex scenes
Recognition and tracking moving objects using moving camera in complex scenes
IJCER ( International Journal of computational Engineerin...
IJCER ( International Journal of computational Engineerin...IJCER ( International Journal of computational Engineerin...
IJCER ( International Journal of computational Engineerin...
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
Performance analysis on color image mosaicing techniques on FPGA
Performance analysis on color image mosaicing techniques on FPGAPerformance analysis on color image mosaicing techniques on FPGA
Performance analysis on color image mosaicing techniques on FPGA
Real Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep LearningReal Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep Learning
An improved image compression algorithm based on daubechies wavelets with ar...
An improved image compression algorithm based on daubechies  wavelets with ar...An improved image compression algorithm based on daubechies  wavelets with ar...
An improved image compression algorithm based on daubechies wavelets with ar...
Log polar coordinates
Log polar coordinatesLog polar coordinates
Log polar coordinates
Automated Neural Image Caption Generator for Visually Impaired People
Automated Neural Image Caption Generator for Visually Impaired PeopleAutomated Neural Image Caption Generator for Visually Impaired People
Automated Neural Image Caption Generator for Visually Impaired People
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
IJCER ( International Journal of computational Engineeri...
 IJCER ( International Journal of computational Engineeri... IJCER ( International Journal of computational Engineeri...
IJCER ( International Journal of computational Engineeri...

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect

Video to Video Translation CGAN

  • 1. Video-to-Video Translation using Cycle-Consistent Adversarial Networks
    Alessandro Calmanovici
    Supervisor: Zhaopeng Cui
    Computer Vision and Geometry Group, Institute for Visual Computing, ETH Zurich
    November 27, 2019

    Abstract

    Image-to-image translation is the task of translating an image to a different style or domain given paired or unpaired image examples at training time. Video-to-video translation, however, is a harder task. Translating a video requires not only learning the structural features and appearance of objects and scenes, but also producing realistic transitions and temporally consistent passages between consecutive frames. In this report we explore new ideas and approaches to video-to-video translation using existing image-to-image translation networks, in particular CycleGANs. We investigate how a new loss term that takes into account the flow information between two consecutive frames can improve performance on the task. We focus on a specific style transfer: translating a video from day to night and vice versa. We compare our results to a baseline obtained by applying a standard day-to-night CycleGAN to each frame of our dataset, and we propose further possible optimizations of the model.
  • 2. Contents
    1 Introduction 4
    2 CycleGAN 5
    3 Proposed Methods 6
    3.1 Baseline method 7
    3.2 Method with postprocessing 9
    3.3 Flow-guided CycleGAN 10
    4 Conclusions 14
  • 3. List of Figures
    1 Dataset samples 7
    2 Baseline samples 8
    3 Consecutive baseline samples 8
    4 Postprocessing samples 10
    5 CycleGAN with flow loss samples 11
    6 First location training samples 12
    7 Second location training samples 13
    8 First location test samples 13
    9 Second location test samples 14
  • 4. 1 Introduction

    Image-to-image translation has recently grown rapidly as a research topic, and most approaches are based on deep neural networks. Gatys et al. [2] perform image style transfer using convolutional neural networks, which manage to separate image content from style and obtain explicit representations of semantic information. In particular, they show how to produce new images of high perceptual quality that combine the content of an arbitrary photograph with the appearance of numerous well-known artworks. A similar attempt is [3], where the authors propose a new algorithm for color transfer between images that have perceptually similar semantic structure, optimizing a linear model with both local and global constraints. Their method also exploits neural representations, which are deep features extracted from a CNN encoding. Another original idea is presented in [10]. The authors present an approach (CycleGAN) for learning to translate an image from a source domain X to a target domain Y in the absence of paired training examples. Their results show good performance on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, and photo enhancement. Again, they use convolutional neural networks and deep neural networks to extract features from the two domains.

    Video-to-video translation, however, is a harder problem. Applying image-based approaches directly to videos may lead to many inconsistencies, one of the major issues being the lack of explicit information about temporal constraints between images during the training process. Moreover, most image-based methods require paired data, and matching frames between videos is still an open problem. In [6], Ruder et al. presented an approach that transfers the style from one image (for example, a painting) to a whole video sequence.
    Processing each frame of the video independently leads to flickering and false discontinuities, since the solution of the style transfer task is not stable. To tackle this problem, they introduce a temporal constraint that penalizes deviations between two frames using the optical flow from the original video. They also initialize the optimization for frame i + 1 with the stylized frame i. Very recently, Wang et al. [8] proposed a novel video-to-video synthesis approach under the generative adversarial learning framework, which requires paired input training data. They achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses.
  • 5. This project instead aims to present an approach that learns to translate a video from a source domain X to a target domain Y, without the constraint of having precisely paired X and Y inputs. Our first attempts simply apply image-based approaches directly to videos, followed by postprocessing, and we show the poor results obtained. We then propose a well-performing solution for the video-to-video translation task that avoids frame matching between videos by using cycle-consistent adversarial networks. We show how to improve an image-based CycleGAN translation from domain A to domain B by adding a new term to the network's loss function. The new term uses flow information between consecutive frames of a video to ensure better temporal consistency of the produced outputs.

    2 CycleGAN

    In this section I describe in more detail the CycleGAN architecture and the general idea behind it. Most of what I write here is based on the original paper [10]. The supporting structure of a CycleGAN, as the name itself suggests, is a GAN, a Generative Adversarial Network. GANs have achieved impressive results in image generation [1,5], image editing [9], and representation learning [4,5,7]. The main feature of GANs is the concept of "adversarial loss", which guarantees - theoretically - that the generated images are indistinguishable from the real ones. The structure of these nets is based on a generator network and a discriminator network, which compete against each other: the generator tries to trick the discriminator by creating images more and more similar to the real ones, while the discriminator learns over time how to distinguish the real images from the generator's fake ones.
    CycleGANs exploit the adversarial loss and add another key idea, cycle consistency. The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [44]. In this case the authors introduce a loop in the GAN architecture to ensure that, starting from a fake generated image, it is possible to retrieve the original image that was the input of the net. The following image shows the high-level flow of the CycleGAN:
  • 6. We are given one set of images in domain X and a different set in domain Y. Training a mapping $G : X \to Y$ such that the output $\hat{y} = G(x)$, $x \in X$, is indistinguishable from images $y \in Y$, "does not guarantee that the individual inputs and outputs x and y are paired up in a meaningful way - there are infinitely many mappings G that will induce the same distribution over y". The model learns two mapping functions $G : X \to Y$ and $F : Y \to X$ with their associated discriminators $D_Y$ and $D_X$. As described before, the discriminators aim to distinguish a real image from a generated one. The other two images, (b) and (c), represent the idea of the cycle-consistency loss. When the input x is fed to the generator G, it produces an image $\hat{y}$ which $D_Y$ cannot distinguish from an original image belonging to domain Y. In the same way, the generator F learns to transform an image y into a fake $\hat{x}$. The cycle loss ensures that from the two fake images it is still possible to go back to the original inputs. The generator G is therefore also applied to $\hat{x}$ to produce a reconstruction that should be as close as possible to the original y. The same applies, vice versa, to generator F. In formulas, $x \to G(x) \to F(G(x)) \approx x$ and $y \to F(y) \to G(F(y)) \approx y$.

    3 Proposed Methods

    We tried several approaches to improve the quality of video translation using a CycleGAN. We started with basic postprocessing methods applied to CycleGAN outputs, which did not lead to good results. Good video translation quality was instead obtained by directly modifying the structure of the CycleGAN itself. Here we describe our experiments and show the results for each of them. The goal of the task is to translate a video from daylight to night as well as possible according to human evaluation. All the experiments below therefore serve this purpose. The CycleGAN architecture used is the one
  • 7. published in [11]. The dataset used for the experiments is composed of two videos, both recorded by hand with a phone and both 59 seconds long: the scene is exactly the same in both, but the first one is recorded in daylight and the second one at night. In Figure 1 we show some sample images from the dataset. The images are resized to 300x300 before being processed by the net, and the outputs are 300x300 as well. The reason is that CycleGAN is quite memory-intensive, as four networks (two generators and two discriminators) need to be loaded on one GPU, so a large image cannot be entirely loaded.

    Figure 1: Dataset samples: first row shows daylight images and second row shows night images

    3.1 Baseline method

    Our baseline is the following. We extract all the frames from both the day and night videos, then we sample one random frame out of every 10 frames from each of them, which results in 196 images for daylight and 196 images for night. The sampled images are the training data for the CycleGAN. We train the net on this data for 200 epochs. We get two new generators A and B which have learned how to transform a day image into a night image
  • 8. and vice versa. We then apply generator A to all the frames extracted from the daylight video and create a new night video out of them, which is compared to the original night video. In Figure 2 we show three original images and the corresponding translated ones, while in Figure 3 we show four consecutive frames translated by the CycleGAN.

    Figure 2: Baseline samples: first row shows ground truth images and second row shows CycleGAN outputs

    Figure 3: Consecutive CycleGAN samples considered as baseline.
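    The sampling step of the baseline (one random frame out of every 10) can be sketched as below. This is a minimal illustration, not the author's script: the function name `sample_one_per_block`, the fixed seed, and the total of 1960 frames are assumptions chosen so that the result matches the 196 images reported above.

```python
import random

def sample_one_per_block(num_frames, block=10, seed=0):
    """Pick one random frame index from each consecutive block of `block` frames."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    indices = []
    for start in range(0, num_frames, block):
        end = min(start + block, num_frames)  # the last block may be shorter
        indices.append(rng.randrange(start, end))
    return indices

# Assumed frame count (1960) chosen to reproduce the 196 samples in the text.
picked = sample_one_per_block(1960)
print(len(picked))  # 196
```

    Sampling per block rather than uniformly over the whole video keeps the training images spread evenly along the recording.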
  • 9. 3.2 Method with postprocessing

    Now we show the performance of a postprocessing method based on optical flow, which is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. Our idea is the following. For each image I of the translated CycleGAN output, we compute the flow between I and each of the other images in the range [I-n, I+n], where n determines the length of the window w. Then we replace each pixel of I with the average of all the corresponding pixels inside the window w, using the flow information.

    The flow between two images provides the correspondence of the pixels between them. In a short video, these corresponding pixels should have similar color. So for each pixel in a transferred image, we can use these correspondences to find the corresponding pixels in other transferred images. Then we can just compute the mean RGB value of all these pixels to replace the original RGB value. For example, suppose the size of image1 is [H, W]; then the flow file contains a matrix F of size [H, W, 2]. F(:, :, 0) encodes the displacement along the X axis, and F(:, :, 1) encodes the displacement along the Y axis. So for a pixel P1 located at [x1, y1] in image1, its correspondence P2 in image2 is located at [x1+F(x1, y1, 0), y1+F(x1, y1, 1)]. We locate the corresponding pixels of P1 in all the images inside the window w and then we average them. The results with n = 4, and so w = 9, are shown in Figure 4.
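    The correspondence lookup and window averaging described above can be sketched with NumPy. This is only an illustration under stated assumptions: the function names, the nearest-neighbour rounding, the clamping at the image border, and the `flows_to_neighbours` input (flows assumed to be precomputed by some optical-flow estimator) are not from the report. The flow array is indexed as `flow[row, column]`, with channel 0 holding the X displacement and channel 1 the Y displacement.

```python
import numpy as np

def gather_correspondences(image2, flow):
    """For every pixel of image1, fetch the colour of its match in image2.

    flow has shape [H, W, 2]: flow[..., 0] is the X (column) displacement and
    flow[..., 1] the Y (row) displacement from image1 to image2.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Assumption: round to the nearest pixel and clamp to the image border.
    x2 = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image2[y2, x2]

def temporal_average(frames, flows_to_neighbours, i, n):
    """Average frame i with its flow-aligned neighbours in [i-n, i+n].

    flows_to_neighbours[j] is the flow from frame i to frame j (hypothetical
    input; the entry at index i itself is ignored).
    """
    acc = frames[i].astype(np.float64)
    count = 1
    for j in range(max(0, i - n), min(len(frames), i + n + 1)):
        if j == i:
            continue
        acc += gather_correspondences(frames[j], flows_to_neighbours[j])
        count += 1
    return acc / count
```

    With n = 4 the loop visits up to 8 neighbours, so each output pixel is the mean of up to w = 9 values. Averaging colours along imperfect correspondences is one plausible reason for the blur reported for this method.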
  • 10. Figure 4: First row shows CycleGAN baseline outputs and second row shows the same images after postprocessing.

    We tried different window sizes, but the results were always blurry and inconsistent, so we decided to focus on other ways to improve the outputs of the network.

    3.3 Flow-guided CycleGAN

    We decided to investigate another approach, which would ensure better temporal consistency directly during the training process. To do so, we used the optical flow information by embedding it inside the CycleGAN. Normally the loss of the CycleGAN is evaluated every time a new image is processed. Instead, we process two images and evaluate the loss at the end, so that the loss can include a part derived from the temporal relationship between two consecutive frames. In other words, we added a flow estimation loss to the CycleGAN. To do so, we changed the input from a single frame to two consecutive frames. For example, we consider as input $[D_1, D_2]$ and $[N_1, N_2]$, where $D_1$ and $D_2$ are two consecutive frames at daytime, and $N_1$ and $N_2$ are two consecutive frames at nighttime. Then our loss function can be defined as

    $$L(G, F) = \mathrm{CycleGANLoss}(G, F) + \mathbb{E}\big[\,\lvert \mathrm{Flow}(G(F(N_2)), N_1) - \mathrm{Flow}(N_2, N_1) \rvert\,\big],$$

    where
  • 11. the last two losses are for the flows, which make sure that the generated frames maintain flows similar to those of the original frames. We create a triangle of flow losses to improve the temporal consistency. Given the original frames $N_1$ and $N_2$ and the generated frames $N_1'$ and $N_2'$, we include three distances in the loss function: $\lvert \mathrm{Flow}(N_1', N_2) - \mathrm{Flow}(N_1, N_2) \rvert$, $\lvert \mathrm{Flow}(N_1, N_2') - \mathrm{Flow}(N_1, N_2) \rvert$, and $\lvert \mathrm{Flow}(N_1', N_2') - \mathrm{Flow}(N_1, N_2) \rvert$. The results are shown in Figure 5.

    Figure 5: First row shows CycleGAN baseline outputs and second row shows the same images computed after adding the flow loss.

    We highlighted different parts of the transferred images which have a better level of detail than the baseline ones, but the main advantage of our new method is that it makes the reconstructed video more stable. Still images are not ideal for appreciating the improved temporal consistency, but the final video noticeably outperforms the baseline when judged by eye.

    Finally, we wanted to analyze how well the net generalizes to unseen data. We tried different training datasets and techniques, like fine-tuning or early stopping, to achieve the same results on different scenes and environments which are not part of the training data. Our experiments showed very clearly that the net does not perform well when training and test data
  • 12. have different structures or represent different scenes. Even if the training data is composed of videos from only two different locations, the style learned by the net ends up somewhere in between the two original ones and performs poorly on the test data. For this reason, we captured data from different locations that look very similar in terms of street appearance, building design, trees and so on. All of the training videos are taken from a point A to a point B, both in daylight and at night. We then captured additional data from point B to a point C, which we used as test data. We trained the net with 400 random frames from each training video for 100 epochs. Training data is shown in Figures 6 and 7. The results are shown in Figures 8 and 9.
    Figure 6: The first row shows daylight images from the training set in the first location; the second row shows night images from the training set in the first location.
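    As a rough illustration of the triangle of flow losses from Section 3.3, the sketch below combines the three distances for one pair of original frames (N1, N2) and one pair of generated frames (N1’, N2’). It is not the implementation used for the experiments. The names `toy_flow` and `flow_triangle_loss` are hypothetical, the expectation is approximated by a per-pixel mean, and the three distances are summed with equal weights, which is an assumption. `toy_flow` is only a placeholder (a frame difference, not an optical flow method); a real flow estimator with the same call signature would be passed in as `flow_fn`.

```python
import numpy as np


def toy_flow(frame_a, frame_b):
    """Placeholder for a flow estimator (assumption): per-pixel change.

    This is NOT real optical flow; it only stands in for Flow(a, b) so the
    sketch runs without extra dependencies.
    """
    return frame_b.astype(np.float64) - frame_a.astype(np.float64)


def flow_triangle_loss(n1, n2, n1_gen, n2_gen, flow_fn=toy_flow):
    """Sum of the three flow distances, each averaged over all pixels.

    n1, n2         : original consecutive frames
    n1_gen, n2_gen : generated counterparts (N1', N2')
    flow_fn        : callable taking two frames and returning a flow field
    """
    reference = flow_fn(n1, n2)  # Flow(N1, N2)
    d1 = np.mean(np.abs(flow_fn(n1_gen, n2) - reference))      # Flow(N1', N2)
    d2 = np.mean(np.abs(flow_fn(n1, n2_gen) - reference))      # Flow(N1, N2')
    d3 = np.mean(np.abs(flow_fn(n1_gen, n2_gen) - reference))  # Flow(N1', N2')
    return d1 + d2 + d3


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n1, n2 = rng.random((8, 8)), rng.random((8, 8))
    # Generated frames identical to the originals give zero flow loss.
    print(flow_triangle_loss(n1, n2, n1, n2))
```

    For training, the same computation would have to be written with differentiable operations in the deep learning framework used for the CycleGAN, so that gradients can reach the generators.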
  • 13. Figure 7: The first row shows daylight images from the training set in the second location; the second row shows night images from the training set in the second location.
    Figure 8: The first row shows images from the test set in the first location; the second row shows the corresponding transferred images using CycleGAN with a flow loss term.
  • 14. Figure 9: The first row shows images from the test set in the second location; the second row shows the corresponding transferred images using CycleGAN with a flow loss term.
    The daylight images are new to the net, and the corresponding night images are obtained by feeding them into the day-to-night generator. Our conclusions are the following. A big challenge in this task is that the CycleGAN tends to overfit on particular features of the training set and is not able to generalize well when the test data differs from the training data. Moreover, when the training involves more than one video and the videos don’t have a similar structure, it fails to learn how to transform a test video into its correct counterpart. In the latter case, it seems that the net overfits on certain features from the different domains and mixes them all together during the test phase.
    4 Conclusions
    In this report we demonstrated how it is possible to achieve better performance using a CycleGAN on the video-to-video translation task. There is still a lot that can be improved. First of all, we didn’t tune any hyperparameters and we kept the original net architecture. There are other possible ways to use the flow information: for example, it could be computed beforehand and used as input to the net. Another approach could be feeding two or more images as a single tensor, assuming that the net will
  • 15. learn some internal representation of the temporal constraints between the images. Other video-to-video translation papers have been published, and it may also be interesting to apply concepts from them to a CycleGAN, for example from [8].
    References
    [1] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
    [2] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
    [3] Mingming He, Jing Liao, Lu Yuan, and Pedro V Sander. Neural color transfer between images. arXiv preprint arXiv:1710.00756, 2017.
    [4] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
    [5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
    [6] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. In German Conference on Pattern Recognition, pages 26–36. Springer, 2016.
    [7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
    [8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
  • 16. [9] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
    [10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
    [11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.