FAST VIDEO ARTISTIC TRANSFER VIA MOTION COMPENSATION
Techniques for converting natural video scenes into drawing-style videos are frequently used to produce animated movies. In the past, the conversion was performed manually, which demanded a lot of time and incurred a high production cost. Recently, with the advancement of computer vision techniques and the development of new deep learning algorithms, 'drawing' can be performed automatically. Nevertheless, current 'drawing' algorithms are computationally expensive and require long processing times. In this letter, we present a simple but effective 'drawing' algorithm that is capable of reducing the processing time.
We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitations of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing infinite continuous affine transformation fields.
This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a key technology for realizing applications such as HCI and AR. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited because of the similar appearance of the fingers, occlusions, and the complexity arising from the wide range of finger motions. To overcome the limitations of existing methods, this paper changes both the input and output representations they use. Unlike most previous methods, which take a 2D depth image as input and directly regress the 3D coordinates of the hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. To this end, an encoder-decoder style 3D CNN is used, and thanks to the changed input and output representations, the proposed model achieves the highest performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
https://imatge.upc.edu/web/publications/importance-time-visual-attention-models
Bachelor thesis by Marta Cool, advised by Kevin McGuinness (Dublin City University) and Xavier Giro-i-Nieto (Universitat Politecnica de Catalunya).
Predicting visual attention is a very active field in the computer vision community. Visual attention is a mechanism of the visual system that can select relevant areas within a scene. Models for saliency prediction are intended to automatically predict which regions are likely to be attended by a human observer. Traditionally, ground truth saliency maps are built using only the spatial position of the fixation points, these fixation points being the locations where an observer fixates the gaze when viewing a scene. In this work we explore encoding the temporal information as well, and assess it in the application of predicting saliency maps with deep neural networks. It has been observed that the later fixations in a scanpath are usually selected randomly during visualization, especially in those images with few regions of interest. Therefore, computer vision models have difficulties learning to predict them. In this work, we explore a temporal weighting over the saliency maps to better cope with this random behaviour. The newly proposed saliency representation assigns different weights depending on the position in the sequence of gaze fixations, giving more importance to early timesteps than later ones. We used these maps to train MLNet, a state-of-the-art model for predicting saliency maps. MLNet predictions were evaluated and compared to the results obtained when the model was trained using traditional saliency maps. Finally, we show how the temporally weighted saliency maps brought some improvement when used to weight the visual features in an image retrieval task.
Intel, Intelligent Systems Lab:
Stable View Synthesis Whitepaper
We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images.
The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view.
The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection.
Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes.
A systematic image compression in the combination of linear vector quantisati...
Intel Intelligent Systems Labs:
Enhancing Photorealism Enhancement
Abstract:
We present an approach to enhancing the realism of synthetic images. The images are enhanced by a convolutional network that leverages intermediate representations produced by conventional rendering pipelines. The network is trained via a novel adversarial objective, which provides strong supervision at multiple perceptual levels. We analyze scene layout distributions in commonly used datasets and find that they differ in important ways. We hypothesize that this is one of the causes of strong artifacts that can be observed in the results of many prior methods. To address this we propose a new strategy for sampling image patches during training. We also introduce multiple architectural improvements in the deep network modules used for photorealism enhancement. We confirm the benefits of our contributions in controlled experiments and report substantial gains in stability and realism in comparison to recent image-to-image translation methods and a variety of other baselines.
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
Existing medical imaging techniques such as fMRI, positron emission tomography (PET), dynamic 3D ultrasound, and dynamic computerized tomography yield large four-dimensional data sets. 4D medical data sets are series of volumetric images captured over time; they are large in size and demand a great deal of resources for storage and transmission. In this paper, we present a method wherein a 3D image is taken and the Discrete Wavelet Transform (DWT) and Dual-Tree Complex Wavelet Transform (DTCWT) are applied to it separately, splitting the image into sub-bands. Encoding and decoding are done using 3D-SPIHT at different bits per pixel (bpp). The reconstructed image is synthesized using the inverse DWT. The quality of the compressed image is evaluated using measures such as Mean Square Error (MSE) and Peak Signal-to-Noise Ratio (PSNR).
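As a rough illustration of the analysis/synthesis round trip described above, the sketch below performs a 2D DWT with PyWavelets, applies a crude uniform quantization in place of the SPIHT coder, reconstructs with the inverse transform, and reports PSNR; the wavelet, decomposition level, quantization step, and synthetic test image are assumptions for illustration, not the paper's settings.

# Minimal sketch (not the paper's pipeline): DWT analysis/synthesis round trip with a PSNR check.
import numpy as np
import pywt

def psnr(original, reconstructed, peak=255.0):
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def dwt_roundtrip(image, wavelet="db4", level=2, q_step=8.0):
    # Analysis: split the image into sub-bands.
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    # Crude stand-in for the SPIHT coder: uniform quantization of every sub-band.
    quantized = [np.round(coeffs[0] / q_step) * q_step]
    for detail in coeffs[1:]:
        quantized.append(tuple(np.round(band / q_step) * q_step for band in detail))
    # Synthesis: inverse DWT from the quantized sub-bands.
    return pywt.waverec2(quantized, wavelet)

if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(256, 256)).astype(np.float64)  # stand-in image
    rec = dwt_roundtrip(img)[:256, :256]                                 # crop possible padding
    print(f"PSNR after round trip: {psnr(img, rec):.2f} dB")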
Google Research Siggraph Whitepaper | Total Relighting: Learning to Relight Portraits for Background Replacement
Abstract:
Given a portrait and an arbitrary high dynamic range lighting environment, our framework uses machine learning to composite the subject into a new scene, while accurately modeling their appearance in the target illumination condition. We estimate a high quality alpha matte, foreground element, albedo map, and surface normals, and we propose a novel, per-pixel lighting representation within a deep learning framework.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Most existing co-segmentation methods are complex and require pre-grouping of images, fine-tuning of a few parameters, initial segmentation masks, and so on. These limitations become serious concerns for their application to large-scale datasets. In this paper, a Group Saliency Propagation (GSP) model is proposed in which a single group saliency map is developed and propagated to segment the entire group. In addition, it is shown how a pool of these group saliency maps can help in quickly segmenting new input images. Experiments demonstrate that the proposed method achieves competitive performance on several benchmark co-segmentation datasets, including ImageNet, with the added advantage of a speed-up.
Compression of video files is currently in high demand. Color video compression has become a significant technology for reducing memory usage and transmission time. Video compression using the fractal technique is based on the self-similarity concept, comparing range blocks and domain blocks; however, its computational complexity is very high. In this paper we present a hybrid video compression technique to compress Audio/Video Interleaved files and overcome the problem of computational complexity. We implemented the Discrete Wavelet Transform and a hybrid fractal HV partition technique using Particle Swarm Optimization (called mapping of PSO) for video compression. The analysis demonstrates that the hybrid technique gives a very good speed-up for video compression and achieves a good Peak Signal-to-Noise Ratio.
COMPRESSION ALGORITHM SELECTION FOR MULTISPECTRAL MASTCAM IMAGES
The two mast cameras (Mastcam) onboard the Mars rover Curiosity are multispectral imagers with nine bands in each camera. Currently, the images are compressed losslessly using JPEG, which can achieve only two to three times compression. We present a two-step approach to compressing multispectral Mastcam images. First, we propose to apply principal component analysis (PCA) to compress the nine bands into three or six bands. This step optimally compresses the 9-band images through the spectral correlation between the bands. Second, several well-known image compression codecs, such as JPEG, JPEG-2000 (J2K), X264, and X265, are applied to compress the 3-band or 6-band images coming out of PCA. The performance of the different algorithms was assessed using four well-known performance metrics. Extensive experiments using actual Mastcam images have been performed to demonstrate the proposed framework. We observed that perceptually lossless compression can be achieved at a 10:1 compression ratio. In particular, the performance gain of the combination of PCA and X265 over JPEG is at least 5 dB in terms of peak signal-to-noise ratio (PSNR) at a 10:1 compression ratio.
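A minimal sketch of the first step only (the spectral PCA) follows, using scikit-learn; the band count, the number of retained components, and the synthetic data are assumptions for illustration, not the paper's pipeline.

# Project a 9-band image cube onto 3 principal components and measure the spectral reconstruction PSNR.
import numpy as np
from sklearn.decomposition import PCA

def spectral_pca_roundtrip(cube, n_components=3):
    h, w, bands = cube.shape
    pixels = cube.reshape(-1, bands)             # one row per pixel, one column per band
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(pixels)          # (h*w, n_components)
    restored = pca.inverse_transform(reduced)    # back to (h*w, bands)
    return reduced.reshape(h, w, n_components), restored.reshape(h, w, bands)

def psnr(a, b, peak=255.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

if __name__ == "__main__":
    cube = np.random.rand(64, 64, 9) * 255.0     # stand-in for a 9-band Mastcam image
    reduced, restored = spectral_pca_roundtrip(cube)
    print(reduced.shape, f"spectral PSNR: {psnr(cube, restored):.2f} dB")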
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
HUMAN ACTION RECOGNITION IN VIDEOS USING STABLE FEATURES
Human action recognition is still a challenging problem, and researchers are investigating it using different techniques. We propose a robust approach for human action recognition. It extracts stable spatio-temporal features in the form of a pairwise local binary pattern (P-LBP) and the scale-invariant feature transform (SIFT). These features are used to train an MLP neural network during the training stage, and the action classes are inferred from the test videos during the testing stage. The proposed features capture the motion of individuals well, and their consistency and accuracy remain high on a challenging dataset. The experimental evaluation is conducted on a benchmark dataset commonly used for human action recognition. In addition, we show that our approach outperforms the individual features, i.e., using only spatial or only temporal features.
InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Abstract. We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm in which we sample and render virtual camera trajectories, including cyclic ones, allowing our model to learn stable view generation from a collection of single views. At test time, despite never seeing a video during training, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality.
Decomposing image generation into layout prediction and conditional synthesis
In this presentation you can learn how to decompose image generation into layout prediction and conditional synthesis. The material is presented in a convenient way; I hope you find it easy to follow. Thank you.
Style transfer aims to combine the content of one image with the artistic style of another. It was discovered that lower levels of convolutional networks capture style information, while higher levels capture content information. The original style transfer formulation used a weighted combination of VGG-16 layer activations to achieve this goal. Later, this was accomplished in real time by using a feed-forward network to learn the optimal combination of style and content features from the respective images. The first aim of our project was to introduce a framework for capturing the style from several images at once. We propose a method that extends the original real-time style transfer formulation by combining the features of several style images. This method successfully captures color information from the separate style images. The other aim of our project was to improve the temporal style continuity from frame to frame. Accordingly, we have experimented with the temporal stability of the output images and discussed the various available techniques that could be employed as alternatives.
Recognition and tracking of moving objects using a moving camera in complex scenes
In this paper, we propose a method for effectively tracking moving objects in videos captured using a moving camera in complex scenes. The video sequences may contain highly dynamic backgrounds and illumination changes. Four main steps are involved in the proposed method. First, the video is stabilized using an affine transformation. Second, intelligent selection of frames is performed in order to extract only those frames that have a considerable change in content; this step reduces complexity and computational time. Third, the moving object is tracked using a Kalman filter and a Gaussian mixture model. Finally, object recognition using bag-of-features is performed in order to recognize the moving objects.
Image Captioning Generator using Deep Machine Learning
Technology's scope has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become some of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used to assist visually impaired persons, in automobiles for self-identification, and in various applications that require quick and easy verification. In this model, a Convolutional Neural Network (CNN) is used to describe the image, and a Long Short-Term Memory (LSTM) network is used to compose meaningful sentences. The Flickr8k and Flickr30k datasets were used for training. Sreejith S P | Vijayakumar A, "Image Captioning Generator using Deep Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
Performance analysis on color image mosaicing techniques on FPGA
Today, surveillance systems and other monitoring systems consider capturing image sequences and combining them into a single frame. The captured images can be combined to obtain a mosaiced image of the image sequence. However, the captured images may have quality issues such as brightness, alignment (correlation), resolution, and manual image registration issues. An existing technique like cross-correlation can offer good image mosaicing but faces brightness issues. Thus, this paper introduces two different methods for mosaicing: (a) Sliding Window Module (SWM) based Color Image Mosaicing (CIM) and (b) Discrete Cosine Transform (DCT) based CIM on a Field Programmable Gate Array (FPGA). The SWM-based CIM performs corner detection on two images and carries out automatic image registration, while the DCT-based CIM handles both the local and global alignment of images using a phase correlation approach. Finally, the performance of these two methods is analyzed by comparing parameters such as PSNR, MSE, device utilization, and execution time. From the analysis it is concluded that the DCT-based CIM offers better results than the SWM-based CIM.
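As a generic illustration of the phase correlation idea used for alignment in the DCT-based CIM, the NumPy sketch below estimates the translational offset between two overlapping images; it is not the paper's FPGA implementation, and the test data and expected offset are assumptions.

# Phase correlation: estimate the (row, col) translation between two images via the FFT.
import numpy as np

def phase_correlation(img_a, img_b):
    """Estimate the (row, col) translation of img_b relative to img_a."""
    fa = np.fft.fft2(img_a)
    fb = np.fft.fft2(img_b)
    cross_power = np.conj(fa) * fb
    cross_power /= np.abs(cross_power) + 1e-12           # keep only the phase difference
    correlation = np.abs(np.fft.ifft2(cross_power))
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks beyond the midpoint correspond to negative (wrap-around) shifts.
    return tuple(int(p) if p <= s // 2 else int(p - s) for p, s in zip(peak, correlation.shape))

if __name__ == "__main__":
    base = np.random.rand(128, 128)
    moved = np.roll(base, shift=(5, -9), axis=(0, 1))     # displaced copy with a known offset
    print(phase_correlation(base, moved))                 # expected: (5, -9)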
BIG DATA-DRIVEN FAST REDUCING THE VISUAL BLOCK ARTIFACTS OF DCT COMPRESSED IM...
Urban surveillance systems generate huge amounts of video and image data and place high pressure on recording disks. Video research is therefore a key part of big data research, and since videos are composed of images, the degree and efficiency of image compression are of great importance. Although the DCT-based JPEG standard is widely used, it has notable problems; for instance, image encoding deficiencies such as block artifacts frequently have to be removed. In this paper, we propose a new, simple but effective method to quickly reduce the visual block artifacts of DCT-compressed images for urban surveillance systems. The simulation results demonstrate that our proposed method achieves better quality than widely used filters while consuming far fewer CPU resources.
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Video-to-Video Translation using Cycle-Consistent Adversarial Networks
Alessandro Calmanovici
Supervisor: Zhaopeng Cui
Computer Vision and Geometry Group
Institute for Visual Computing
ETH Zurich
November 27, 2019
Abstract
Image-to-image translation is the task of translating an image to a different style or domain given paired or unpaired image examples at training time. Video-to-video translation, however, is a harder task. Translating a video means not only learning the structural features and appearance of objects and different scenes, but also producing realistic transitions and temporally consistent passages between consecutive frames. In this report we explore new ideas and approaches to video-to-video translation using existing image-to-image translation networks, in particular CycleGANs. We investigate how a new loss term of the network, which takes into account the flow information between two consecutive frames, can improve the performance on the task. We focus on a specific style transfer, namely translating a video from day to night and vice versa. We compare our results to a baseline obtained by transferring day to night with a standard CycleGAN applied to each frame of our dataset, and propose further possible optimizations of the model.
1 Introduction

Recently the image-to-image translation topic has experienced a big growth, and most of the approaches are based on deep neural networks. Gatys et al. [2] perform image style transfer using convolutional neural networks, which manage to separate image content from style and obtain explicit representations of semantic information. In particular, they show how to produce new images of high perceptual quality that combine the content of an arbitrary photograph with the appearance of numerous well-known artworks.

A similar attempt is [3], where the authors propose a new algorithm for color transfer between images that have perceptually similar semantic structure, optimizing a linear model with both local and global constraints. Their method also exploits neural representations, namely deep features extracted from a CNN encoding.

Another original idea is presented in [10]. The authors propose an approach (CycleGAN) for learning to translate an image from a source domain X to a target domain Y in the absence of paired training examples. Their results show good performance on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Again, they use convolutional and deep neural networks to extract features from the two domains.

Video-to-video translation, however, is a harder problem. The direct application of image-based approaches to videos may lead to many inconsistencies, one of the major issues being the lack of explicit information about temporal constraints between images during the training process. What's more, most image-based methods require paired data, and matching frames between videos is still an open problem.

In [6], Ruder et al. presented an approach that transfers the style from one image (for example, a painting) to a whole video sequence. Processing each frame of the video independently leads to flickering and false discontinuities, since the solution of the style transfer task is not stable. To tackle this problem, they introduce a temporal constraint that penalizes deviations between two frames using the optical flow of the original video. Besides that, they also initialize the optimization for frame i + 1 with the stylized frame i. Very recently, Wang et al. [8] proposed a novel video-to-video synthesis approach under the generative adversarial learning framework which requires paired input training data. They achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses.
This project instead aims to present an approach to learn how to translate a video from a source domain X to a target domain Y, without the constraint of having precise paired X and Y inputs. The first steps are attempts that consist in simply applying image-based approaches directly to videos, using post-processing techniques and showing the poor results obtained. We then propose a well-performing solution for the video-to-video translation task which can avoid frame matching between videos by using cycle-consistent adversarial networks. We show how to improve an image-based CycleGAN translation from domain A to domain B by proposing a new loss term included in the net loss function. The new term makes use of flow information between consecutive frames of a video to ensure better temporal consistency of the produced outputs.
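As a rough illustration of the kind of flow-based term just described, the sketch below warps the previously translated frame with the optical flow of the source video and penalizes its difference from the current translated frame. The function names, the L1 penalty, and the weight are assumptions for illustration; they are not the exact loss used in this report.

# Hedged sketch of a flow-based temporal consistency term (PyTorch).
import torch
import torch.nn.functional as F

def warp_with_flow(prev_frame, flow):
    """Backward-warp prev_frame (N,C,H,W) with a dense flow field (N,2,H,W) given in pixels."""
    n, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_frame.device)   # (2,H,W), x then y
    coords = base.unsqueeze(0) + flow                                   # displaced sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N,H,W,2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

def temporal_consistency_loss(curr_translated, prev_translated, flow, weight=10.0):
    # Penalize deviation between the current output and the flow-warped previous output.
    warped_prev = warp_with_flow(prev_translated, flow)
    return weight * torch.mean(torch.abs(curr_translated - warped_prev))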
2 CycleGAN

In this section I describe the CycleGAN architecture in more detail and the general idea behind it. Most of what I write here is based on the original paper [10]. The supporting structure of a CycleGAN, as the name itself suggests, is a GAN, a Generative Adversarial Network. GANs have achieved impressive results in image generation [1, 5], image editing [9], and representation learning [4, 5, 7]. The main feature of GANs is the concept of "adversarial loss", which guarantees, theoretically, that the generated images are indistinguishable from the real ones. The structure of these nets is based on a generator network and a discriminator network, which compete against each other: the generator tries to trick the discriminator by creating images more and more similar to the real ones, while the discriminator learns over time how to distinguish the real images from the generator's fake ones. CycleGANs exploit the adversarial loss and implement another key idea, cycle consistency. The idea of using transitivity as a way to regularize structured data has a long history; in visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [44]. In this case the authors introduce a loop in the GAN architecture to ensure that, starting from a fake generated image, it is possible to retrieve the original image which was the input of the net. The following figure shows the high-level flow of the CycleGAN.
We are given one set of images in domain X and a different set in domain Y. Training a mapping G: X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y "does not guarantee that the individual inputs and outputs x and y are paired up in a meaningful way: there are infinitely many mappings G that will induce the same distribution over y". The model learns two mapping functions, G: X → Y and F: Y → X, with their associated discriminators DY and DX. As described before, the discriminators aim to distinguish a real image from a generated one. The other two images, (b) and (c), represent the idea of the cycle-consistency loss. When the input x is fed to the generator G, it produces an image ŷ which is indistinguishable for DY from an original image belonging to domain Y. In the same way, the generator F learns to transform an image y into a fake x̂. The cycle loss ensures that from the two fake images it is still possible to go back to the original inputs. The generator G is therefore also applied to x̂ to produce another y which should be as close as possible to the original y, and the same applies, vice versa, to generator F. In formulas: x → G(x) → F(G(x)) ≈ x and y → F(y) → G(F(y)) ≈ y.
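To make the two ingredients concrete, here is a minimal PyTorch-style sketch of the adversarial and cycle-consistency terms in the form commonly used for CycleGAN; G, F, and D_Y are assumed user-defined networks, and the least-squares adversarial form and the weight lambda_cyc = 10 are common defaults rather than values taken from this report.

# Minimal sketch of the CycleGAN adversarial and cycle-consistency losses.
import torch
import torch.nn as nn

l1 = nn.L1Loss()
mse = nn.MSELoss()

def cycle_consistency_loss(G, F, real_x, real_y, lambda_cyc=10.0):
    # x -> G(x) -> F(G(x)) should come back to x, and symmetrically for y.
    rec_x = F(G(real_x))
    rec_y = G(F(real_y))
    return lambda_cyc * (l1(rec_x, real_x) + l1(rec_y, real_y))

def generator_adversarial_loss(G, D_Y, real_x):
    # The generator tries to make D_Y label its fake outputs as real (target = 1).
    fake_y = G(real_x)
    pred = D_Y(fake_y)
    return mse(pred, torch.ones_like(pred))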
3 Proposed Methods
We tried several approaches to improve the quality of video translation
using a CycleGAN. We started with basic postprocessing methods applied
to CycleGAN outputs, which didn’t lead to good results. A good video
translation quality was instead obtained by directly modifying the structure
of the CycleGAN itself. Here we describe our experiments and show the
results for each of them.
The goal of the task is to translate a video from daylight to night as
well as possible according to human evaluation. All the experiments below
therefore share this purpose. The CycleGAN architecture used is the one
published in [11] (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix).
The dataset used for the experiments is composed of two videos, both recorded
manually with a phone and 59 seconds long: the scene is
exactly the same for both, but the first one is recorded in daylight and the
second one at night. In Figure 1 we show some sample images of the dataset.
The images are resized to 300x300 before being processed by the net, and
the outputs are 300x300 as well. The reason is that CycleGAN is quite
memory-intensive as four networks (two generators and two discriminators)
need to be loaded on a single GPU, so larger images do not fit in memory.
Figure 1: Dataset samples: the first row shows daylight images and the second
row shows night images.
3.1 Baseline method
Our baseline is the following. We extract all the frames from both day
and night videos, then we sample one random frame every 10 frames for
both of them, which results in 196 images for daylight and 196 images for
night. The sampled images are the training data for the CycleGAN. We
train the net on this data for 200 epochs. We get two new generators A
and B which have learned how to transform a day image into a night image
and vice versa. We then apply the generator A to all the frames extracted
from the daylight video and create a new night video out of them, which
is compared to the original night video. In Figure 2 we show 3 original images
and the corresponding translated ones, while in Figure 3 we show 4 consecutive
frames translated by the CycleGAN.
Figure 2: Baseline samples: the first row shows ground truth images and the
second row shows CycleGAN outputs.
Figure 3: Consecutive CycleGAN samples considered as baseline.
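For completeness, below is a small sketch of the frame-extraction and sampling step of the baseline, using OpenCV. The 300x300 resize and the stride of 10 follow the description above; the file names, the function name and the output format are hypothetical choices.

import os
import random
import cv2

def sample_training_frames(video_path, out_dir, stride=10, size=(300, 300)):
    # Extract all frames from the video and keep one random frame out of
    # every `stride` consecutive frames, resized to `size`.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    window, kept = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        window.append(frame)
        if len(window) == stride:
            chosen = random.choice(window)
            cv2.imwrite(os.path.join(out_dir, "%05d.png" % kept),
                        cv2.resize(chosen, size))
            kept += 1
            window = []
    cap.release()
    return kept

# Build the two unpaired training sets for the CycleGAN, e.g.:
# sample_training_frames("day.mp4", "trainA")
# sample_training_frames("night.mp4", "trainB")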
3.2 Method with postprocessing
Now we show the performance of a postprocessing method based on optical
flow, i.e. the pattern of apparent motion of objects, surfaces, and edges in
a visual scene caused by the relative motion between an observer and the
scene. Our idea is the following. For each image I of the translated
CycleGAN output, we compute the flow between that image and the other images
in the window [I−n, I+n], where n determines the length of the window w
(w = 2n + 1). Then we replace each pixel of I with the average of all the
corresponding pixels inside the window w, using the flow information.
The flow information between two images provides the correspondence
of the pixels between the two images. In a short video segment, these
corresponding pixels should have a similar color. So for each pixel in a
transferred image, we can use these correspondences to find the corresponding
pixels in the other transferred images. Then we can just compute the mean RGB value of all these
pixels to replace the original RGB value.
For example, suppose the size of image1 is [H, W]; then the flow file
contains a matrix F of size [H, W, 2]. F(:, :, 0) encodes the
displacement along the X axis, and F(:, :, 1) encodes the displacement along
the Y axis. So for a pixel P1 located at [x1, y1] in image1, its corresponding
pixel P2 in image2 will be at [x1 + F(x1, y1, 0), y1 + F(x1, y1, 1)]. We locate
the corresponding pixels of P1 in all the images inside the window w and then
we average them. The results with n = 4, and therefore w = 9, are shown in Figure 4.
Figure 4: The first row shows CycleGAN baseline outputs and the second row
shows the same images after postprocessing.
We tried different window sizes, but the results were always blurry and
temporally inconsistent, so we decided to focus on other ways to improve the
outputs of the network.
3.3 Flow-guided CycleGAN
We decided to investigate another approach, which would ensure a bet-
ter temporal consistency directly during the training process. To do so,
we used the optical flow information by embedding it inside the CycleGAN.
Normally the loss function of the CycleGAN is updated every time a new
image is processed. Instead, we process two images and update the loss
function at the end, in order to include inside the loss a part which de-
rives from the temporal relationship between two consecutive frames. In-
deed, we added a flow-estimation loss to the CycleGAN. To do so,
we changed the input from a single frame to two consecutive frames. For
example, we consider as input [D1, D2] and [N1, N2]. D1 and D2 are
two consecutive frames at daytime, and N1 and N2 are two consecutive
frames at nighttime. Then our loss function can be defined as

L(G, F) = CycleGANLoss(G, F) + E[ |Flow(G(F(N2)), N1) − Flow(N2, N1)| ],

where the additional term is a flow loss which makes sure that the generated
frames maintain flows similar to those of the original frames. We create a
triangle of flow losses to improve the temporal consistency: given the
original frames N1 and N2 and the generated frames N1' and N2', we include in
the loss function three distances, |Flow(N1', N2) − Flow(N1, N2)|,
|Flow(N1, N2') − Flow(N1, N2)|, and |Flow(N1', N2') − Flow(N1, N2)|. The
results are shown in Figure 5.
Figure 5: The first row shows CycleGAN baseline outputs and the second row
shows the same images obtained after adding the flow loss.
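Below is a minimal sketch of how such a flow-consistency term can be added on top of the standard CycleGAN objective, assuming a frozen or differentiable flow estimator flow_net that maps a pair of frames to a flow field; the function names and the equal weighting of the three distances are illustrative assumptions.

import torch

def flow_triangle_loss(flow_net, n1, n2, n1_fake, n2_fake):
    # n1, n2:           two consecutive real frames (e.g. night domain)
    # n1_fake, n2_fake: the corresponding generated frames
    # The flows involving generated frames should match the flow between
    # the two real frames, which enforces temporal consistency.
    ref = flow_net(n1, n2)
    loss = (flow_net(n1_fake, n2) - ref).abs().mean()
    loss = loss + (flow_net(n1, n2_fake) - ref).abs().mean()
    loss = loss + (flow_net(n1_fake, n2_fake) - ref).abs().mean()
    return loss

# Total objective (sketch): CycleGANLoss(G, F) plus the flow terms computed
# for both the day pair [D1, D2] and the night pair [N1, N2].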
In Figure 5 we highlight parts of the transferred images that have a better
level of detail than the baseline ones, but the main advantage of our new
method is that it makes the reconstructed video more stable. The improved
temporal consistency is hard to appreciate from still images, but the final
video noticeably outperforms the baseline according to human judgment.
Finally we wanted to analyze the generalization of the net on unseen
data. We tried different training datasets and techniques, like fine-tuning
or early stopping, to achieve the same results on different scenes and envi-
ronments which are not part of the training data. Our experiments showed
very clearly that the net doesn’t perform well when training and test data
have different structures or represent different scenes. Even when the
training data is composed of videos from only two different locations, the transferred
style learned by the net is somehow in between the two original ones and
performs poorly on the test data.
For this reason, we captured data from different locations which look
very similar in terms of street appearance, building design, trees and so on.
All of the training videos are taken from a point A to a point B, both for
daylight and night. Then we captured additional data from point B to a point C,
which we used as test data. We trained the net with 400 random frames
from each training video for 100 epochs. Training data is shown in Figures
6 and 7. The results are shown in Figures 8 and 9.
Figure 6: The first row shows daylight images from the training set in the
first location, the second row shows night images from the training set in
the first location.
Figure 7: The first row shows daylight images from the training set in the
second location, the second row shows night images from the training set in
the second location.
Figure 8: The first row shows images from the test set in the first location,
the second row shows the corresponding transferred images produced by the
CycleGAN with the flow loss term.
Figure 9: The first row shows images from the test set in the second location,
the second row shows the corresponding transferred images produced by the
CycleGAN with the flow loss term.
The daylight images are new to the net and the corresponding night images are
obtained by feeding them into the day-to-night generator. Our conclusions are
the following: we found that a big challenge in this task is that the CycleGAN
tends to overfit to particular features of the training set and is not able
to generalize well when the test data differs from the training data.
Moreover, when the training involves more than one video and the videos do
not have a similar structure, it fails to learn how to transform a test video
into its correct counterpart. In the latter case, it seems that the net
overfits to certain features from the different domains and mixes them all
together during the test phase.
4 Conclusions
In this report we demonstrated how it is possible to achieve better perfor-
mance using a CycleGAN on the video-to-video translation task. There is
still a lot which can be improved. First of all, we didn’t tune any hyperpa-
rameter and we kept the original net architecture. There are other possible
ideas which can be applied using the flow information: it can be computed
initially and used as input to the net, for example. Another approach could
be feeding two or more images as a single tensor and assuming that the net
will learn some internal representation of the temporal constraints between
the images. Other video-to-video translation papers have been published, and
it may also be interesting to apply concepts borrowed from them to a
CycleGAN, for example from [8].
References
[1] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative
image models using a laplacian pyramid of adversarial networks. In
Advances in neural information processing systems, pages 1486–1494,
2015.
[2] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style
transfer using convolutional neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
2414–2423, 2016.
[3] Mingming He, Jing Liao, Lu Yuan, and Pedro V Sander. Neural color
transfer between images. arXiv preprint arXiv:1710.00756, 2017.
[4] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh,
Pablo Sprechmann, and Yann LeCun. Disentangling factors of varia-
tion in deep representation using adversarial training. In Advances in
Neural Information Processing Systems, pages 5040–5048, 2016.
[5] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised repre-
sentation learning with deep convolutional generative adversarial net-
works. arXiv preprint arXiv:1511.06434, 2015.
[6] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style
transfer for videos. In German Conference on Pattern Recognition,
pages 26–36. Springer, 2016.
[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, and Xi Chen. Improved techniques for training gans. In
Advances in Neural Information Processing Systems, pages 2234–2242,
2016.
[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew
Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv
preprint arXiv:1808.06601, 2018.
[9] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros.
Generative visual manipulation on the natural image manifold. In
European Conference on Computer Vision, pages 597–613. Springer,
2016.
[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Un-
paired image-to-image translation using cycle-consistent adversarial
networks. arXiv preprint arXiv:1703.10593, 2017.
[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Un-
paired image-to-image translation using cycle-consistent adversarial
networks. In IEEE International Conference on Computer Vision
(ICCV), 2017.