Title: Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation (AAAI 2021)
Authors: Sunghyun Park*, Kangyeol Kim*, Junsoo Lee, Jaegul Choo, Joonseok Lee, Sookyung Kim, Edward Choi (*: equal contributions)
Abstract:
Video generation models often operate under the assumption of fixed frame rates, which leads to suboptimal performance when it comes to handling flexible frame rates (e.g., increasing the frame rate of the more dynamic portion of the video as well as handling missing video frames). To resolve the restricted nature of existing video generation models' ability to handle arbitrary timesteps, we propose continuous-time video generation by combining neural ODE (Vid-ODE) with pixel-level video processing techniques. Using ODE-ConvGRU as an encoder, a convolutional version of the recently proposed neural ODE, which enables us to learn continuous-time dynamics, Vid-ODE can learn the spatio-temporal dynamics of input videos of flexible frame rates. The decoder integrates the learned dynamics function to synthesize video frames at any given timesteps, where the pixel-level composition technique is used to maintain the sharpness of individual frames. With extensive experiments on four real-world video datasets, we verify that the proposed Vid-ODE outperforms state-of-the-art approaches under various video generation settings, both within the trained time range (interpolation) and beyond the range (extrapolation). To the best of our knowledge, Vid-ODE is the first work successfully performing continuous-time video generation using real-world videos.
1. Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation
Sunghyun Park1*, Kangyeol Kim1*, Junsoo Lee1
Jaegul Choo1, Joonseok Lee2, Sookyung Kim3, Edward Choi1
1Korea Advanced Institute of Science and Technology (KAIST),
2Google Research, 3Lawrence Livermore Nat'l Lab.
2. Motivations
• Videos, recordings of a continuous flow of visual information, inevitably discretize continuous time into a predefined, finite number of units.
• It is challenging for video generation models to accept irregularly sampled frames or to generate frames at unseen timesteps.
[Figure: regular video frames at t = 0, 1, 2, 3 versus arbitrary video frames at t = 0.5, 1.4, 2.7, 3.8, 4.3, 4.8]
3. Motivations
• We aim to learn the continuous flow of videos from a sequence of frames (either regular or irregular) and synthesize new frames at any given timestep.
Continuous-time Video Generation
4. Importance of Continuous Generation
• Due to equipment cost, the time interval between measurements in climate videos often spans minutes to hours, which is insufficient to capture the target dynamics.
• Datasets collected in the wild frequently have missing values, which in turn results in irregular timesteps.
[Figure: climate frames observed at 0H, 3H, 6H, 9H, with a missing frame and unobserved times in between]
5. Introduction to Neural ODE
• Reformulation of the ResNet forward pass using an integral.
• Interpreting the formula as solving an ordinary differential equation (ODE), where a neural network serves as the derivative estimator.
• The essence of the neural ODE lies in learning continuous dynamics, and our paper aims to model the continuous dynamics of a video.
Neural Ordinary Differential Equations, Chen et al., NeurIPS 2018
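For concreteness, here is a minimal sketch of this formulation (our illustration, not the authors' code) using the torchdiffeq library: a small network defines the derivative dh/dt, and a black-box solver evaluates the state at arbitrary timesteps.

```python
# Minimal neural-ODE sketch (illustrative only, not the authors' code).
# A ResNet block h_{t+1} = h_t + f(h_t) becomes continuous dynamics
# dh/dt = f(h(t), t), evaluated by a black-box ODE solver.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class ODEFunc(nn.Module):
    """Neural network serving as the derivative estimator f(h, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):            # odeint expects the signature (t, h)
        return self.net(h)

func = ODEFunc(dim=8)
h0 = torch.randn(1, 8)                  # initial hidden state h(t0)
ts = torch.tensor([0.0, 0.5, 1.0])      # arbitrary query timesteps
h_ts = odeint(func, h0, ts)             # states at each timestep, shape (3, 1, 8)
```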
6. Introduction to Neural ODE
Latent ODEs for Irregularly-Sampled Time Series, Rubanova et al., NeurIPS 2019
• Continuous time-series prediction (interpolation / extrapolation).
• A Latent-ODE with an ODE-RNN encoder can handle irregularly sampled time-series data.
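The ODE-RNN encoder can be sketched in a few lines (again our illustration, reusing the ODEFunc class from the previous sketch, and assuming strictly increasing observation times): the hidden state evolves continuously between observations via the ODE and is updated discretely by a GRU cell at each observation.

```python
# ODE-RNN sketch (illustrative): evolve the hidden state continuously between
# observations, then apply a discrete GRU update at each observation time.
import torch
import torch.nn as nn
from torchdiffeq import odeint

class ODERNN(nn.Module):
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        self.ode_func = ODEFunc(hidden_dim)   # derivative network from the sketch above
        self.gru = nn.GRUCell(obs_dim, hidden_dim)

    def forward(self, xs, ts):
        """xs: (N, obs_dim) observations at strictly increasing times ts: (N,), ts[0] > 0."""
        h = torch.zeros(1, self.gru.hidden_size)
        t_prev = torch.tensor(0.0)
        for x, t in zip(xs, ts):
            h = odeint(self.ode_func, h, torch.stack((t_prev, t)))[-1]  # continuous evolution
            h = self.gru(x.unsqueeze(0), h)                             # discrete update
            t_prev = t
        return h                                                        # summary of the sequence
```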
7. Video Generation via Neural ODE
• Previous approach to video generation via neural ODE: ODE²VAE.
• ODE²VAE decomposes the latent representation into position and momentum to model continuous dynamics.
ODE²VAE: Deep generative second order ODEs with Bayesian neural networks, Yıldız et al., NeurIPS 2019
8. Limitation of ODE²VAE
• Although ODE²VAE shows some promising directions in continuous time-series modeling, it remains an open question whether it can scale to continuous-time video generation on complicated real-world videos.
9. Relationship to Existing Video Models
• Video interpolation and extrapolation aim at generating an in-between frame and a future frame, respectively, given a set of video frames.
• Technically, existing approaches take advantage of warping operations or pixel-wise prediction for video generation.
• Since most existing models for these tasks rely on supervision signals defined at fixed timesteps (i.e., in-between and future frames recorded in a discretized manner), they are limited in generating frames at arbitrary timesteps.
• In this paper, we address this limitation by combining neural ODE with various vision techniques and propose Vid-ODE, a novel framework for continuous-time video generation.
10. Key Contributions
• Vid-ODE can predict video frames at any given timestep (both within and beyond the observed range).
• This is the first ODE-based framework to successfully perform continuous-time video generation on real-world videos.
• Vid-ODE can flexibly handle inputs unrestricted by pre-defined time intervals, outperforming several variants of ConvGRU and neural ODE on climate videos where data are sparsely collected.
12. Proposed Method: Encoder
• Prior works employ FC layers to model the derivative of the latent state.
• For the encoder, we propose ODE-ConvGRU, a combination of neural ODE and ConvGRU, to handle the spatial aspect of the given video frames.
• Specifically, ODE-ConvGRU processes 3D tensors with the aid of convolutional blocks, preserving spatial information (see the sketch below).
[Figure: ODE-RNN versus ODE-ConvGRU]
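A minimal sketch of the convolutional counterpart (our illustration; the paper's exact layer configuration may differ): both the derivative network and the GRU gates become 2D convolutions, so the hidden state remains a spatial feature map.

```python
# ODE-ConvGRU sketch (illustrative): the derivative network and the GRU gates
# are 2D convolutions, so the hidden state keeps its (B, C, H, W) spatial layout.
import torch
import torch.nn as nn

class ConvODEFunc(nn.Module):
    """Convolutional derivative estimator over a (B, C, H, W) hidden state."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.Tanh(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, t, h):
        return self.net(h)

class ConvGRUCell(nn.Module):
    """GRU cell whose gates are convolutions instead of FC layers."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde   # convex update of the hidden state
```

The encoder then alternates continuous evolution (odeint with ConvODEFunc) and discrete ConvGRUCell updates over the input frames, mirroring the ODE-RNN loop shown earlier.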
13. Proposed Method: Decoder
• Our decoder consists of an ODE solver and a Conv-Decoder.
• The ODE solver produces hidden states by integrating the learned dynamics at the given timesteps.
• Taking adjacent hidden states, the Conv-Decoder outputs an optical flow, an image difference, and a composition mask, as sketched below.
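A rough sketch of this decoding step (our illustration, reusing ConvODEFunc from the sketch above; conv_decoder is a hypothetical stand-in for the paper's Conv-Decoder):

```python
# Decoder sketch (illustrative; conv_decoder is a hypothetical stand-in for the
# paper's Conv-Decoder): integrate the learned dynamics to every query timestep,
# then decode each pair of adjacent hidden states into the three outputs.
import torch
from torchdiffeq import odeint

query_t = torch.tensor([0.0, 0.7, 1.3, 2.0])   # arbitrary decode timesteps
h_enc = torch.randn(1, 16, 32, 32)             # final encoder hidden state (B, C, H, W)
hs = odeint(ConvODEFunc(16), h_enc, query_t)   # hidden states, shape (T, B, C, H, W)

for h_prev, h_next in zip(hs[:-1], hs[1:]):
    pair = torch.cat([h_prev, h_next], dim=1)  # adjacent hidden states
    flow, diff, mask = conv_decoder(pair)      # optical flow, image difference, mask
```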
14. Linear Composition
• The three outputs of the decoder are combined via a convex combination (see the sketch below).
[Figures: combination procedure; visualization of the three intermediate outputs]
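One plausible form of this composition (a minimal sketch under our own assumptions; consult the paper for the exact formulation) warps the previous frame with the predicted flow and mixes it with the predicted image difference via the mask:

```python
# Composition sketch (illustrative): warp the previous frame with the predicted
# flow, then mix it with the predicted image difference via the mask.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) by a flow field (B, 2, H, W) given in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys]).float().unsqueeze(0) + flow
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1   # normalize x to [-1, 1]
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1   # normalize y to [-1, 1]
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

def compose(prev_frame, flow, diff, mask):
    """Convex combination of the warped previous frame and the image difference."""
    return mask * warp(prev_frame, flow) + (1 - mask) * diff   # mask in [0, 1]
```

Intuitively, the mask lets the model trust warped pixels where motion explains the next frame and fall back on the synthesized image difference elsewhere.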
15. Total Objective Functions
• We adopt image and sequence discriminators to improve the output quality via adversarial losses.
• The reconstruction loss computes the pixel-level distance between the predicted video and the ground truth.
• The difference loss helps the model learn the image difference as the pixel-wise difference between consecutive video frames.
• To sum up, the total objective function can be written as a weighted sum of these losses (see below).
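A plausible form of the total objective (our reconstruction, since the slide omits the equation; the λ weights are hyperparameters and the exact weighting should be checked against the paper):

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{1}\,\mathcal{L}^{\text{img}}_{\text{adv}}
  + \lambda_{2}\,\mathcal{L}^{\text{seq}}_{\text{adv}}
  + \lambda_{3}\,\mathcal{L}_{\text{recon}}
  + \lambda_{4}\,\mathcal{L}_{\text{diff}}
```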
16. Experimental Setup
• Dataset
• KTH Action: the videos of 25 subjects performing 6 different types of actions.
• Penn Action: the videos of humans playing sports.
• Moving GIF: the videos of animated animal characters.
• CAM5: a hurricane video dataset for evaluating irregularly-sampled video prediction.
• Bouncing Ball: the videos containing three balls moving in different directions.
• Evaluation Metric
• Structural Similarity (SSIM)
• Peak Signal-to-Noise Ratio (PSNR)
• Learned Perceptual Image Patch Similarity (LPIPS)
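All three metrics have standard open-source implementations; a minimal evaluation sketch (our illustration) using scikit-image and the lpips package:

```python
# Metric sketch (illustrative): SSIM/PSNR via scikit-image, LPIPS via the lpips package.
import torch
import lpips                                   # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(pred, gt):
    """pred, gt: uint8 HxWx3 numpy arrays of a predicted and a ground-truth frame."""
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    # LPIPS expects float tensors in [-1, 1] with shape (B, 3, H, W).
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1
    dist = lpips.LPIPS(net="alex")(to_t(pred), to_t(gt))
    return ssim, psnr, dist.item()
```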
17. Neural-ODE Comparison
• The table shows that Vid-ODE significantly outperforms all other baselines in both interpolation and extrapolation tasks.
19. Video Interpolation
• As expected, we see some gap between Vid-ODE and the supervised approach, Deep Voxel Flow (DVF).
• However, Vid-ODE outperforms Unsupervised Video Interpolation (UVI) in all cases (especially on MGIF), except for SSIM on KTH Action.
21. Video Extrapolation
• As shown in the table, Vid-ODE significantly outperforms all other baseline models in all metrics.
• It is noteworthy that the performance gap is wider on Moving GIF, which contains more dynamic object movements, indicating Vid-ODE's superior ability to learn complex dynamics.
23. Irregular Video Prediction
• We use the hurricane dataset (CAM5) to test Vid-ODE's ability to cope with irregularly sampled input.
• The table shows that Vid-ODE is able to process irregularly sampled video frames.
• In addition, we measure MSE and LPIPS on the CAM5 dataset while varying the input sampling rate to evaluate the effect of irregularity.
25. RNN vs ODE
• To emphasize the need for learning continuous video dynamics with an ODE, we compare Vid-ODE to Vid-RNN.
• Vid-RNN replaces the ODE components in Vid-ODE with a ConvGRU while retaining all other components.
26. RNN vs ODE
• Vid-ODE successfully infers video frames at unseen timesteps thanks to learning the underlying video dynamics.
• Vid-RNN generates unrealistic video frames because it simply blends two adjacent latent representations.
27. Conclusion
• We propose Vid-ODE, which exploits the continuous nature of neural ODEs to generate video frames at any given timestep.
• We demonstrate its ability to generate high-quality video frames in the continuous-time domain using four real-world video datasets.
• In future work, we plan to study how to adopt a flexible structure to address the auto-regressive architecture of Vid-ODE.