Title: Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation (AAAI 2021)
Authors: Sunghyun Park*, Kangyeol Kim*, Junsoo Lee, Jaegul Choo, Joonseok Lee, Sookyung Kim, Edward Choi (*: equal contributions)
Abstract:
Video generation models often operate under the assumption of fixed frame rates, which leads to suboptimal performance when it comes to handling flexible frame rates (e.g., increasing the frame rate of the more dynamic portion of the video as well as handling missing video frames). To resolve the restricted nature of existing video generation models' ability to handle arbitrary timesteps, we propose continuous-time video generation by combining neural ODE (Vid-ODE) with pixel-level video processing techniques. Using ODE-ConvGRU as an encoder, a convolutional version of the recently proposed neural ODE, which enables us to learn continuous-time dynamics, Vid-ODE can learn the spatio-temporal dynamics of input videos of flexible frame rates. The decoder integrates the learned dynamics function to synthesize video frames at any given timesteps, where the pixel-level composition technique is used to maintain the sharpness of individual frames. With extensive experiments on four real-world video datasets, we verify that the proposed Vid-ODE outperforms state-of-the-art approaches under various video generation settings, both within the trained time range (interpolation) and beyond the range (extrapolation). To the best of our knowledge, Vid-ODE is the first work successfully performing continuous-time video generation using real-world videos.
1. Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation
Sunghyun Park1*, Kangyeol Kim1*, Junsoo Lee1
Jaegul Choo1, Joonseok Lee2, Sookyung Kim3, Edward Choi1
1Korea Advanced Institute of Science and Technology (KAIST),
2Google Research, 3Lawrence Livermore Nat'l Lab.
2. Motivations
• Videos, recordings of a continuous flow of visual information, inevitably discretize continuous time into a predefined, finite number of units.
• It is challenging for video generation models to accept irregularly sampled frames or to generate frames at unseen timesteps.
[Figure: regular video frames at t = 0, 1, 2, 3 versus arbitrary video frames at t = 0.5, 1.4, 2.7, 3.8, 4.3, 4.8]
3. Motivations
• We aim to learn the continuous flow of videos from a sequence of frames (either regular or irregular) and synthesize new frames at any given timestep.
Continuous-time Video Generation
4. Importance of Continuous Generation
• Due to equipment cost, the time interval between measurements in climate videos often spans minutes to hours, which is insufficient to capture the target dynamics.
• Datasets collected in the wild frequently have missing values, which in turn results in irregular timesteps.
[Figure: climate frames observed at 0H, 3H, 6H, 9H, with a missing frame and unobserved times in between]
5. Introduction to Neural ODE
• Reformulation of the ResNet forward pass using an integral.
• Interpreting the formula as solving an ordinary differential equation (ODE), where a neural network serves as the derivative estimator.
• The essence of the neural ODE lies in learning continuous dynamics, and our paper aims to model the continuous dynamics of a video.
Neural Ordinary Differential Equations, Chen et al., NeurIPS 2018
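For concreteness, here is a minimal sketch of this formulation (our illustration, not the authors' code) using the torchdiffeq library: a small network defines the derivative dh/dt, and a black-box solver evaluates the state at arbitrary timesteps.

```python
# Minimal neural-ODE sketch (illustrative only, not the authors' code).
# A ResNet block h_{t+1} = h_t + f(h_t) becomes continuous dynamics
# dh/dt = f(h(t), t), evaluated by a black-box ODE solver.
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class ODEFunc(nn.Module):
    """Neural network serving as the derivative estimator f(h, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):            # odeint expects the signature (t, h)
        return self.net(h)

func = ODEFunc(dim=8)
h0 = torch.randn(1, 8)                  # initial hidden state h(t0)
ts = torch.tensor([0.0, 0.5, 1.0])      # arbitrary query timesteps
h_ts = odeint(func, h0, ts)             # states at each timestep, shape (3, 1, 8)
```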
6. Introduction to Neural ODE
Latent ODEs for Irregularly-Sampled Time Series, Rubanova et al., NeurIPS 2019
• Continuous time-series prediction (interpolation / extrapolation).
• A Latent-ODE with an ODE-RNN encoder can handle irregularly sampled time-series data.
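The ODE-RNN encoder can be sketched in a few lines (again our illustration, reusing the ODEFunc class from the previous sketch, and assuming strictly increasing observation times): the hidden state evolves continuously between observations via the ODE and is updated discretely by a GRU cell at each observation.

```python
# ODE-RNN sketch (illustrative): evolve the hidden state continuously between
# observations, then apply a discrete GRU update at each observation time.
import torch
import torch.nn as nn
from torchdiffeq import odeint

class ODERNN(nn.Module):
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        self.ode_func = ODEFunc(hidden_dim)   # derivative network from the sketch above
        self.gru = nn.GRUCell(obs_dim, hidden_dim)

    def forward(self, xs, ts):
        """xs: (N, obs_dim) observations at strictly increasing times ts: (N,), ts[0] > 0."""
        h = torch.zeros(1, self.gru.hidden_size)
        t_prev = torch.tensor(0.0)
        for x, t in zip(xs, ts):
            h = odeint(self.ode_func, h, torch.stack((t_prev, t)))[-1]  # continuous evolution
            h = self.gru(x.unsqueeze(0), h)                             # discrete update
            t_prev = t
        return h                                                        # summary of the sequence
```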
7. Video Generation via Neural ODE
• Previous approach to video generation via neural ODE: ODE²VAE.
• ODE²VAE decomposes the latent representation into position and momentum to model continuous dynamics.
ODE²VAE: Deep generative second order ODEs with Bayesian neural networks, Yıldız et al., NeurIPS 2019
8. Limitation of ODE²VAE
• Although ODE²VAE shows some promising directions in continuous time-series modeling, it remains an open question whether it can scale to continuous-time video generation on complicated real-world videos.
9. Relationship to Existing Video Models
• Video interpolation and extrapolation aim at generating an in-between frame and a future frame, respectively, given a set of video frames.
• Technically, existing approaches take advantage of warping operations or pixel-wise prediction for video generation.
• Since most existing models for these tasks rely on supervision signals defined at fixed timesteps (i.e., in-between and future frames recorded in a discretized manner), they are limited in generating frames at arbitrary timesteps.
• In this paper, we address this limitation by combining neural ODE with various vision techniques and propose Vid-ODE, a novel framework for continuous-time video generation.
10. Key Contributions
• Vid-ODE can predict video frames at any given timestep (both within and beyond the observed range).
• This is the first ODE-based framework to successfully perform continuous-time video generation on real-world videos.
• Vid-ODE can flexibly handle inputs unrestricted by pre-defined time intervals, outperforming several variants of ConvGRU and neural ODE on climate videos where data are sparsely collected.
12. Proposed Method: Encoder
• Prior works employ FC layers to model the derivative of the latent state.
• For the encoder, we propose ODE-ConvGRU, a combination of neural ODE and ConvGRU, to handle the spatial aspect of the given video frames.
• Specifically, ODE-ConvGRU processes 3D tensors with the aid of convolutional blocks, preserving spatial information (see the sketch below).
[Figure: ODE-RNN versus ODE-ConvGRU]
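A minimal sketch of the convolutional counterpart (our illustration; the paper's exact layer configuration may differ): both the derivative network and the GRU gates become 2D convolutions, so the hidden state remains a spatial feature map.

```python
# ODE-ConvGRU sketch (illustrative): the derivative network and the GRU gates
# are 2D convolutions, so the hidden state keeps its (B, C, H, W) spatial layout.
import torch
import torch.nn as nn

class ConvODEFunc(nn.Module):
    """Convolutional derivative estimator over a (B, C, H, W) hidden state."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.Tanh(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, t, h):
        return self.net(h)

class ConvGRUCell(nn.Module):
    """GRU cell whose gates are convolutions instead of FC layers."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde   # convex update of the hidden state
```

The encoder then alternates continuous evolution (odeint with ConvODEFunc) and discrete ConvGRUCell updates over the input frames, mirroring the ODE-RNN loop shown earlier.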
13. Proposed Method: Decoder
• Our decoder consists of an ODE solver and a Conv-Decoder.
• The ODE solver produces hidden states by integrating the learned dynamics at the given timesteps.
• Taking adjacent hidden states, the Conv-Decoder outputs an optical flow, an image difference, and a composition mask, as sketched below.
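A rough sketch of this decoding step (our illustration, reusing ConvODEFunc from the sketch above; conv_decoder is a hypothetical stand-in for the paper's Conv-Decoder):

```python
# Decoder sketch (illustrative; conv_decoder is a hypothetical stand-in for the
# paper's Conv-Decoder): integrate the learned dynamics to every query timestep,
# then decode each pair of adjacent hidden states into the three outputs.
import torch
from torchdiffeq import odeint

query_t = torch.tensor([0.0, 0.7, 1.3, 2.0])   # arbitrary decode timesteps
h_enc = torch.randn(1, 16, 32, 32)             # final encoder hidden state (B, C, H, W)
hs = odeint(ConvODEFunc(16), h_enc, query_t)   # hidden states, shape (T, B, C, H, W)

for h_prev, h_next in zip(hs[:-1], hs[1:]):
    pair = torch.cat([h_prev, h_next], dim=1)  # adjacent hidden states
    flow, diff, mask = conv_decoder(pair)      # optical flow, image difference, mask
```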
14. Linear Composition
• The three outputs of the decoder are combined via a convex combination (see the sketch below).
[Figures: combination procedure; visualization of the three intermediate outputs]
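One plausible form of this composition (a minimal sketch under our own assumptions; consult the paper for the exact formulation) warps the previous frame with the predicted flow and mixes it with the predicted image difference via the mask:

```python
# Composition sketch (illustrative): warp the previous frame with the predicted
# flow, then mix it with the predicted image difference via the mask.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) by a flow field (B, 2, H, W) given in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys]).float().unsqueeze(0) + flow
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1   # normalize x to [-1, 1]
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1   # normalize y to [-1, 1]
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

def compose(prev_frame, flow, diff, mask):
    """Convex combination of the warped previous frame and the image difference."""
    return mask * warp(prev_frame, flow) + (1 - mask) * diff   # mask in [0, 1]
```

Intuitively, the mask lets the model trust warped pixels where motion explains the next frame and fall back on the synthesized image difference elsewhere.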
15. Total Objective Functions
• We adopt image and sequence discriminators to improve the output quality via adversarial losses.
• The reconstruction loss computes the pixel-level distance between the predicted video and the ground truth.
• The difference loss helps the model learn the image difference as the pixel-wise difference between consecutive video frames.
• To sum up, the total objective function can be written as a weighted sum of these losses (see below).
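A plausible form of the total objective (our reconstruction, since the slide omits the equation; the λ weights are hyperparameters and the exact weighting should be checked against the paper):

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{1}\,\mathcal{L}^{\text{img}}_{\text{adv}}
  + \lambda_{2}\,\mathcal{L}^{\text{seq}}_{\text{adv}}
  + \lambda_{3}\,\mathcal{L}_{\text{recon}}
  + \lambda_{4}\,\mathcal{L}_{\text{diff}}
```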
16. Experimental Setup
• Dataset
• KTH Action: the videos of 25 subjects performing 6 different types of actions.
• Penn Action: the videos of humans playing sports.
• Moving GIF: the videos of animated animal characters.
• CAM5: a hurricane video dataset for evaluating irregularly-sampled video prediction.
• Bouncing Ball: the videos containing three balls moving in different directions.
• Evaluation Metric
• Structural Similarity (SSIM)
• Peak Signal-to-Noise Ratio (PSNR)
• Learned Perceptual Image Patch Similarity (LPIPS)
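All three metrics have standard open-source implementations; a minimal evaluation sketch (our illustration) using scikit-image and the lpips package:

```python
# Metric sketch (illustrative): SSIM/PSNR via scikit-image, LPIPS via the lpips package.
import torch
import lpips                                   # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(pred, gt):
    """pred, gt: uint8 HxWx3 numpy arrays of a predicted and a ground-truth frame."""
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    # LPIPS expects float tensors in [-1, 1] with shape (B, 3, H, W).
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1
    dist = lpips.LPIPS(net="alex")(to_t(pred), to_t(gt))
    return ssim, psnr, dist.item()
```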
17. Neural-ODE Comparison
• The table shows that Vid-ODE significantly outperforms all other baselines in both interpolation and extrapolation tasks.
19. Video Interpolation
• As expected, we see some gap between Vid-ODE and the supervised approach, Deep Voxel Flow (DVF).
• However, Vid-ODE outperforms Unsupervised Video Interpolation (UVI) in all cases (especially on MGIF), except for SSIM on KTH Action.
21. Video Extrapolation
• As shown in the table, Vid-ODE significantly outperforms all other baseline models in all metrics.
• It is noteworthy that the performance gap is wider on Moving GIF, which contains more dynamic object movements, indicating Vid-ODE's superior ability to learn complex dynamics.
23. Irregular Video Prediction
• We use the hurricane dataset (CAM5) to test Vid-ODE's ability to cope with irregularly sampled input.
• The table shows that Vid-ODE is able to process irregularly sampled video frames.
• In addition, we measure MSE and LPIPS on the CAM5 dataset while varying the input sampling rate to evaluate the effect of irregularity.
25. RNN vs ODE
• To emphasize the need for learning continuous video dynamics with an ODE, we compare Vid-ODE to Vid-RNN.
• Vid-RNN replaces the ODE components in Vid-ODE with a ConvGRU while retaining all other components.
26. RNN vs ODE
• Vid-ODE successfully infers video frames at unseen timesteps thanks to learning the underlying video dynamics.
• Vid-RNN generates unrealistic video frames because it simply blends two adjacent latent representations.
27. Conclusion
• We propose Vid-ODE, which exploits the continuous nature of neural ODEs to generate video frames at any given timestep.
• We demonstrate its ability to generate high-quality video frames in the continuous-time domain using four real-world video datasets.
• In future work, we plan to study how to adopt a flexible structure to address the auto-regressive architecture of Vid-ODE.