Learning object dynamics in video generation
Anant Gupta
Motivation
● Unsupervised Video Generation is an important problem
● Recent progress has opened the way to the next set of challenges in the
area
● In this work, we build on one of the best-performing models and attempt to
address these challenges
● We find that current metrics are insufficient to measure the progress of our
models
Methods used previously
● Direct pixel-level prediction
● Learning a geometric transformation function
● Autoregressive generation
● Supervised learning
● Adversarial methods
● Learning distribution for uncertainty from
○ Residual Error
○ Past frames
Stochastic Video Generation (Baseline)
● Uncertainty over future frames is captured by a learned prior distribution
● A latent sampled from this prior is then combined with the deterministic prediction
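The baseline's sampling loop can be sketched as below. This is a toy NumPy sketch only: in the actual model the prior, the frame predictor, and the recurrence are learned LSTM networks, so `prior_net`, `predict_frame`, and all scales here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_net(hidden):
    # Stand-in for the learned-prior network p(z_t | x_1:t-1):
    # maps the recurrent state to mean and log-variance of z_t.
    mu = 0.1 * hidden
    logvar = np.full_like(hidden, -2.0)
    return mu, logvar

def sample_z(mu, logvar):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def predict_frame(prev_frame, z):
    # Stand-in for the deterministic frame predictor: combines the
    # previous frame with the stochastic latent z.
    return np.clip(prev_frame + 0.05 * z.mean(), 0.0, 1.0)

# Roll out 5 future frames from one conditioning frame.
frame = np.full((8, 8), 0.5)   # toy 8x8 grayscale frame
hidden = np.zeros(4)           # toy recurrent state
frames = []
for t in range(5):
    mu, logvar = prior_net(hidden)
    z = sample_z(mu, logvar)            # stochastic part
    frame = predict_frame(frame, z)     # deterministic part
    hidden = 0.9 * hidden + z           # toy recurrence over latents
    frames.append(frame)
```

At training time the prior is fit with a KL term against an inference network; the loop above only shows generation.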
Issues
● Generation of interaction between objects
● Generation of previously occluded scenes
Methods
1. Hierarchical Latent Model
○ Latent variables at each layer are conditioned on those of the previous layer
○ Each layer's latent variables capture uncertainty at a particular frequency level
○ Similar to multi-scale signal representation in Computer Vision
○ Training can be done jointly or layer-wise
2. Pixel-Level Masking:
○ Hard Negative Sampling of pixel-level prediction errors.
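The layer-to-layer dependence in the hierarchical latent model can be sketched as below. This is a toy sketch under assumed shapes: a coarse-to-fine chain where each layer's latent is conditioned on an upsampled copy of the previous layer's, with a smaller noise scale standing in for the finer frequency level; the real model learns these distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(x):
    # Nearest-neighbour 2x upsampling: the multi-scale link
    # between consecutive layers of the hierarchy.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def sample_layer(parent, noise_scale):
    # z_l is conditioned on the upsampled latent of the coarser
    # layer l-1; noise_scale stands in for that layer's uncertainty.
    return parent + noise_scale * rng.standard_normal(parent.shape)

# Three-layer hierarchy: 4x4 (coarse) -> 8x8 -> 16x16 (fine).
z = rng.standard_normal((4, 4))      # top-level latent
latents = [z]
for noise in (0.5, 0.25):            # finer layers, smaller noise
    z = sample_layer(upsample(z), noise)
    latents.append(z)
```

Layer-wise training (HLM1) would fit one level of this chain at a time; joint training (HLM2) fits all levels together.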
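Pixel-level masking with hard negative sampling can be sketched as below: compute per-pixel errors, then back-propagate only through the hardest fraction of pixels. The `keep_frac` parameter and the top-k selection rule are illustrative assumptions, not the deck's exact recipe.

```python
import numpy as np

def masked_l2_loss(pred, target, keep_frac=0.25):
    # Hard negative sampling at the pixel level: compute per-pixel
    # squared errors, then average only over the hardest keep_frac
    # of pixels; the rest are masked out of the loss.
    err = (pred - target) ** 2
    flat = err.ravel()
    k = max(1, int(keep_frac * flat.size))
    hardest = np.sort(flat)[-k:]   # top-k largest errors
    return hardest.mean()

pred = np.array([[0.0, 0.1], [0.9, 1.0]])
target = np.array([[0.0, 0.0], [0.0, 1.0]])
# Hardest 25% (1 of 4 pixels) is the 0.9-vs-0.0 pixel: loss ~ 0.81
loss = masked_l2_loss(pred, target, keep_frac=0.25)
```

The PM2 variant would use absolute errors (`np.abs(pred - target)`) in place of the squared errors above.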
Experiments and Evaluation
● Model Variants
○ Trained layer-wise (HLM1)
○ Trained jointly (HLM2)
○ Pixel-wise masked L2 loss with LR decay (PM1)
○ Pixel-wise masked L1 loss with LR decay (PM2)
● Models initialized with pretrained baseline model
● Dataset: BAIR Robot Push dataset
● Evaluation Methods:
○ Peak Signal to Noise Ratio (PSNR)
○ Structural Similarity (SSIM)
○ Qualitative Analysis
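For reference, the two quantitative metrics can be sketched as below: the standard PSNR formula, and a simplified single-window SSIM (practical evaluations use a sliding Gaussian window, e.g. `skimage.metrics.structural_similarity`).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE), in decibels.
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=1.0):
    # Simplified global (single-window) SSIM with the standard
    # stabilizing constants C1 = (0.01*MAX)^2, C2 = (0.03*MAX)^2.
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

target = np.linspace(0.0, 0.9, 16).reshape(4, 4)
pred = target + 0.1   # uniform +0.1 error -> MSE = 0.01, PSNR = 20 dB
score = psnr(pred, target)
```

Both metrics compare frames pixel-wise, which is part of why the deck argues they do not fully capture object-interaction quality.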
Results
● HLM1 beats the baseline model in SSIM at later timesteps
Results (qualitative)
(Figures: generated frame sequences compared across Ground Truth, HLM1, and Baseline)
References
1. E. Denton and R. Fergus, “Stochastic video generation with a learned prior,”
arXiv preprint arXiv:1802.07687, 2018.
2. M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine,
“Stochastic variational video prediction,” arXiv preprint arXiv:1710.11252,
2017.
3. F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning
with temporal skip connections,” arXiv preprint arXiv:1710.05268, 2017.
4. C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical
interaction through video prediction,” in Advances in neural information
processing systems, pp. 64–72, 2016.
