Learning object dynamics in video generation
Anant Gupta
Motivation
● Unsupervised Video Generation is an important problem
● Recent progress has opened the way to the next set of challenges in the
area
● In this work, we build on one of the best-performing models and attempt to
address these challenges
● We find that current metrics are insufficient to measure the progress of our
models
Methods used previously
● Direct pixel-level prediction
● Learning a geometric transformation function
● Autoregressive generation
● Supervised learning
● Adversarial methods
● Learning distribution for uncertainty from
○ Residual Error
○ Past frames
Stochastic Video Generation (Baseline)
● Uncertainty over future frames is captured by a learned prior distribution
● A latent sampled from this prior is then combined with the deterministic prediction
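The baseline's sampling loop can be sketched as below. This is a toy NumPy sketch only: in the actual model the prior, the frame predictor, and the recurrence are learned LSTM networks, so `prior_net`, `predict_frame`, and all scales here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_net(hidden):
    # Stand-in for the learned-prior network p(z_t | x_1:t-1):
    # maps the recurrent state to mean and log-variance of z_t.
    mu = 0.1 * hidden
    logvar = np.full_like(hidden, -2.0)
    return mu, logvar

def sample_z(mu, logvar):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def predict_frame(prev_frame, z):
    # Stand-in for the deterministic frame predictor: combines the
    # previous frame with the stochastic latent z.
    return np.clip(prev_frame + 0.05 * z.mean(), 0.0, 1.0)

# Roll out 5 future frames from one conditioning frame.
frame = np.full((8, 8), 0.5)   # toy 8x8 grayscale frame
hidden = np.zeros(4)           # toy recurrent state
frames = []
for t in range(5):
    mu, logvar = prior_net(hidden)
    z = sample_z(mu, logvar)            # stochastic part
    frame = predict_frame(frame, z)     # deterministic part
    hidden = 0.9 * hidden + z           # toy recurrence over latents
    frames.append(frame)
```

At training time the prior is fit with a KL term against an inference network; the loop above only shows generation.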
Issues
● Generation of interaction between objects
● Generation of previously occluded scenes
Methods
1. Hierarchical Latent Model
○ Latent variables at each layer are conditioned on those of the previous layer
○ Each layer's latent variables capture uncertainty at a particular frequency level
○ Similar to multi-scale signal representation in Computer Vision
○ Training can be done jointly or layer-wise
2. Pixel-Level Masking:
○ Hard Negative Sampling of pixel-level prediction errors.
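The layer-to-layer dependence in the hierarchical latent model can be sketched as below. This is a toy sketch under assumed shapes: a coarse-to-fine chain where each layer's latent is conditioned on an upsampled copy of the previous layer's, with a smaller noise scale standing in for the finer frequency level; the real model learns these distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(x):
    # Nearest-neighbour 2x upsampling: the multi-scale link
    # between consecutive layers of the hierarchy.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def sample_layer(parent, noise_scale):
    # z_l is conditioned on the upsampled latent of the coarser
    # layer l-1; noise_scale stands in for that layer's uncertainty.
    return parent + noise_scale * rng.standard_normal(parent.shape)

# Three-layer hierarchy: 4x4 (coarse) -> 8x8 -> 16x16 (fine).
z = rng.standard_normal((4, 4))      # top-level latent
latents = [z]
for noise in (0.5, 0.25):            # finer layers, smaller noise
    z = sample_layer(upsample(z), noise)
    latents.append(z)
```

Layer-wise training (HLM1) would fit one level of this chain at a time; joint training (HLM2) fits all levels together.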
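Pixel-level masking with hard negative sampling can be sketched as below: compute per-pixel errors, then back-propagate only through the hardest fraction of pixels. The `keep_frac` parameter and the top-k selection rule are illustrative assumptions, not the deck's exact recipe.

```python
import numpy as np

def masked_l2_loss(pred, target, keep_frac=0.25):
    # Hard negative sampling at the pixel level: compute per-pixel
    # squared errors, then average only over the hardest keep_frac
    # of pixels; the rest are masked out of the loss.
    err = (pred - target) ** 2
    flat = err.ravel()
    k = max(1, int(keep_frac * flat.size))
    hardest = np.sort(flat)[-k:]   # top-k largest errors
    return hardest.mean()

pred = np.array([[0.0, 0.1], [0.9, 1.0]])
target = np.array([[0.0, 0.0], [0.0, 1.0]])
# Hardest 25% (1 of 4 pixels) is the 0.9-vs-0.0 pixel: loss ~ 0.81
loss = masked_l2_loss(pred, target, keep_frac=0.25)
```

The PM2 variant would use absolute errors (`np.abs(pred - target)`) in place of the squared errors above.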
Experiments and Evaluation
● Model Variants
○ Trained layer-wise (HLM1)
○ Trained jointly (HLM2)
○ Pixel-wise masked L2 loss with LR decay (PM1)
○ Pixel-wise masked L1 loss with LR decay (PM2)
● Models initialized with pretrained baseline model
● Dataset: BAIR Robot Push dataset
● Evaluation Methods:
○ Peak Signal to Noise Ratio (PSNR)
○ Structural Similarity (SSIM)
○ Qualitative Analysis
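For reference, the two quantitative metrics can be sketched as below: the standard PSNR formula, and a simplified single-window SSIM (practical evaluations use a sliding Gaussian window, e.g. `skimage.metrics.structural_similarity`).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE), in decibels.
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=1.0):
    # Simplified global (single-window) SSIM with the standard
    # stabilizing constants C1 = (0.01*MAX)^2, C2 = (0.03*MAX)^2.
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

target = np.linspace(0.0, 0.9, 16).reshape(4, 4)
pred = target + 0.1   # uniform +0.1 error -> MSE = 0.01, PSNR = 20 dB
score = psnr(pred, target)
```

Both metrics compare frames pixel-wise, which is part of why the deck argues they do not fully capture object-interaction quality.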
Results
● HLM1 beats the baseline model in SSIM at later timesteps
Results (qualitative)
(Figures: generated frame sequences compared across Ground Truth, HLM1, and Baseline)
References
1. E. Denton and R. Fergus, “Stochastic video generation with a learned prior,”
arXiv preprint arXiv:1802.07687, 2018.
2. M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine,
“Stochastic variational video prediction,” arXiv preprint arXiv:1710.11252,
2017.
3. F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning
with temporal skip connections,” arXiv preprint arXiv:1710.05268, 2017.
4. C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical
interaction through video prediction,” in Advances in neural information
processing systems, pp. 64–72, 2016.
