Disentangling Motion, Foreground and
Background Features in Videos
Slides by Xunyu Lin
ReadAI, UPC
29th May, 2017
Xunyu Lin, Victor Campos, Xavier Giro-i-Nieto, Jordi
Torres, Cristian Canton Ferrer
[paper] (15 May 2017) [code] [demo]
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Introduction
Unsupervised video feature learning
● Why unsupervised learning?
“Most of human and animal learning is unsupervised learning. If
intelligence was a cake, unsupervised learning would be the cake,
supervised learning would be the icing on the cake, and
reinforcement learning would be the cherry on the cake. We know
how to make the icing and the cherry, but we don’t know how to make
the cake.”
—— Yann LeCun
Introduction
Unsupervised video feature learning
● Explosion of video data:
○ Estimated amount of new video uploaded to YouTube every minute: 400 hours
● Human and animal learning is largely unsupervised: we discover the structure of the world by
observing it, not by being told the name of every object
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Motivation
How do humans summarize videos?
● Three key components:
○ What are the objects of interest? (foreground)
○ Where is this happening? (background)
○ What are they doing? (foreground motion)
Motivation
Biological inspiration
● How do infants perceive the world without any prior knowledge?
○ Infants divide perceptual arrays into units that move as connected wholes, that move
separately from one another, that tend to maintain their size and shape over motion, and
that tend to act upon each other only on contact.
Motivation
Supervised segmentation from motion cues (NLC)
● Video Segmentation by Non-Local Consensus Voting (BMVC 2014)
Motivation
Unsupervised segmentation from motion cues (uNLC)
● Learning Features by Watching Objects Move (CVPR 2017)
○ Unsupervised adaptation from NLC method
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Disentangle three features
Disentangling of foreground and background
[Architecture diagram: a C3D encoder splits the clip representation into foreground-motion, foreground and background features; two decoders reconstruct the foreground and the background of the first frame, with the targets defined by the uNLC mask.]
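A minimal sketch of how the uNLC mask can define the two reconstruction targets for this branch split (function and tensor names are ours, hypothetical, not the paper's code):

import torch

def split_targets(first_frame, unlc_mask):
    """Build the per-branch reconstruction targets for the first frame.

    first_frame: (B, 3, H, W) first frame of the clip.
    unlc_mask:   (B, 1, H, W) binary foreground mask produced by uNLC.
    Returns the foreground and background targets for the two decoders.
    """
    fg_target = first_frame * unlc_mask          # keep only foreground pixels
    bg_target = first_frame * (1.0 - unlc_mask)  # keep only background pixels
    return fg_target, bg_target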
Disentangle three features
Model of motion features
● Model motion as the task of updating the foreground appearance across frames
[Diagram: the motion feature acts on the foreground feature, updating its appearance from one frame to another.]
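In notation introduced here (ours, not the deck's): with $f^{fg}_1$ the first-frame foreground feature and $f^{mot}$ the clip's motion feature, the last-frame foreground feature is predicted as

$\hat{f}^{fg}_T = \mathcal{T}\big(f^{fg}_1;\, f^{mot}\big)$

where $\mathcal{T}$ is a learned, motion-conditioned update, realised later in the deck as a cross convolution in feature space.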
Disentangle three features
Original architecture
[Architecture diagram: the C3D encoder produces foreground-motion, foreground and background features; three decoders reconstruct the foreground of the last frame, the foreground of the first frame and the background of the first frame, with the targets defined by the uNLC mask.]
Disentangle three features
Original architecture
[Same architecture diagram, with an annotation on one of the learned features: the network may learn to cheat by storing the last foreground appearance feature there, instead of the features we want.]
Disentangle three features
Better modelling of motion features
[Architecture diagram: the C3D encoder outputs motion, first-foreground and background features. A kernel decoder turns the motion feature into kernels that are cross-convolved with the first-foreground feature in feature space (with gradients blocked into that branch) to predict the last-foreground feature; weight-shared Fg decoders reconstruct the foreground in the first and last frames, and a Bg decoder reconstructs the background in the first frame, with the targets defined by the uNLC mask.]
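A minimal sketch of the cross convolution in feature space, assuming the kernel decoder emits one k x k kernel per channel and that gradient blocking is done with detach (shapes, names and the depthwise application are our assumptions, not the paper's exact implementation):

import torch
import torch.nn.functional as F

def cross_conv(first_fg_feat, motion_kernels):
    """Apply motion-dependent kernels to the first-foreground feature map.

    first_fg_feat:  (B, C, H, W) first-frame foreground feature.
    motion_kernels: (B, C, k, k) kernels decoded from the motion feature.
    Returns the predicted last-frame foreground feature, (B, C, H, W).
    """
    B, C, H, W = first_fg_feat.shape
    k = motion_kernels.shape[-1]
    # Block gradients so the last-frame branch cannot "cheat" by reshaping
    # the first-foreground feature; only the motion branch is updated here.
    feat = first_fg_feat.detach().reshape(1, B * C, H, W)
    kernels = motion_kernels.reshape(B * C, 1, k, k)
    # Depthwise (grouped) convolution: each channel is filtered by its own kernel.
    out = F.conv2d(feat, kernels, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)

The predicted feature is then fed to the weight-shared Fg decoder to reconstruct the last frame's foreground.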
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Loss definition
Reconstruction loss with loss mask
● Adaptation of the loss:
○ Original loss = |reconstruction - ground_truth| (L1 loss)
○ Masked loss = |reconstruction - ground_truth| * loss_mask
[Diagram: the L1 loss |prediction - ground truth| is multiplied element-wise by a loss mask that is 1 on background pixels and S(background)/S(foreground) on foreground pixels, so that the (usually smaller) foreground contributes as much as the background.]
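A minimal sketch of this masked loss, assuming the uNLC foreground mask is a binary per-pixel map and that foreground pixels are upweighted by the area ratio shown in the diagram (names and tensor shapes are ours, not the paper's):

import torch

def masked_l1_loss(reconstruction, ground_truth, fg_mask):
    """L1 reconstruction loss reweighted by the loss mask described above.

    fg_mask: (B, 1, H, W) binary foreground mask. Background pixels keep
    weight 1; foreground pixels are upweighted by S(background)/S(foreground)
    so the (usually small) foreground is not dominated by the background.
    """
    s_fg = fg_mask.sum(dim=(1, 2, 3), keepdim=True).clamp(min=1.0)
    s_bg = (1.0 - fg_mask).sum(dim=(1, 2, 3), keepdim=True).clamp(min=1.0)
    loss_mask = (1.0 - fg_mask) + fg_mask * (s_bg / s_fg)
    return (torch.abs(reconstruction - ground_truth) * loss_mask).mean()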
Loss definition
Feature loss
● The reconstruction of the last foreground was always blurry
○ The last-foreground reconstruction depends on how well the first-foreground feature is learned
○ Cross convolution addresses this
Loss definition
Feature loss
[Same architecture diagram as before (C3D encoder, kernel decoder, cross convolution in feature space, gradient blocking, weight-shared Fg decoders, Bg decoder, uNLC mask), now annotated with where the losses apply: a feature loss on the predicted last-foreground feature and L1 reconstruction losses on the decoded frames.]
Loss definition
Feature loss
● Feature loss introduces supervision in deep feature space
● How can we obtain a ground-truth feature for the last foreground in an unsupervised manner?
○ We observed better performance on the first-foreground reconstruction
○ The first-foreground feature of the temporally flipped input clip can therefore be used as a pseudo ground truth
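A minimal sketch of this pseudo ground truth obtained by temporally flipping the clip (the encoder interface is an assumption made for the sketch):

import torch

def last_fg_pseudo_target(encoder, clip):
    """Pseudo ground-truth feature for the last-frame foreground.

    encoder: maps a clip (B, C, T, H, W) to (fg_first_feat, motion_feat, bg_feat);
             this interface is assumed for the sketch.
    clip:    input video clip, with time on dimension 2.
    The clip is reversed in time, so its "first-frame foreground" feature
    describes the last frame of the original clip; it is detached so it acts
    purely as a target for the feature loss.
    """
    flipped = torch.flip(clip, dims=[2])  # reverse the temporal order
    fg_first_feat, _, _ = encoder(flipped)
    return fg_first_feat.detach()

# feature_loss = torch.nn.functional.l1_loss(pred_last_fg_feat,
#                                            last_fg_pseudo_target(encoder, clip))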
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Experiment
Sanity check
● With vs. without the loss mask
[Figure: reconstruction results without the loss mask vs. with the loss mask.]
Experiment
Trained on the subset of UCF-101 that has localization annotations
Experiment
Discriminative task (action recognition)
● Architecture
[Architecture diagram: the pretrained C3D encoder with its motion, first-foreground and background features feeding a softmax classifier.]
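A minimal sketch of this fine-tuning head, assuming the pretrained encoder returns the three disentangled features and that each flattens to feat_dim values (class name and interface are hypothetical):

import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Pretrained disentangling encoder followed by a softmax classifier."""

    def __init__(self, encoder, feat_dim, num_classes=101):
        super().__init__()
        self.encoder = encoder                      # pretrained C3D-based encoder
        self.classifier = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, clip):
        fg_first, motion, background = self.encoder(clip)
        feats = torch.cat([fg_first.flatten(1),
                           motion.flatten(1),
                           background.flatten(1)], dim=1)
        return self.classifier(feats)               # logits; softmax in the loss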
Experiment
Discriminative task (action recognition)
● Results: random initialization vs. our pretrained model
Conclusion
Our contributions
● Our method successfully simulates the human perceptual grouping mechanism driven by motion cues
● The proposed method learns richer, more general video representations by disentangling motion, foreground and background
● Given the small amount of data used for pre-training, our method is promising for achieving better results with larger datasets
Conclusion
Future work
● Introduce unsupervised learning for foreground segmentation, as proposed in uNLC
● Train with a larger amount of unlabeled data
● Introduce an adversarial loss to improve the sharpness of the reconstructed frames
● Fill the gap in motion features between the first and last frames by reconstructing a random intermediate frame of the clip
