Disentangling Motion, Foreground and
Background Features in Videos
Slides by Xunyu Lin
ReadAI, UPC
29th May, 2017
Xunyu Lin, Victor Campos, Xavier Giro-i-Nieto, Jordi
Torres, Cristian Canton Ferrer
[paper] (15 May 2017) [code] [demo]
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Introduction
Unsupervised video feature learning
● Why unsupervised learning?
“Most of human and animal learning is unsupervised learning. If
intelligence was a cake, unsupervised learning would be the cake,
supervised learning would be the icing on the cake, and
reinforcement learning would be the cherry on the cake. We know
how to make the icing and the cherry, but we don’t know how to make
the cake.”
—— Yann LeCun
Introduction
Unsupervised video feature learning
● Explosion of video data:
○ Estimated amount of new video uploaded to YouTube every minute: 400 hours
● Human and animal learning is largely unsupervised: we discover the structure of the world by
observing it, not by being told the name of every object
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Motivation
How do humans summarize videos?
● Three key components:
○ What are the objects of interest? (foreground)
○ Where is this happening? (background)
○ What are they doing? (foreground motion)
Motivation
Biological inspiration
● How do infants perceive the world without any prior knowledge?
○ Infants divide perceptual arrays into units that move as connected wholes, that move
separately from one another, that tend to maintain their size and shape over motion, and
that tend to act upon each other only on contact.
Motivation
Supervised segmentation from motion cues (NLC)
● Video Segmentation by Non-Local Consensus Voting (BMVC 2014)
Motivation
Unsupervised segmentation from motion cues (uNLC)
● Learning Features by Watching Objects Move (CVPR 2017)
○ Unsupervised adaptation from NLC method
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Disentangle three features
Disentangling of foreground and background
[Architecture diagram: a C3D encoder splits the clip representation into foreground-motion, foreground and background features; two decoders reconstruct the foreground and the background of the first frame, with the targets defined by the uNLC mask.]
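A minimal sketch of how the uNLC mask can define the two reconstruction targets for this branch split (function and tensor names are ours, hypothetical, not the paper's code):

import torch

def split_targets(first_frame, unlc_mask):
    """Build the per-branch reconstruction targets for the first frame.

    first_frame: (B, 3, H, W) first frame of the clip.
    unlc_mask:   (B, 1, H, W) binary foreground mask produced by uNLC.
    Returns the foreground and background targets for the two decoders.
    """
    fg_target = first_frame * unlc_mask          # keep only foreground pixels
    bg_target = first_frame * (1.0 - unlc_mask)  # keep only background pixels
    return fg_target, bg_target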
Disentangle three features
Model of motion features
● Model motion as the task of updating the foreground appearance across frames
[Diagram: the motion feature acts on the foreground feature, updating its appearance from one frame to another.]
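In notation introduced here (ours, not the deck's): with $f^{fg}_1$ the first-frame foreground feature and $f^{mot}$ the clip's motion feature, the last-frame foreground feature is predicted as

$\hat{f}^{fg}_T = \mathcal{T}\big(f^{fg}_1;\, f^{mot}\big)$

where $\mathcal{T}$ is a learned, motion-conditioned update, realised later in the deck as a cross convolution in feature space.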
Disentangle three features
Original architecture
[Architecture diagram: the C3D encoder produces foreground-motion, foreground and background features; three decoders reconstruct the foreground of the last frame, the foreground of the first frame and the background of the first frame, with the targets defined by the uNLC mask.]
Disentangle three features
Original architecture
[Same architecture diagram, with an annotation on one of the learned features: the network may learn to cheat by storing the last foreground appearance feature there, instead of the features we want.]
Disentangle three features
Better modelling of motion features
[Architecture diagram: the C3D encoder outputs motion, first-foreground and background features. A kernel decoder turns the motion feature into kernels that are cross-convolved with the first-foreground feature in feature space (with gradients blocked into that branch) to predict the last-foreground feature; weight-shared Fg decoders reconstruct the foreground in the first and last frames, and a Bg decoder reconstructs the background in the first frame, with the targets defined by the uNLC mask.]
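A minimal sketch of the cross convolution in feature space, assuming the kernel decoder emits one k x k kernel per channel and that gradient blocking is done with detach (shapes, names and the depthwise application are our assumptions, not the paper's exact implementation):

import torch
import torch.nn.functional as F

def cross_conv(first_fg_feat, motion_kernels):
    """Apply motion-dependent kernels to the first-foreground feature map.

    first_fg_feat:  (B, C, H, W) first-frame foreground feature.
    motion_kernels: (B, C, k, k) kernels decoded from the motion feature.
    Returns the predicted last-frame foreground feature, (B, C, H, W).
    """
    B, C, H, W = first_fg_feat.shape
    k = motion_kernels.shape[-1]
    # Block gradients so the last-frame branch cannot "cheat" by reshaping
    # the first-foreground feature; only the motion branch is updated here.
    feat = first_fg_feat.detach().reshape(1, B * C, H, W)
    kernels = motion_kernels.reshape(B * C, 1, k, k)
    # Depthwise (grouped) convolution: each channel is filtered by its own kernel.
    out = F.conv2d(feat, kernels, padding=k // 2, groups=B * C)
    return out.reshape(B, C, H, W)

The predicted feature is then fed to the weight-shared Fg decoder to reconstruct the last frame's foreground.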
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Loss definition
Reconstruction loss with loss mask
● Adaptation of the loss:
○ Original loss = |reconstruction - ground_truth| (L1 loss)
○ Masked loss = |reconstruction - ground_truth| * loss_mask
[Diagram: the L1 loss |prediction - ground truth| is multiplied element-wise by a loss mask that is 1 on background pixels and S(background)/S(foreground) on foreground pixels, so that the (usually smaller) foreground contributes as much as the background.]
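A minimal sketch of this masked loss, assuming the uNLC foreground mask is a binary per-pixel map and that foreground pixels are upweighted by the area ratio shown in the diagram (names and tensor shapes are ours, not the paper's):

import torch

def masked_l1_loss(reconstruction, ground_truth, fg_mask):
    """L1 reconstruction loss reweighted by the loss mask described above.

    fg_mask: (B, 1, H, W) binary foreground mask. Background pixels keep
    weight 1; foreground pixels are upweighted by S(background)/S(foreground)
    so the (usually small) foreground is not dominated by the background.
    """
    s_fg = fg_mask.sum(dim=(1, 2, 3), keepdim=True).clamp(min=1.0)
    s_bg = (1.0 - fg_mask).sum(dim=(1, 2, 3), keepdim=True).clamp(min=1.0)
    loss_mask = (1.0 - fg_mask) + fg_mask * (s_bg / s_fg)
    return (torch.abs(reconstruction - ground_truth) * loss_mask).mean()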
Loss definition
Feature loss
● The reconstruction of the last foreground was always blurry
○ The last-foreground reconstruction depends on how well the first-foreground feature is learned
○ Cross convolution addresses this
Loss definition
Feature loss
[Same architecture diagram as before (C3D encoder, kernel decoder, cross convolution in feature space, gradient blocking, weight-shared Fg decoders, Bg decoder, uNLC mask), now annotated with where the losses apply: a feature loss on the predicted last-foreground feature and L1 reconstruction losses on the decoded frames.]
Loss definition
Feature loss
● Feature loss introduces supervision in deep feature space
● How can we obtain a ground-truth feature for the last foreground in an unsupervised manner?
○ We observed better performance on the first-foreground reconstruction
○ The first-foreground feature of the temporally flipped input clip can therefore be used as a pseudo ground truth
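A minimal sketch of this pseudo ground truth obtained by temporally flipping the clip (the encoder interface is an assumption made for the sketch):

import torch

def last_fg_pseudo_target(encoder, clip):
    """Pseudo ground-truth feature for the last-frame foreground.

    encoder: maps a clip (B, C, T, H, W) to (fg_first_feat, motion_feat, bg_feat);
             this interface is assumed for the sketch.
    clip:    input video clip, with time on dimension 2.
    The clip is reversed in time, so its "first-frame foreground" feature
    describes the last frame of the original clip; it is detached so it acts
    purely as a target for the feature loss.
    """
    flipped = torch.flip(clip, dims=[2])  # reverse the temporal order
    fg_first_feat, _, _ = encoder(flipped)
    return fg_first_feat.detach()

# feature_loss = torch.nn.functional.l1_loss(pred_last_fg_feat,
#                                            last_fg_pseudo_target(encoder, clip))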
Index
1. Introduction
2. Motivation
3. Disentangle three features
4. Loss definition
5. Experiments
6. Conclusions
Experiment
Sanity check
● With vs. without the loss mask
[Figure: reconstruction results without the loss mask vs. with the loss mask.]
Experiment
Trained on the subset of UCF-101 that has localization annotations
Experiment
Discriminative task (action recognition)
● Architecture
[Architecture diagram: the pretrained C3D encoder with its motion, first-foreground and background features feeding a softmax classifier.]
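A minimal sketch of this fine-tuning head, assuming the pretrained encoder returns the three disentangled features and that each flattens to feat_dim values (class name and interface are hypothetical):

import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Pretrained disentangling encoder followed by a softmax classifier."""

    def __init__(self, encoder, feat_dim, num_classes=101):
        super().__init__()
        self.encoder = encoder                      # pretrained C3D-based encoder
        self.classifier = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, clip):
        fg_first, motion, background = self.encoder(clip)
        feats = torch.cat([fg_first.flatten(1),
                           motion.flatten(1),
                           background.flatten(1)], dim=1)
        return self.classifier(feats)               # logits; softmax in the loss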
Experiment
Discriminative task (action recognition)
● Results: random initialization vs. our pretrained model
Conclusion
Our contributions
● Our method successfully simulates the human perceptual grouping mechanism driven by motion cues
● The proposed method learns richer, more general video representations by disentangling motion, foreground and background
● Given the small amount of data used for pre-training, our method is promising for achieving better results with larger datasets
Conclusion
Future work
● Introduce unsupervised learning for foreground segmentation, as proposed in uNLC
● Train with a larger amount of unlabeled data
● Introduce an adversarial loss to improve the sharpness of the reconstructed frames
● Fill the gap in motion features between the first and last frames by reconstructing a random intermediate frame of the clip
