Online Video Object Segmentation via
Convolutional Trident Network
Seminar at Naver
Won-Dong Jang
Korea University
2017-08-22
Introduction & related works
Video object segmentation
• Clustering pixels in videos into objects or background
• Unsupervised
• Supervised
• Semi-supervised
Segment track
Video object segmentation
• Unsupervised video object segmentation
• Discover and segment a primary object in a video
• Without user annotations
Input videos Ground-truth segmentation labels
Video object segmentation
• Unsupervised video object segmentation
• Saliency detection-based approach
• Visual saliency
• Motion saliency
Examples of saliency detection
Video object segmentation
• Supervised video object segmentation
• The user can annotate mislabeled pixels in any frame
• Interaction between an algorithm and a user
Annotation
at the first frame
Segmentation results after adding the annotation
Segmentation results
Additional
user annotation
Video object segmentation
• Semi-supervised video object segmentation
• Discover and segment a primary object in a video
• With user annotations in the first frame
Annotated target object
in the first frame
Segmentation results
Related works
• Object proposal-based algorithm
• Generate object proposals in each frame
• Select object proposals that are similar to the target object
[2015][ICCV][Perazzi] Fully Connected Object Proposals for Video Segmentation
Generated object proposals
Segment track
Related works
• Superpixel-based algorithm
• Over-segment each frame into superpixels
• Trace the target object via inter-frame matching
[2015][CVPR][Wen] Joint Online Tracking and Segmentation
Inter-frame matching Segment track
Proposed algorithm
In this presentation
• A semi-supervised online segmentation algorithm
• Online method
• Offline techniques require huge memory for long videos
• Deep learning-based approach
• Connection between a convolutional neural network and an MRF
optimization strategy
• Remarkable performance on the DAVIS benchmark dataset
Overview
• Framework
• Inter-frame propagation
• Inference via convolutional trident network (CTN)
• Yields three tailored probability maps for the MRF optimization
• MRF optimization
Inter-frame propagation
• Propagation from the previous frame
• To roughly locate the target object in the current frame
• Using backward optical flow
• From 𝑡 to 𝑡 − 1
Segmentation label map
for frame 𝑡 − 1
Backward motion
(optical flow)
Propagation map
for frame 𝑡
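The propagation step amounts to a backward warp of the previous frame's label map. Below is an illustrative NumPy sketch with nearest-neighbor sampling and a hypothetical `(dx, dy)` flow layout; it is not the paper's implementation:

```python
import numpy as np

def propagate_labels(prev_label, flow_bw):
    """Warp the segmentation label map of frame t-1 to frame t.

    prev_label : (H, W) binary label map for frame t-1.
    flow_bw    : (H, W, 2) backward optical flow from frame t to t-1;
                 flow_bw[y, x] = (dx, dy) points into frame t-1
                 (assumed layout, for illustration only).
    """
    h, w = prev_label.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each pixel in frame t, look up its source pixel in frame t-1.
    src_x = np.clip(np.round(xs + flow_bw[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow_bw[..., 1]).astype(int), 0, h - 1)
    return prev_label[src_y, src_x]
```

With zero flow the label map passes through unchanged; a uniform flow shifts the mask accordingly.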
Convolutional trident network
• Infer segmentation information
• The propagation map may be inaccurate
• The inferred information is effectively used to solve a binary
labeling problem
Network architecture
• Encoder-decoder architecture
• Single encoder
• VGG encoder
• Three decoders
• Separative decoder
• Definite foreground decoder
• Definite background decoder
Network architecture
• Encoder-decoder architecture
• Single encoder
• VGG encoder
• Three decoders
• Separative decoder
• Definite foreground decoder
• Definite background decoder
Network architecture
• VGG encoder
• Trained for image classification
• Using ImageNet dataset
22K categories
14M images
Network architecture
• VGG encoder
• Trained for image classification
• Using ImageNet dataset
Convolutional layers → fully connected layers → class scores (Fox, Cat, Lion, Dog, Zebra, Umbrella, Tank, Lamp, Desk, Orange, Kite, …)
Network architecture
• Separative decoder (SD)
• Separate a target object from the background
• Using a down-sampled foreground propagation patch
Encoder (Conv1_1–Conv5_3 with pooling) → decoder (SD-Dec5 to SD-Dec1 with unpooling and skip connections) → SD-Pred
Inputs: image patch and foreground propagation patch; output: separative probability patch
Network architecture
• Definite foreground and background decoders
• Definite pixels are locations that unquestionably belong to the foreground or the background
• Fixing the labels of definite pixels improves labeling accuracy
• Inspired by the image matting problem
Input image Tri-map Matting result
Network architecture
• Definite foreground decoder (DFD)
• Identifies definite foreground pixels
Encoder (Conv1_1–Conv5_3 with pooling) → decoder (DFD-Dec5 to DFD-Dec1 with unpooling) → DFD-Pred
Inputs: image patch and foreground propagation patch; output: definite foreground probability patch
Network architecture
• Definite background decoder (DBD)
• Finds definite background pixels
Encoder (Conv1_1–Conv5_3 with pooling) → decoder (DBD-Dec5 to DBD-Dec1 with unpooling) → DBD-Pred
Inputs: image patch and background propagation patch; output: definite background probability patch
Network architecture
• Implementation issues in decoders
• Prediction layer
• Sigmoid layer is used to yield normalized outputs within [0, 1]
• Rectified linear unit (ReLU) + Batch normalization
• Kernel size
• 3 × 3 kernels are used in the prediction layers
• 5 × 5 kernels are used in the other convolution layers
Training phase
• Lack of video object segmentation datasets
• There are several datasets for video object segmentation
• However, each consists of only a small number of videos (from 12 to 59)
• Instead,
• We use the PASCAL VOC 2012 dataset
• Object segmentation
• 26,844 object masks
• 11,355 images
Training phase
• Preprocessing of training data
• Ground-truth masks for the DFD and DBD are not available
• Hence, we generate them through simple image processing
Training phase
• Preprocessing of training data
• Degrade the object mask to imitate propagation errors
• By performing suppression and noise addition
• Synthesize the ground-truth masks for the DFD and DBD
• By applying erosion and dilation
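The erosion/dilation step for synthesizing DFD and DBD ground truth might be sketched as follows in plain NumPy. The margin size and the square structuring element are hypothetical choices, not taken from the paper:

```python
import numpy as np

def _dilate(mask, r):
    """Naive binary dilation with a (2r+1) x (2r+1) square structuring element."""
    out = np.zeros_like(mask)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def synthesize_definite_masks(gt_mask, margin=2):
    """Derive DFD/DBD ground truth from an object mask (illustrative sketch).

    Eroding the mask keeps only pixels well inside the object (definite
    foreground); dilating it and inverting keeps pixels well outside it
    (definite background). `margin` is a hypothetical band width.
    """
    mask = gt_mask.astype(bool)
    # Erosion = complement of the dilation of the complement.
    definite_fg = ~_dilate(~mask, margin)
    definite_bg = ~_dilate(mask, margin)
    return definite_fg.astype(np.uint8), definite_bg.astype(np.uint8)
```

Pixels in the uncertain band between the two masks belong to neither definite set.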
Training phase
• Implementation issues
• Caffe library
• Cross-entropy losses
−(1/N) Σ_{n=1..N} [ p_n log p̂_n + (1 − p_n) log(1 − p̂_n) ]
• Minibatch
• with eight training data
• Learning rate
• 1e-3 for the first 55 epochs
• 1e-4 for the next 35 epochs
• Stochastic gradient descent
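The cross-entropy loss above can be written out directly; this is an illustrative NumPy version, where the `eps` clipping is my addition to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(p, p_hat, eps=1e-12):
    """Binary cross-entropy between ground-truth labels p (0 or 1) and
    predicted probabilities p_hat, averaged over the N pixels, as on
    the slide. eps clips predictions away from 0 and 1."""
    p = np.asarray(p, dtype=float)
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    return -np.mean(p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))
```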
Inference phase
• Set input data
• By cropping the frame and propagation map
• CTN outputs three probability patches
• Separative probability map 𝑅S
• Definite foreground probability map 𝑅F
• Definite background probability map 𝑅B
Frame 𝑡, the segmentation label at frame 𝑡 − 1, and the optical flow enter the inter-frame propagation, producing the propagation map at frame 𝑡
Inference via convolutional encoder-decoder network: the encoder receives the image patch and the foreground/background propagation patches; the separative, definite foreground, and definite background decoders produce the corresponding probability patches
Inference phase
• Classification
• Separative probability map 𝑅S
• If 𝑅S(𝐩) > 𝜃sep, pixel 𝐩 is classified as the foreground
• ℒ denotes the coordinate set of such foreground pixels
• Definite foreground probability map 𝑅F
• If 𝑅F(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite foreground
• ℱ denotes the set of the definite foreground pixels
• Definite background probability map 𝑅B
• If 𝑅B(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite background
• ℬ indicates the set of the definite background pixels
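The three thresholding rules can be sketched as below; the default threshold values are hypothetical placeholders, not the values used in the paper:

```python
import numpy as np

def classify_pixels(r_s, r_f, r_b, theta_sep=0.5, theta_def=0.5):
    """Threshold the three CTN probability maps into pixel sets.

    Returns boolean masks for L (separative foreground), F (definite
    foreground), and B (definite background).
    """
    fg = r_s > theta_sep       # set L: pixels classified as foreground
    def_fg = r_f > theta_def   # set F: definite foreground pixels
    def_bg = r_b > theta_def   # set B: definite background pixels
    return fg, def_fg, def_bg
```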
MRF optimization
• Solve two-class MRF optimization problem
• To improve the segmentation quality further
• Define a graph 𝐺 = (𝑁, 𝐸)
• Nodes are pixels in the current frame
• Each node is connected to its four neighbors by edges
MRF optimization
• MRF energy function
ℰ(S) = Σ_{𝐩∈N} 𝒟(𝐩, S) + γ · Σ_{(𝐩,𝐪)∈E} 𝒵(𝐩, 𝐪, S)
Image patch, separative probability patch, definite foreground probability patch, definite background probability patch
Unary cost 𝒟
- Returns extremely high costs on definite foreground and definite background pixels when they take the background and foreground labels, respectively
Pairwise cost 𝒵
- Encourages neighboring pixels to have the same label
MRF optimization
• Unary cost computation
• Build the RGB color Gaussian mixture models (GMMs) of the
foreground and the background, respectively
• 𝐾 = 10 for both GMMs
• Use pixels in ℒ to construct the foreground GMMs
• Use pixels in ℒᶜ to construct the background GMMs
• Gaussian cost
𝜓(𝐩, s) = min_k [ −log f(𝐩; ℳ_{s,k}) ]
• Unary cost
𝒟(𝐩, S) = ∞ if 𝐩 ∈ ℱ and S(𝐩) = 0
𝒟(𝐩, S) = ∞ if 𝐩 ∈ ℬ and S(𝐩) = 1
𝒟(𝐩, S) = 𝜓(𝐩, S(𝐩)) otherwise
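As a sketch, the Gaussian cost and the unary rule might look like this; the density parameterization (component weights, full 3-D covariances) is an assumption for illustration:

```python
import numpy as np

def gaussian_cost(color, means, covs, weights):
    """psi(p, s) = min_k -log f(color; M_{s,k}) over the K components
    of one class-conditional RGB GMM (illustrative sketch)."""
    costs = []
    for mu, cov, w in zip(means, covs, weights):
        diff = color - mu
        inv = np.linalg.inv(cov)
        # -log of a weighted 3-D Gaussian density.
        cost = (0.5 * diff @ inv @ diff
                + 0.5 * np.log(np.linalg.det(cov))
                + 1.5 * np.log(2 * np.pi)
                - np.log(w))
        costs.append(cost)
    return min(costs)

def unary_cost(p_in_F, p_in_B, label, psi_fg, psi_bg):
    """D(p, S): infinite cost when a definite pixel takes the wrong
    label, otherwise the Gaussian cost of the assigned label."""
    if p_in_F and label == 0:
        return np.inf
    if p_in_B and label == 1:
        return np.inf
    return psi_fg if label == 1 else psi_bg
```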
MRF optimization
• Pairwise cost computation
• Pairwise cost
𝒵(𝐩, 𝐪, S) = exp(−d(𝐩, 𝐪)) if S(𝐩) ≠ S(𝐪), and 0 otherwise
• Graph-cut optimization
ℰ(S) = Σ_{𝐩∈N} 𝒟(𝐩, S) + γ · Σ_{(𝐩,𝐪)∈E} 𝒵(𝐩, 𝐪, S)
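A minimal sketch of the pairwise cost, assuming d(𝐩, 𝐪) is a color distance between the two neighboring pixels (the exact form of d is not specified on the slide):

```python
import numpy as np

def pairwise_cost(color_p, color_q, label_p, label_q):
    """Z(p, q, S): penalize label discontinuities between 4-neighbors,
    with a smaller penalty across strong color edges. The color
    distance d(p, q) here is a hypothetical Euclidean RGB distance."""
    if label_p == label_q:
        return 0.0
    d = np.linalg.norm(np.asarray(color_p, float) - np.asarray(color_q, float))
    return np.exp(-d)
```

In graph-cut terms this cost is submodular, so the two-class energy can be minimized exactly.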
Reappearing object detection
• Identification of reappearing parts
• A target object may disappear from view or be occluded by other objects
• Use backward-forward motion consistency
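The backward-forward consistency idea can be sketched as a round-trip error test; the tolerance `tau` and the nearest-neighbor lookup are assumptions for illustration:

```python
import numpy as np

def flow_consistency(flow_fw, flow_bw, tau=1.0):
    """Backward-forward motion consistency check (illustrative sketch).

    A pixel is consistent when following the forward flow to the next
    frame and then the backward flow returns approximately to the
    starting point. tau is a hypothetical tolerance in pixels.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Destination of each pixel under the forward flow.
    dst_x = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    dst_y = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    # Round-trip displacement: forward flow plus backward flow at the destination.
    err_x = flow_fw[..., 0] + flow_bw[dst_y, dst_x, 0]
    err_y = flow_fw[..., 1] + flow_bw[dst_y, dst_x, 1]
    return np.hypot(err_x, err_y) < tau
```

Inconsistent pixels mark occluded or reappearing regions.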
Experimental results
Experimental results
• The DAVIS benchmark dataset
• 50 videos
• 854 × 480 resolution
• Number of frames
• From 25 to 104
• Difficulties
• Fast motion
• Occlusion
• Object deformation
Experimental results
• Performance measures
• Region similarity
• Jaccard index
• Contour accuracy
• F-measure
• Statistics
• Mean
• Recall
• Decay
Experimental results
• Performance comparison
Experimental results
• Qualitative results
Experimental results
• The SegTrack dataset
• 5 videos
• Performance comparison
• Jaccard index
Experimental results
• Ablation studies
• Effectiveness of each decoder
• Efficacy of MRF optimization
Experimental results
• Running time analysis
• Prop-Q
• Uses a state-of-the-art optical flow technique
• Prop-F
• Adopts a much faster optical flow technique
• Parameter selection
Conclusions
• A semi-supervised online video object segmentation
algorithm is introduced in this presentation
• Deep learning-based semi-supervised video object
segmentation algorithm
• Tailored network for MRF optimization
• Remarkable performance on the DAVIS dataset
• Q&A
• Thank you
