Presenter: Won-Dong Jang (Ph.D. student, Korea University)
Date: August 2017
Abstract:
This talk presents a semi-supervised online video object segmentation algorithm that accepts a user annotation of the target object in the first frame. The algorithm propagates the segmentation labels from the previous frame to the current frame using optical flow vectors.
However, this propagation is error-prone. Therefore, we developed the convolutional trident network (CTN), which has three decoding branches: the separative, definite foreground, and definite background decoders.
The algorithm then performs Markov random field (MRF) optimization based on the outputs of the three decoders.
This process is carried out sequentially from the second frame to the last to extract a segment track of the target object.
Experimental results demonstrate that the algorithm significantly outperforms conventional state-of-the-art algorithms on the DAVIS benchmark dataset.
3. Video object segmentation
• Clustering pixels in videos into objects or background
• Unsupervised
• Supervised
• Semi-supervised
Segment track
4. Video object segmentation
• Unsupervised video object segmentation
• Discover and segment a primary object in a video
• Without user annotations
Input videos / ground-truth segmentation labels
5. Video object segmentation
• Unsupervised video object segmentation
• Saliency detection-based approach
• Visual saliency
• Motion saliency
Examples of saliency detection
6. Video object segmentation
• Supervised video object segmentation
• The user can annotate mislabeled pixels in any frame
• Interaction between an algorithm and a user
Annotation at the first frame → segmentation results → additional user annotation → segmentation results after adding the annotation
7. Video object segmentation
• Semi-supervised video object segmentation
• Discover and segment a primary object in a video
• With user annotations in the first frame
Annotated target object in the first frame → segmentation results
8. Related works
• Object proposal-based algorithm
• Generate object proposals in each frame
• Select object proposals that are similar to the target object
[2015][ICCV][Perazzi] Fully Connected Object Proposals for Video Segmentation
Generated object proposals
Segment track
9. Related works
• Superpixel-based algorithm
• Over-segment each frame into superpixels
• Trace the target object based on the inter-frame matching
[2015][CVPR][Wen] Joint Online Tracking and Segmentation
Inter-frame matching → segment track
11. In this presentation
• A semi-supervised online segmentation algorithm
• Online method
• Offline techniques require a huge memory space for a long video
• Deep learning-based approach
• Connection between a convolutional neural network and an MRF
optimization strategy
• Remarkable performance on the DAVIS benchmark dataset
12. Overview
• Framework
• Inter-frame propagation
• Inference via convolutional trident network (CTN)
• Yields three tailored probability maps for the MRF optimization
• MRF optimization
13. Inter-frame propagation
• Propagation from the previous frame
• To roughly locate the target object in the current frame
• Using backward optical flow
• From 𝑡 to 𝑡 − 1
Segmentation label map for frame 𝑡 − 1 + backward motion (optical flow) → propagation map for frame 𝑡
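The propagation step above can be sketched as follows. The (dy, dx) flow layout and the nearest-neighbor sampling are illustrative assumptions, since the slides do not specify the warping details:

```python
import numpy as np

def propagate_labels(prev_labels, backward_flow):
    """Warp the segmentation label map of frame t-1 to frame t.

    prev_labels:   (H, W) binary label map for frame t-1.
    backward_flow: (H, W, 2) optical flow from frame t to t-1,
                   stored as (dy, dx) per pixel (assumed layout).
    """
    H, W = prev_labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Each pixel p in frame t reads its label at p + flow(p) in frame t-1.
    src_y = np.clip(np.round(ys + backward_flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + backward_flow[..., 1]).astype(int), 0, W - 1)
    return prev_labels[src_y, src_x]
```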
14. Convolutional trident network
• Infer segmentation information
• The propagation map may be inaccurate
• The inferred information is effectively used to solve a binary
labeling problem
17. Network architecture
• VGG encoder
• Trained for image classification
• Using ImageNet dataset
22K categories
14M images
18. Network architecture
• VGG encoder
• Trained for image classification
• Using ImageNet dataset
(Figure: VGG network: convolutional layers followed by fully connected layers that classify among categories such as fox, cat, lion, dog, zebra, umbrella, tank, lamp, desk, orange, kite, …)
19. Network architecture
• Separative decoder (SD)
• Separate a target object from the background
• Using a down-sampled foreground propagation patch
(Figure: separative decoder architecture. The VGG encoder (Conv1_1 through Conv5_3 with pooling layers) takes the image patch and the foreground propagation patch as input; the decoder (SD-Dec5 through SD-Dec1 with unpooling layers and skip connections from the encoder) and the SD-Pred prediction layer output the separative probability patch.)
20. Network architecture
• Definite foreground and background decoders
• Definite pixels indicate locations that should be labeled as the
foreground or the background indubitably
• Fixing labels in definite pixels improves labeling accuracies
• Inspired by the image matting problem
Input image → tri-map → matting result
23. Network architecture
• Implementation issues in decoders
• Prediction layer
• Sigmoid layer is used to yield normalized outputs within [0, 1]
• Rectified linear unit (ReLU) + Batch normalization
• Kernel size
• 3 × 3 kernels are used in the prediction layers
• 5 × 5 kernels are used in the other convolution layers
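As a minimal sketch of the activation choices named above (ReLU with batch normalization in the intermediate layers, a sigmoid in the prediction layer), here are the three operations in NumPy; the batch-norm arguments stand in for learned statistics and parameters:

```python
import numpy as np

def relu(x):
    """Rectified linear unit applied after the intermediate convolutions."""
    return np.maximum(x, 0.0)

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization with learned statistics."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sigmoid(x):
    """Prediction-layer nonlinearity: squashes outputs into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))
```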
24. Training phase
• Lack of video object segmentation dataset
• There are several datasets for video object segmentation
• However, each of them consists of only a small number of videos, from 12 to 59
• Instead,
• We use the PASCAL VOC 2012 dataset
• Object segmentation
• 26,844 object masks
• 11,355 images
25. Training phase
• Preprocessing of training data
• Ground-truth masks for the DFD and DBD are not available
• Hence, we generate them through simple image processing
26. Training phase
• Preprocessing of training data
• Degrade the object mask to imitate propagation errors
• By performing suppression and noise addition
• Synthesize the ground-truth masks for the DFD and DBD
• By applying erosion and dilation
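The erosion/dilation step for synthesizing DFD and DBD ground truth can be sketched as below. The 3 × 3 structuring element and the iteration count are illustrative assumptions; the slides do not state the exact morphology parameters:

```python
import numpy as np

def erode(mask, it=1):
    """Binary erosion with a 3x3 square structuring element."""
    m = mask.astype(bool)
    H, W = m.shape
    for _ in range(it):
        p = np.pad(m, 1, constant_values=False)
        out = np.ones_like(m)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out &= p[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
        m = out
    return m

def dilate(mask, it=1):
    """Binary dilation with a 3x3 square structuring element."""
    m = mask.astype(bool)
    H, W = m.shape
    for _ in range(it):
        p = np.pad(m, 1, constant_values=False)
        out = np.zeros_like(m)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= p[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
        m = out
    return m

def make_definite_targets(obj_mask, it=1):
    # Eroded mask: pixels surely inside the object -> DFD target.
    df = erode(obj_mask, it)
    # Complement of the dilated mask: pixels surely outside -> DBD target.
    db = ~dilate(obj_mask, it)
    return df, db
```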
27. Training phase
• Implementation issues
• Caffe library
• Cross-entropy losses
$-\frac{1}{N}\sum_{n=1}^{N}\left[\,p_n \log \hat{p}_n + (1 - p_n) \log (1 - \hat{p}_n)\,\right]$
• Minibatch
• with eight training data
• Learning rate
• 1e-3 for the first 55 epochs
• 1e-4 for the next 35 epochs
• Stochastic gradient descent
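The per-decoder cross-entropy loss above can be written directly in NumPy; the clipping epsilon is a standard numerical safeguard, not something stated on the slide:

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """Mean binary cross-entropy over N pixels.

    p:     ground-truth labels in {0, 1}
    p_hat: sigmoid outputs in (0, 1)
    """
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))
```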
28. Inference phase
• Set input data
• By cropping the frame and propagation map
• CTN outputs three probability patches
• Separative probability map 𝑅S
• Definite foreground probability map 𝑅F
• Definite background probability map 𝑅B
(Figure: inference pipeline. Frame 𝑡, the segmentation label at frame 𝑡 − 1, and the optical flow feed the inter-frame propagation, which produces the propagation map at frame 𝑡. The image patch and the foreground/background propagation patches then enter the convolutional encoder-decoder network: a shared encoder followed by the separative, definite foreground, and definite background decoders, which output the separative, definite foreground, and definite background probability patches.)
29. Inference phase
• Classification
• Separative probability map 𝑅S
• If 𝑅S(𝐩) > 𝜃sep, pixel 𝐩 is classified as the foreground
• ℒ be the coordinate set for such foreground pixels
• Definite foreground probability map 𝑅F
• If 𝑅F(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite foreground
• ℱ denotes the set of the definite foreground pixels
• Definite background probability map 𝑅B
• If 𝑅B(𝐩) > 𝜃def, pixel 𝐩 is classified as the definite background
• ℬ indicates the set of the definite background pixels
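The three thresholding rules can be sketched as follows; the default threshold values are placeholders, since the slide does not give 𝜃sep and 𝜃def:

```python
import numpy as np

def classify(R_S, R_F, R_B, theta_sep=0.5, theta_def=0.5):
    """Threshold the three CTN probability maps (threshold values are placeholders).

    Returns boolean masks for the foreground set L, the definite
    foreground set F, and the definite background set B.
    """
    L = R_S > theta_sep  # foreground pixels
    F = R_F > theta_def  # definite foreground pixels
    B = R_B > theta_def  # definite background pixels
    return L, F, B
```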
30. MRF optimization
• Solve two-class MRF optimization problem
• To improve the segmentation quality further
• Define a graph 𝐺 = (𝑁, 𝐸)
• Nodes are pixels in the current frame
• Each node is connected to its four neighbors by edges
31. MRF optimization
• MRF energy function
$\mathcal{E}(S) = \sum_{\mathbf{p} \in N} \mathcal{D}(\mathbf{p}, S) + \gamma \sum_{(\mathbf{p}, \mathbf{q}) \in E} \mathcal{Z}(\mathbf{p}, \mathbf{q}, S)$
(Figure: image patch with the separative, definite foreground, and definite background probability patches)
Unary cost 𝒟
- Returns extremely high costs on definite foreground and definite background pixels when they are assigned the background and foreground labels, respectively
Pairwise cost 𝒵
- Encourages neighboring pixels to have the same label
32. MRF optimization
• Unary cost computation
• Build the RGB color Gaussian mixture models (GMMs) of the
foreground and the background, respectively
• 𝐾 = 10 for both GMMs
• Use pixels in ℒ to construct the foreground GMM
• Use pixels in ℒᶜ to construct the background GMM
• Gaussian cost
$\psi(\mathbf{p}, s) = \min_k \left( -\log f(\mathbf{p}; \mathcal{M}_{s,k}) \right)$
• Unary cost
$\mathcal{D}(\mathbf{p}, S) = \begin{cases} \infty & \text{if } \mathbf{p} \in \mathcal{F} \text{ and } S(\mathbf{p}) = 0 \\ \infty & \text{if } \mathbf{p} \in \mathcal{B} \text{ and } S(\mathbf{p}) = 1 \\ \psi(\mathbf{p}, S(\mathbf{p})) & \text{otherwise} \end{cases}$
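A minimal sketch of the Gaussian and unary costs, assuming diagonal-covariance GMM components (the slides do not specify the parameterization) and a large finite constant in place of ∞:

```python
import numpy as np

def gaussian_cost(color, means, variances, weights):
    """psi(p, s) = min_k -log f(color; component k) for one GMM.

    color: (3,) RGB value; means: (K, 3); variances: (K, 3); weights: (K,).
    Diagonal covariances are an assumption made for simplicity.
    """
    d = color - means  # (K, 3) residual to each component mean
    log_pdf = (np.log(weights)
               - 0.5 * np.sum(np.log(2 * np.pi * variances) + d**2 / variances,
                              axis=1))
    return np.min(-log_pdf)

def unary_cost(p_in_F, p_in_B, label, psi_cost, INF=1e9):
    """D(p, S): prohibitive cost when a definite pixel gets the wrong label."""
    if p_in_F and label == 0:   # definite foreground labeled background
        return INF
    if p_in_B and label == 1:   # definite background labeled foreground
        return INF
    return psi_cost             # otherwise the GMM cost psi(p, S(p))
```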
34. Reappearing object detection
• Identification of reappearing parts
• A target object may disappear, e.g. when it is occluded by other objects, and later reappear
• Use backward-forward motion consistency
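The backward-forward consistency check can be sketched as follows; the flow layout, the nearest-neighbor rounding, and the threshold τ are illustrative assumptions:

```python
import numpy as np

def motion_consistency(forward_flow, backward_flow, tau=1.0):
    """Flag pixels whose forward flow, traced back, returns near the start.

    forward_flow:  (H, W, 2) flow from frame t-1 to t, as (dy, dx)
    backward_flow: (H, W, 2) flow from frame t to t-1
    tau: consistency threshold in pixels (placeholder value)
    """
    H, W, _ = forward_flow.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Follow the forward flow, then read the backward flow at the landing pixel.
    ty = np.clip(np.round(ys + forward_flow[..., 0]).astype(int), 0, H - 1)
    tx = np.clip(np.round(xs + forward_flow[..., 1]).astype(int), 0, W - 1)
    # A consistent pixel's round trip ends close to where it started.
    round_trip = forward_flow + backward_flow[ty, tx]
    err = np.linalg.norm(round_trip, axis=-1)
    return err < tau
```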
36. Experimental results
• The DAVIS benchmark dataset
• 50 videos
• 854 × 480 resolution
• Number of frames
• From 25 to 104
• Difficulties
• Fast motion
• Occlusion
• Object deformation
37. Experimental results
• Performance measures
• Region similarity
• Jaccard index
• Contour accuracy
• F-measure
• Statistics
• Mean
• Recall
• Decay
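The two measures can be sketched as below. Note that DAVIS evaluates the F-measure on contour points, so the harmonic-mean formula here assumes boundary precision and recall have already been computed:

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J = |pred AND gt| / |pred OR gt|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def f_measure(precision, recall):
    """Contour accuracy F: harmonic mean of boundary precision and recall."""
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0
```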
42. Experimental results
• Running time analysis
• Prop-Q
• Uses the state-of-the-art optical flow technique
• Prop-F
• Adopts a much faster optical flow technique
• Parameter selection
43. Conclusions
• A semi-supervised online video object segmentation
algorithm is introduced in this presentation
• Deep learning-based semi-supervised video object
segmentation algorithm
• Tailored network for MRF optimization
• Remarkable performance on the DAVIS dataset
• Q&A
• Thank you