Video Object Segmentation in Videos

Video Object Segmentation
Yeong Jun Koh, Korea University

Segmentation
• Divide data into meaningful segments
• Superpixel
• Image segmentation
• Video segmentation
• Video object segmentation

Video Object Segmentation
• Semi-supervised video object segmentation
• Primary object segmentation
• Multiple object segmentation

Semi-supervised Video Object Segmentation
• Track and segment a target object
• Annotated by a user in the first frame
[Figure: first frame with user annotation → segment track]

Primary Object Segmentation
• Segment a primary object in a video automatically
Primary object: Diver
Primary object: Tennis player

Multiple Object Segmentation
• Extract as many segment tracks as possible

Primary Object Segmentation
• Primary object segmentation
• Initial region estimation
• Motion boundaries
• Object proposal
• Saliency maps
• Refinement
• Construct models for the primary object and the background,
e.g. Gaussian mixture models (GMMs)
• Propose augmentation and reduction process (ARP)

Primary Object Segmentation in Videos Based on Region Augmentation and Reduction
• Overview
• Input: A set of consecutive video frames
• Output: A set of pixel-wise segments to delineate the primary object

Candidate Region Generation
• Candidate regions
• Ultrametric contour map (UCM)
• Obtain color-based and motion-based UCMs
• Each region in UCM becomes a superpixel

• Candidate regions
• Generate candidate regions by merging neighboring superpixels
• Determine the pair, $s_m$ and $s_n$, sharing the weakest boundary
• Merge $s_m$ and $s_n$ into a single superpixel
• Repeat this process until only one superpixel remains
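The merging loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: superpixels are modeled as sets of `(row, col)` pixels, boundary strengths are given as a dictionary keyed by pairs of superpixel ids, and `merge_hierarchy` is a hypothetical name. Two further assumptions are made: the initial superpixels are themselves kept as candidates, and when a merged region has two boundaries to the same neighbor, the weaker strength is kept.

```python
def merge_hierarchy(superpixels, boundary_strength):
    """Greedily merge the pair with the weakest boundary until one region remains.

    superpixels: dict id -> iterable of (row, col) pixels
    boundary_strength: dict frozenset({id_a, id_b}) -> float (adjacent pairs only)
    Returns every region seen along the way as a list of frozensets.
    """
    regions = {sid: frozenset(px) for sid, px in superpixels.items()}
    edges = dict(boundary_strength)
    candidates = list(regions.values())      # assumption: keep initial superpixels
    next_id = max(regions) + 1
    while len(regions) > 1 and edges:
        pair = min(edges, key=edges.get)     # weakest boundary
        a, b = pair
        merged = regions.pop(a) | regions.pop(b)
        del edges[pair]
        # Re-attach former neighbours of a and b to the merged region.
        new_edges = {}
        for e, w in list(edges.items()):
            if a in e or b in e:
                (other,) = e - {a, b}
                key = frozenset({next_id, other})
                new_edges[key] = min(w, new_edges.get(key, w))  # assumption: keep weaker
                del edges[e]
        edges.update(new_edges)
        regions[next_id] = merged
        candidates.append(merged)
        next_id += 1
    return candidates


superpixels = {0: {(0, 0)}, 1: {(0, 1)}, 2: {(0, 2)}}
strengths = {frozenset({0, 1}): 0.2, frozenset({1, 2}): 0.9}
candidates = merge_hierarchy(superpixels, strengths)
```

In this toy input, superpixels 0 and 1 merge first because their shared boundary is the weakest, and the last candidate covers the whole frame.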

• Foreground confidence
• Measure the foreground confidence of each candidate region
• Appearance confidence $\phi_i^{(t)}$
• Obtain a saliency map using the technique in [1]
• Average the saliency values within the candidate region
• Edge confidence $\psi_i^{(t)}$
• Combine the color-based edge map and the motion-based edge map
$c_i^{(t)} = \phi_i^{(t)} + \psi_i^{(t)}$
[1] W.-D. Jang, C. Lee, and C.-S. Kim, “Primary object segmentation in videos via alternate convex optimization of foreground and background distributions,” CVPR, 2016.

• Foreground confidence
• Select the top 20 candidate regions
• Warp the selected candidate regions to neighboring frames
• Rearrange the set of candidate regions $\mathcal{Q}^{(t)} = \{q_1^{(t)}, q_2^{(t)}, \ldots, q_N^{(t)}\}$
• Feature description
• Describe the feature $\mathbf{f}_i^{(t)}$ of each candidate region $q_i^{(t)}$ using the bag-of-visual-words approach
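A bag-of-visual-words feature is a normalized histogram of codeword assignments. The sketch below assumes that local descriptors for a region and a codebook of visual words are already available; the slides do not say which descriptors or codebook size are used, so `bovw_histogram`, `nearest_word`, and the toy data are purely illustrative.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bovw_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword assignments for one region."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


codebook = [[0.0, 0.0], [1.0, 1.0]]                       # hypothetical visual words
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.2], [0.0, 0.2]]
feature = bovw_histogram(descriptors, codebook)
```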

Initial Region Estimation
• Selecting initial primary object regions
• Choose the main region $q_\delta^{(t)}$ among the candidate regions
• Exploit the recurrence property that a primary object appears repeatedly in a video sequence
[Figure: input frames → candidate region generation → initial region estimation]

• Assume that the feature of the main region $q_\delta^{(t)}$ should be similar to the features of the main regions in the other frames
• $\mathbf{p}^{(\tau)}$ denotes the feature of the main region in frame $I^{(\tau)}$
$\delta = \arg\min_i \sum_{\tau=1,\,\tau\neq t}^{T} d_\chi(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)})$
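The selection rule above picks the candidate whose feature is closest, in total chi-square distance, to the main-region features of all other frames. The sketch below assumes the common symmetric form of the chi-square distance with a factor of 1/2 (the slides do not give the exact form); `chi2` and `select_main_region` are hypothetical names, and frames are indexed from 0.

```python
def chi2(a, b, eps=1e-12):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


def select_main_region(features_t, p, t):
    """Index of the candidate in frame t minimizing the summed distance to p[tau], tau != t.

    features_t: list of candidate feature vectors for frame t
    p: list of main-region features, one per frame
    """
    def total_distance(f):
        return sum(chi2(f, p_tau) for tau, p_tau in enumerate(p) if tau != t)

    return min(range(len(features_t)), key=lambda i: total_distance(features_t[i]))


p = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]           # toy main-region features
candidates_frame0 = [[0.0, 1.0], [0.85, 0.15], [0.5, 0.5]]
delta = select_main_region(candidates_frame0, p, t=0)
```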

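• Initialization of $\mathbf{p}^{(\tau)}$
• Superpose the features of all candidate regions in $\mathcal{Q}^{(\tau)}$
• Combine the features of the candidate regions, $\mathbf{F}^{(\tau)} = [\mathbf{f}_1^{(\tau)}, \ldots, \mathbf{f}_N^{(\tau)}]$, using the foreground confidence vector $\mathbf{c}^{(\tau)} = [c_1^{(\tau)}, \ldots, c_N^{(\tau)}]^{\top}$
$\mathbf{p}^{(\tau)} = \mathbf{F}^{(\tau)} \mathbf{c}^{(\tau)}$
• Obtain the main region $q_\delta^{(t)}$ for each frame by applying $\mathbf{p}^{(\tau)}$
• Alternate update of the main regions
• Update $\mathbf{p}^{(t)}$ for each frame by $\mathbf{p}^{(t)} \leftarrow \mathbf{f}_\delta^{(t)}$
• Choose the main region using the updated features
$\delta = \arg\min_i \sum_{\tau=1,\,\tau\neq t}^{T} d_\chi(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)})$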
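The initialization and alternating update can be sketched as below. Several details are assumptions rather than facts from the slides: all frames are re-selected against the features from the previous round, the loop stops when no selection changes or after `max_iters` rounds, and the chi-square distance uses the common symmetric form. The names `init_main_feature` and `alternate_update` are hypothetical.

```python
def chi2(a, b, eps=1e-12):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


def init_main_feature(F, c):
    """p = F c: confidence-weighted superposition of the candidate features."""
    return [sum(c_i * f[k] for f, c_i in zip(F, c)) for k in range(len(F[0]))]


def alternate_update(features, confidences, max_iters=20):
    """features[t][i] is the feature of candidate i in frame t; returns delta per frame."""
    T = len(features)
    p = [init_main_feature(features[t], confidences[t]) for t in range(T)]
    delta = [None] * T
    for _ in range(max_iters):
        new_delta = []
        for t in range(T):
            def total(f, t=t):
                return sum(chi2(f, p[tau]) for tau in range(T) if tau != t)
            new_delta.append(
                min(range(len(features[t])), key=lambda i: total(features[t][i]))
            )
        if new_delta == delta:          # assumption: stop when selections are stable
            break
        delta = new_delta
        p = [features[t][delta[t]] for t in range(T)]   # p(t) <- f_delta(t)
    return delta


features = [
    [[0.8, 0.2], [0.2, 0.8]],
    [[0.1, 0.9], [0.7, 0.3]],
    [[0.75, 0.25], [0.3, 0.7]],
]
confidences = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
main_regions = alternate_update(features, confidences)
```

In the toy data, the candidate resembling `[0.8, 0.2]` recurs in every frame, so it is selected as the main region throughout.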

Primary Object Region Refinement
• Refinement of primary object regions
• Initial regions may exclude parts of primary objects or include noisy regions (background or other objects)
• Attempt to refine initial regions
• Augment initial regions with missing regions
• Reduce initial regions by removing noisy regions

• Augmented regions
• Augment the initial region $q_\delta^{(t)}$ with a candidate region $q_i^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_i^{(t)} = q_\delta^{(t)} \cup q_i^{(t)}$
• Reduced regions
• Reduce the initial region $q_\delta^{(t)}$ using a candidate region $q_j^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_j^{(t)} = q_\delta^{(t)} \cap q_j^{(t)}$
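With regions modeled as sets of `(row, col)` pixels, the two constructions are plain set operations. The sketch follows the formulas as written on the slide (union for augmentation, intersection for reduction); the function names and the toy regions are hypothetical.

```python
def augmented_regions(q_delta, candidates):
    """One augmented region per candidate: union with the initial region."""
    return [q_delta | q for q in candidates]


def reduced_regions(q_delta, candidates):
    """One reduced region per candidate: intersection with the initial region."""
    return [q_delta & q for q in candidates]


q_delta = frozenset({(0, 0), (0, 1), (1, 1)})
candidates = [frozenset({(0, 1), (0, 2)}), frozenset({(0, 0), (0, 1)})]
augmented = augmented_regions(q_delta, candidates)
reduced = reduced_regions(q_delta, candidates)
```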

• Augmentation and reduction process (ARP)
• Determine whether to augment or reduce $q_\delta^{(t)}$ using a cost function
$C(r_i^{(t)}) = C_{\text{data}}(r_i^{(t)}) + \gamma \cdot C_{\text{seg}}(r_i^{(t)})$
• Data cost
• Require the refined region $r_i^{(t)}$ to be similar to the initial regions in all frames
$C_{\text{data}}(r_i^{(t)}) = \frac{1}{T} \sum_{\tau=1}^{T} d_\chi(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_\delta^{(\tau)})$
• Segmentation cost
• Make the refined region $r_i^{(t)}$ as dissimilar from its nearby background as possible
$C_{\text{seg}}(r_i^{(t)}) = -d_\chi(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_{\mathrm{b},i}^{(t)})$
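The cost can be evaluated directly once the three kinds of features are available. The sketch below assumes the data term averages the distance to the main-region feature of every frame, as the data-cost bullet describes, and uses the common symmetric chi-square distance. The slides give no value for γ, so `gamma=1.0` is only a placeholder, and `arp_cost` is a hypothetical name.

```python
def chi2(a, b, eps=1e-12):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


def arp_cost(f_region, f_background, main_features, gamma=1.0):
    """C = C_data + gamma * C_seg for one refined region.

    f_region: feature of the refined region
    f_background: feature of its nearby background
    main_features: features of the initial (main) regions, one per frame
    """
    c_data = sum(chi2(f_region, f) for f in main_features) / len(main_features)
    c_seg = -chi2(f_region, f_background)
    return c_data + gamma * c_seg


cost = arp_cost(
    f_region=[0.5, 0.5],
    f_background=[0.0, 1.0],
    main_features=[[0.5, 0.5], [0.5, 0.5]],
)
```

A region that matches the main regions and differs strongly from its surroundings gets a low (here negative) cost.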

• Minimize the cost function for the optimal refined region
$r_*^{(t)} = \arg\min_{r_i^{(t)}} C(r_i^{(t)})$
• Perform ARP iteratively
• Construct the set of augmented and reduced regions again by employing $r_*^{(t)}$ as the initial region
• Find the optimal $r_*^{(t)}$ by minimizing $C(r_i^{(t)})$
• Repeat until $r_*^{(t)}$ is unchanged
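The iteration can be sketched as a greedy loop over augmented and reduced regions. Here `cost` is any callable that scores a region, standing in for the cost function $C$; the stopping rule (stop when no strictly cheaper region exists), the skipping of empty reduced regions, and the `max_iters` cap are assumptions. The toy cost in the example, the size of the symmetric difference to a known target, is purely illustrative.

```python
def arp_refine(initial, candidates, cost, max_iters=50):
    """Repeatedly replace the region by its cheapest augmented or reduced version."""
    current = frozenset(initial)
    for _ in range(max_iters):
        pool = set()
        for q in candidates:
            q = frozenset(q)
            pool.add(current | q)            # augmented region
            reduced = current & q            # reduced region, as written on the slide
            if reduced:                      # assumption: ignore empty regions
                pool.add(reduced)
        best = min(pool, key=cost)
        if cost(best) >= cost(current):      # unchanged: no strictly better region
            break
        current = best
    return current


target = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})
initial = {(0, 0), (0, 1), (5, 5)}
candidates = [
    {(1, 0), (1, 1)},
    {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)},
    {(5, 5)},
]
refined = arp_refine(initial, candidates, cost=lambda r: len(r ^ target))
```

In this example the first pass augments the region with the missing pixels, and the second pass reduces it to drop the noisy pixel `(5, 5)`.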

Experimental results
• DAVIS dataset [2]
• 50 video sequences (3,455 annotated frames)
• Performance measure
• Region similarity $\mathcal{J}$: intersection over union
• Contour accuracy $\mathcal{F}$: F-measure, the harmonic mean of the contour precision and recall rates
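Both measures reduce to short formulas. The sketch below models masks as sets of pixels and takes contour precision and recall as given numbers, since the contour-matching step itself is not described in the slides. Returning 1.0 for two empty masks is a convention chosen here, not something the slides specify.

```python
def region_similarity(pred, gt):
    """Intersection over union of two pixel sets."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


j = region_similarity({1, 2, 3}, {2, 3, 4})
f = f_measure(0.5, 1.0)
```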
[2] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” CVPR, 2016.

• Impacts of ARP
• Compare ARP with the conventional refinement techniques [20, 36]
• Apply refinement techniques to our initial regions (IR)
[20] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” ICCV, 2013.
[36] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” CVPR, 2013.

• Quantitative comparison
• Semi-supervised: Human annotation at the first frame
• Multiple VOS: Output multiple objects
• POS: Output the primary object

• Qualitative results

Multiple Object Segmentation
• Multiple object segmentation
• Motion segmentation
• Cluster point trajectories in a video
• Video object proposal
• Proposal matching
• Proposal clustering
• Segmentation guided by object detection and tracking

CDTS: Collaborative Detection, Tracking, and Segmentation for Online Multiple Object Segmentation in Videos
• Overview
• Input: A set of consecutive video frames
• Output: Multiple segment tracks
[Figure: pipeline overview with panels for input frames, object track generation (joint detection and tracking), detection and tracking results, and ASE segmentation]

Object Track Generation
• Joint detection and tracking
• Detector [3]
• Find object location without manual annotations
• Some objects may remain undetected
• Tracker [4]
• Boost the recall rate of objects using temporal correlations
• Three cases
• Both detection and tracking boxes
• Only detection box
• Only tracking box
[3] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” NIPS, 2016.
[4] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, “SOWP: Spatially ordered and weighted patch descriptor for visual tracking,” ICCV, 2015.

Object Track Generation
• Joint detection and tracking
• Both detection and tracking boxes
• Match detection and tracking boxes
• The Hungarian algorithm
• Choose the more accurate box for each matching pair
• Link the selected box to the corresponding object track
• Unmatched detection box
• Regard as newly appearing object
• Unmatched tracking box
• Link to the corresponding object track
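The three cases can be illustrated with a small matching routine. The slides name the Hungarian algorithm but not the matching cost, so this sketch assumes box IoU as the affinity and a hypothetical `min_iou` threshold for accepting a pair. To stay dependency-free it finds the optimal assignment by exhaustive search over permutations, which gives the same optimum as the Hungarian algorithm but is only practical for a handful of boxes. Choosing the more accurate box within each matched pair is left out.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(detections, tracks, min_iou=0.5):
    """Return (matches, unmatched detections, unmatched tracks) as index lists."""
    iou = [[box_iou(d, t) for t in tracks] for d in detections]
    n_d, n_t = len(detections), len(tracks)
    if n_d <= n_t:
        assignments = (list(enumerate(p)) for p in permutations(range(n_t), n_d))
    else:
        assignments = (
            [(i, j) for j, i in enumerate(p)] for p in permutations(range(n_d), n_t)
        )
    # Exhaustive stand-in for the Hungarian algorithm: maximize total IoU.
    best = max(assignments, key=lambda pairs: sum(iou[i][j] for i, j in pairs))
    matches = [(i, j) for i, j in best if iou[i][j] >= min_iou]
    matched_d = {i for i, _ in matches}
    matched_t = {j for _, j in matches}
    new_objects = [i for i in range(n_d) if i not in matched_d]   # only detection box
    track_only = [j for j in range(n_t) if j not in matched_t]    # only tracking box
    return matches, new_objects, track_only


detections = [(0, 0, 10, 10), (50, 50, 60, 60)]
tracks = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches, new_objects, track_only = match_boxes(detections, tracks)
```

Detection 0 and track 0 overlap strongly and are matched; detection 1 is treated as a newly appearing object, and track 1 is carried forward by the tracker alone. For larger problems a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.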

ASE Segmentation
• Alternate shrinking and expansion (ASE)
• Over-segment each frame into superpixels
• Classify each superpixel within and near the box as either foreground or background

ASE Segmentation
• Over-segmentation
• Obtain superpixels using UCM
• Preliminary classification
• Exploit overlap ratio between the box and each superpixel
• Refine preliminary foreground regions
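The preliminary classification can be sketched as a threshold on the overlap ratio. The slides do not define the ratio or the threshold, so this example assumes the ratio is the fraction of a superpixel's pixels that fall inside the box and uses a hypothetical threshold of 0.5. Pixels are `(row, col)` pairs and the box is `(x1, y1, x2, y2)` with exclusive upper bounds.

```python
def overlap_ratio(superpixel, box):
    """Fraction of the superpixel's pixels that lie inside the box (assumed definition)."""
    x1, y1, x2, y2 = box
    inside = sum(1 for (r, c) in superpixel if y1 <= r < y2 and x1 <= c < x2)
    return inside / len(superpixel)


def preliminary_foreground(superpixels, box, threshold=0.5):
    """Ids of superpixels labeled foreground; all others are background."""
    return {
        sid for sid, pixels in superpixels.items()
        if overlap_ratio(pixels, box) >= threshold
    }


superpixels = {
    0: {(0, 0), (0, 1), (1, 0), (1, 1)},   # fully inside the box
    1: {(0, 2), (0, 3), (1, 2), (1, 3)},   # half inside
    2: {(0, 4), (0, 5)},                   # outside
}
foreground = preliminary_foreground(superpixels, box=(0, 0, 3, 2))
```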

ASE Segmentation
• Intra-frame refinement
• Constrain foreground regions to have strong edges along their boundaries
• Boundary cost
$C_{\text{bnd}}(F_i^{(t)}) = -\sum_{\mathbf{x} \in \partial F_i^{(t)}} U^{(t)}(\mathbf{x})$
• Shrink foreground regions by removing superpixels to minimize the boundary cost in a greedy manner
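Greedy shrinking against the boundary cost can be sketched as follows. The example assumes that the boundary of a region is the set of its pixels with at least one 4-connected neighbor outside the region, that `edge` maps pixels to edge strengths (standing in for $U^{(t)}$), and that shrinking stops when no single removal lowers the cost or when one superpixel is left. All names are hypothetical.

```python
def boundary(pixels):
    """Pixels of the region that touch a 4-connected neighbour outside it."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (r, c) for (r, c) in pixels
        if any((r + dr, c + dc) not in pixels for dr, dc in steps)
    }


def boundary_cost(pixels, edge):
    """C_bnd = minus the summed edge strength over the region boundary."""
    return -sum(edge.get(p, 0.0) for p in boundary(pixels))


def shrink(foreground_ids, superpixels, edge):
    """Greedily drop the superpixel whose removal lowers the boundary cost most."""
    fg = set(foreground_ids)

    def cost(ids):
        return boundary_cost(set().union(*(superpixels[i] for i in ids)), edge)

    current = cost(fg)
    while len(fg) > 1:                       # assumption: never empty the region
        best_id = min(fg, key=lambda i: cost(fg - {i}))
        new_cost = cost(fg - {best_id})
        if new_cost >= current:
            break
        fg.remove(best_id)
        current = new_cost
    return fg


# Two 3x3 superpixels side by side; strong edges surround the left one only.
superpixels = {
    0: {(r, c) for r in range(3) for c in range(0, 3)},
    1: {(r, c) for r in range(3) for c in range(3, 6)},
}
edge = {p: 1.0 for p in superpixels[0]}
edge.update({p: 0.1 for p in superpixels[1]})
kept = shrink({0, 1}, superpixels, edge)
```

Removing the right superpixel moves the boundary onto the strong edges around the left one, which lowers the cost, so only superpixel 0 is kept.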

ASE Segmentation
• Inter-frame refinement
• Require the refined region to be similar to the segmentation results in previous frames
• Cost function
$C_{\text{inter}}(F_i^{(t)}, \mathcal{B}_i^{(t)}) = \alpha \cdot C_{\text{tmp}}(F_i^{(t)}) + C_{\text{seg}}(F_i^{(t)}, \mathcal{B}_i^{(t)}) + C_{\text{bnd}}(F_i^{(t)})$
• Expand foreground regions by adding superpixels
• Perform shrinking in a similar way

Experimental Results
• YouTube-Objects dataset
• Contains 126 videos across 10 object classes
• Performance measure
• Intersection over union (IoU)
[34] Y.-H. Tsai, G. Zhong, and M.-H. Yang, “Semantic co-segmentation in videos,” ECCV, 2016.
[42] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia, “Semantic object segmentation via detection in weakly labeled video,” CVPR, 2015.
