Video Object Segmentation
Korea University, Yeong Jun Koh (고영준)
Segmentation
• Divide data into meaningful segments
Segmentation
[Figure: examples of superpixels, image segmentation, video segmentation, and video object segmentation]
Video Object Segmentation
• Semi-supervised video object segmentation
• Primary object segmentation
• Multiple object segmentation
Semi-supervised Video Object Segmentation
• Track and segment a target object
• Annotated by a user in the first frame
[Figure: first frame with user annotation, and the resulting segment track]
Primary Object Segmentation
• Segment a primary object in a video automatically
Primary object: Diver
Primary object: Tennis player
Multiple Object Segmentation
• Extract as many segment tracks as possible
Primary Object Segmentation
Primary Object Segmentation
• Primary object segmentation
• Initial region estimation
• Motion boundaries
• Object proposal
• Saliency maps
• Refinement
• Construct models for the primary object and the background,
e.g. Gaussian mixture models (GMMs)
• Propose augmentation and reduction process (ARP)
Primary Object Segmentation in Videos Based on
Region Augmentation and Reduction
• Overview
• Input: A set of consecutive video frames
• Output: A set of pixel-wise segments to delineate the primary
object
Candidate Region Generation
• Candidate regions
• Ultrametric contour map (UCM)
• Obtain color-based and motion-based UCMs
• Each region in UCM becomes a superpixel
Candidate Region Generation
• Candidate regions
• Generate candidate regions by merging neighboring superpixels
• Determine the pair, $s_m$ and $s_n$, sharing the weakest boundary
• Merge $s_m$ and $s_n$ into a single superpixel
• Repeat this process until only one superpixel remains
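The merging steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: superpixels are sets of pixel coordinates, `boundary_strength` is a hypothetical dictionary of strengths for neighboring pairs, and when two boundaries collapse into one the weaker value is kept, which the slides do not specify.

```python
def generate_candidate_regions(superpixels, boundary_strength):
    """Greedily merge the pair sharing the weakest boundary until one region remains.

    superpixels: dict {id: set of pixel coordinates}
    boundary_strength: dict {frozenset({a, b}): strength} for neighboring pairs
    Returns every region (as a frozenset of pixels) seen during merging.
    """
    regions = {k: set(v) for k, v in superpixels.items()}
    strength = dict(boundary_strength)
    candidates = [frozenset(p) for p in regions.values()]
    next_id = max(regions) + 1
    while len(regions) > 1 and strength:
        pair = min(strength, key=strength.get)  # weakest shared boundary
        m, n = tuple(pair)
        merged = regions.pop(m) | regions.pop(n)
        del strength[pair]
        # Re-attach the neighbors of s_m and s_n to the merged superpixel.
        updated = {}
        for other_pair, w in list(strength.items()):
            if other_pair & pair:
                (k,) = other_pair - pair
                key = frozenset({next_id, k})
                # Assumption: keep the weaker of the two old boundaries.
                updated[key] = min(w, updated.get(key, w))
                del strength[other_pair]
        strength.update(updated)
        regions[next_id] = merged
        candidates.append(frozenset(merged))
        next_id += 1
    return candidates
```

Every intermediate region is kept, so a frame with N superpixels whose adjacency graph is connected yields 2N - 1 candidate regions.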
Candidate Region Generation
• Foreground confidence
• Measure the foreground confidence of each candidate region
• Appearance confidence $\phi_i^{(t)}$
• Obtain a saliency map using the technique in [1]
• Average the saliency values within the candidate region
• Edge confidence $\psi_i^{(t)}$
• Combine color-based edge map and motion-based edge map
$c_i^{(t)} = \phi_i^{(t)} + \psi_i^{(t)}$
[1] W.-D. Jang, C. Lee, and C.-S. Kim, “Primary object segmentation in videos via alternate convex optimization of foreground and
background distributions,” CVPR, 2016
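As a rough sketch of the foreground confidence, the snippet below adds an appearance term and an edge term for one candidate region. The slides only say that the two edge maps are combined; averaging them and taking the mean along the region boundary are assumptions made here for illustration, and all maps are hypothetical dictionaries keyed by pixel coordinate.

```python
def appearance_confidence(saliency, region):
    """phi: average saliency value inside the candidate region."""
    return sum(saliency[p] for p in region) / len(region)


def edge_confidence(color_edge, motion_edge, boundary):
    """psi (hypothetical form): mean combined edge strength along the region boundary."""
    # Assumption: the two edge maps are combined by a plain average.
    combined = [0.5 * (color_edge[p] + motion_edge[p]) for p in boundary]
    return sum(combined) / len(combined)


def foreground_confidence(saliency, color_edge, motion_edge, region, boundary):
    """c = phi + psi for one candidate region."""
    phi = appearance_confidence(saliency, region)
    psi = edge_confidence(color_edge, motion_edge, boundary)
    return phi + psi
```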
Candidate Region Generation
• Foreground confidence
• Select the top 20 candidate regions
• Warp the selected candidate regions to neighboring frames
• Rearrange the set of candidate regions $\mathcal{Q}^{(t)} = \{q_1^{(t)}, q_2^{(t)}, \ldots, q_N^{(t)}\}$
• Feature description
• Describe the feature $\mathbf{f}_i^{(t)}$ of each candidate region $q_i^{(t)}$ using the bag-of-visual-words approach
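A bag-of-visual-words feature can be illustrated as follows. The local descriptors and the codebook are hypothetical inputs, since the slides do not say which descriptors or how many visual words are used; the sketch only shows the quantize-and-count step.

```python
def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword; return a normalized histogram."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(d, w)) for w in codebook]
        hist[dists.index(min(dists))] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Normalizing the histogram makes regions of different sizes comparable.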
Initial Region Estimation
• Selecting initial primary object regions
• Choose the main region $q_\delta^{(t)}$ among candidate regions
• Exploit the recurrence property that a primary object appears
repeatedly in a video sequence
[Figure: input frames, candidate region generation, initial region estimation]
Initial Region Estimation
• Selecting initial primary object regions
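• Assume that the feature of the main region $q_\delta^{(t)}$ should be similar to the features of the main regions in the other frames
• $\mathbf{p}^{(\tau)}$ denotes the feature of the main region in frame $I^{(\tau)}$
$\delta = \arg\min_i \sum_{\tau=1,\,\tau\neq t}^{T} d_\chi\big(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)}\big)$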
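The selection rule on this slide can be sketched as below, assuming the distance in the formula is a chi-square distance between histograms. The particular chi-square form (with the factor 0.5 and a small `eps`) is a common convention chosen here, not something the slides state.

```python
def chi_square(f, g, eps=1e-12):
    """A common chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(f, g))


def select_main_region(features_t, main_features, t):
    """Return the index of the candidate in frame t closest to the other frames' main features.

    features_t: list of candidate features for frame t
    main_features: list with one main-region feature per frame
    t: index of the current frame, which is excluded from the sum
    """
    def cost(f):
        return sum(chi_square(f, p) for tau, p in enumerate(main_features) if tau != t)

    return min(range(len(features_t)), key=lambda i: cost(features_t[i]))
```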
Initial Region Estimation
• Selecting initial primary object regions
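• Initialization of $\mathbf{p}^{(\tau)}$
• Superpose the features of all candidate regions in $\mathcal{Q}^{(\tau)}$
• Combine the features of candidate regions, $\mathbf{F}^{(\tau)} = [\mathbf{f}_1^{(\tau)}, \ldots, \mathbf{f}_N^{(\tau)}]$, using the foreground confidence vector $\mathbf{c}^{(\tau)} = [c_1^{(\tau)}, \ldots, c_N^{(\tau)}]^T$
$\mathbf{p}^{(\tau)} = \mathbf{F}^{(\tau)} \mathbf{c}^{(\tau)}$
• Obtain the main region $q_\delta^{(t)}$ by applying $\mathbf{p}^{(\tau)}$ for each frame
$\delta = \arg\min_i \sum_{\tau=1,\,\tau\neq t}^{T} d_\chi\big(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)}\big)$
• Alternate update of the main regions
• Update $\mathbf{p}^{(\tau)}$ for each frame by $\mathbf{p}^{(\tau)} \leftarrow \mathbf{f}_\delta^{(\tau)}$
• Choose the main region using the updated features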
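Putting initialization and the alternating update together gives a loop like the one below. It is a sketch under assumptions: the distance is taken to be chi-square, all frames are updated synchronously, and the loop stops when the selected indices no longer change or after a hypothetical `max_iters`.

```python
def chi_square(f, g, eps=1e-12):
    """A common chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(f, g))


def init_main_feature(F, c):
    """p = F c: superpose candidate features weighted by foreground confidence."""
    dim = len(F[0])
    return [sum(c_i * f[k] for f, c_i in zip(F, c)) for k in range(dim)]


def estimate_main_regions(features, confidences, max_iters=10):
    """features[t][i] is the feature of candidate i in frame t; returns one index per frame."""
    T = len(features)
    p = [init_main_feature(features[t], confidences[t]) for t in range(T)]
    delta = [None] * T
    for _ in range(max_iters):
        new_delta = []
        for t in range(T):
            costs = [
                sum(chi_square(f, p[tau]) for tau in range(T) if tau != t)
                for f in features[t]
            ]
            new_delta.append(costs.index(min(costs)))
        if new_delta == delta:  # main regions stopped changing
            break
        delta = new_delta
        # Replace each p by the feature of the newly chosen main region.
        p = [features[t][delta[t]] for t in range(T)]
    return delta
```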
Primary Object Region Refinement
• Refinement of primary object regions
• Initial regions may exclude parts of primary objects or include
noisy regions (background or other objects)
• Attempt to refine initial regions
• Augment initial regions with missing regions
• Reduce initial regions by removing noisy regions
Primary Object Region Refinement
• Augmented regions
• Augment the initial region $q_\delta^{(t)}$ with a candidate region $q_i^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_i^{(t)} = q_\delta^{(t)} \cup q_i^{(t)}$
• Reduced regions
• Reduce the initial region $q_\delta^{(t)}$ using a candidate region $q_j^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_j^{(t)} = q_\delta^{(t)} \cap q_j^{(t)}$
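With regions represented as sets of pixel coordinates, the union and intersection on this slide map directly onto set operations. The representation is an assumption made for the sketch.

```python
def build_refinement_candidates(q_delta, candidates):
    """Return (augmented, reduced) region lists for one frame.

    q_delta: set of pixel coordinates of the initial region
    candidates: iterable of candidate regions, each a set of pixel coordinates
    """
    candidates = list(candidates)
    augmented = [q_delta | q for q in candidates]  # union with each candidate
    reduced = [q_delta & q for q in candidates]    # intersection with each candidate
    return augmented, reduced
```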
Primary Object Region Refinement
• Augmentation and reduction process (ARP)
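• Determine whether to augment or reduce $q_\delta^{(t)}$ by a cost function
$C\big(r_i^{(t)}\big) = C_{\mathrm{data}}\big(r_i^{(t)}\big) + \gamma \cdot C_{\mathrm{seg}}\big(r_i^{(t)}\big)$
• Data cost
• Constrain the refined region $r_i^{(t)}$ to be similar to the initial regions in all frames
$C_{\mathrm{data}}\big(r_i^{(t)}\big) = \frac{1}{T} \sum_{\tau=1}^{T} d_\chi\big(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_\delta^{(\tau)}\big)$
• Segmentation cost
• Make the refined region $r_i^{(t)}$ as dissimilar from its nearby background as possible
$C_{\mathrm{seg}}\big(r_i^{(t)}\big) = -d_\chi\big(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_{\mathrm{b},i}^{(t)}\big)$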
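The two cost terms can be evaluated as in the sketch below. It assumes a chi-square distance, takes the features of the initial regions in all frames as a list, and uses a placeholder default for the weight `gamma`, whose actual value is not given in the slides.

```python
def chi_square(f, g, eps=1e-12):
    """A common chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(f, g))


def arp_cost(f_region, f_background, initial_features, gamma=1.0):
    """Data cost plus weighted segmentation cost for one refined region.

    f_region: feature of the refined region
    f_background: feature of its nearby background
    initial_features: features of the initial regions, one per frame
    gamma: trade-off weight (placeholder default)
    """
    c_data = sum(chi_square(f_region, f) for f in initial_features) / len(initial_features)
    c_seg = -chi_square(f_region, f_background)
    return c_data + gamma * c_seg
```

A low cost therefore means the region looks like the primary object across the video while differing strongly from its surroundings.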
Primary Object Region Refinement
• Augmentation and reduction process (ARP)
• Minimize the cost function for the optimal refined region
$r_*^{(t)} = \arg\min_{r_i^{(t)}} C\big(r_i^{(t)}\big)$
• Perform ARP iteratively
• Construct the set of augmented and reduced regions again by employing $r_*^{(t)}$ as the initial region
• Find the optimal $r_*^{(t)}$ by minimizing $C\big(r_i^{(t)}\big)$
• Repeat until $r_*^{(t)}$ is unchanged
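The iteration can be sketched as a simple fixed-point loop. The cost is passed in as a function so the sketch stays independent of how features are computed. Keeping the current region in the pool (so that "unchanged" is well defined), skipping empty reductions, and the `max_iters` cap are assumptions of this sketch.

```python
def run_arp(initial, candidates, cost, max_iters=50):
    """Iteratively augment or reduce a region until the minimizer stops changing.

    initial: iterable of pixel coordinates (the initial region)
    candidates: iterable of candidate regions (iterables of pixel coordinates)
    cost: callable mapping a frozenset region to a float
    """
    candidates = [frozenset(q) for q in candidates]
    best = frozenset(initial)
    for _ in range(max_iters):
        pool = {best}  # assumption: the current region competes too
        for q in candidates:
            pool.add(best | q)      # augmented region
            reduced = best & q      # reduced region
            if reduced:
                pool.add(reduced)
        new_best = min(pool, key=cost)
        if new_best == best:
            break
        best = new_best
    return best
```

For example, with a toy cost that counts the pixels differing from a known target, the loop first augments the region with the missing pixels and then reduces it to drop the outlier.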
Primary Object Region Refinement
• Augmentation and reduction process (ARP)
Experimental results
• DAVIS dataset [2]
• 50 video sequences (3,455 annotated frames)
• Performance measure
• Region similarity 𝒥: Intersection over union
• Contour accuracy ℱ: F-measure, the harmonic mean of the contour precision and recall rates
[2] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” CVPR, 2016.
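The two measures reduce to short formulas. The sketch below computes the intersection over union of two masks given as sets of pixel coordinates, and the F-measure from precision and recall values that are assumed to be already computed; extracting and matching contours is outside its scope.

```python
def region_similarity(pred, gt):
    """Intersection over union of two sets of foreground pixel coordinates."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0
```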
Experimental results
• Impacts of ARP
• Compare ARP with the conventional refinement techniques [20, 36]
• Apply refinement techniques to our initial regions (IR)
[20] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” ICCV, 2013.
[36] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” CVPR, 2013.
Experimental results
• Quantitative comparison
• Semi-supervised: Human annotation at the first frame
• Multiple VOS: Output multiple objects
• POS: Output the primary object
Experimental results
• Qualitative results
Multiple Object Segmentation
Multiple Object Segmentation
• Multiple object segmentation
• Motion segmentation
• Cluster point trajectories in a video
• Video object proposal
• Proposal matching
• Proposal clustering
• Segmentation guided by object detection and tracking
CDTS: Collaborative Detection, Tracking, and Segmentation
for Online Multiple Object Segmentation in videos
• Overview
• Input: A set of consecutive video frames
• Output: Multiple segment tracks
[Figure: pipeline overview. Labels: input frames; joint detection and tracking (object track generation); detection and tracking results; ASE segmentation]
Object Track Generation
• Joint detection and tracking
• Detector [3]
• Find object location without manual annotations
• Some objects may remain undetected
• Tracker [4]
• Boost the recall rate of objects using temporal correlations
• Three cases
• Both detection and tracking boxes
• Only detection box
• Only tracking box
[3] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” NIPS, 2016.
[4] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, “SOWP: Spatially ordered and weighted patch descriptor for visual tracking,” ICCV, 2015.
Object Track Generation
• Joint detection and tracking
• Both detection and tracking boxes
• Match detection and tracking boxes
• The Hungarian algorithm
• Choose the more accurate box for each matching pair
• Link the selected box to the corresponding object track
• Unmatched detection box
• Regard as newly appearing object
• Unmatched tracking box
• Link to the corresponding object track
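The three cases can be illustrated with a small matcher. To keep the sketch free of dependencies, it finds the optimal one-to-one assignment by brute force over permutations instead of the Hungarian algorithm; both return an optimal assignment, but brute force is only practical for a handful of boxes (in practice `scipy.optimize.linear_sum_assignment` would be the usual choice). Using box IoU as the matching score and the `min_iou` threshold are assumptions.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(detections, tracks, min_iou=0.5):
    """Return (matches, unmatched detection indices, unmatched tracking indices)."""
    n_d, n_t = len(detections), len(tracks)
    k = min(n_d, n_t)
    if n_d <= n_t:
        assignments = (list(zip(range(n_d), perm)) for perm in permutations(range(n_t), k))
    else:
        assignments = (list(zip(perm, range(n_t))) for perm in permutations(range(n_d), k))
    best_pairs, best_score = [], -1.0
    for pairs in assignments:
        score = sum(box_iou(detections[d], tracks[t]) for d, t in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    # Assumption: weakly overlapping pairs are treated as unmatched.
    matches = [(d, t) for d, t in best_pairs if box_iou(detections[d], tracks[t]) >= min_iou]
    matched_d = {d for d, _ in matches}
    matched_t = {t for _, t in matches}
    new_objects = [d for d in range(n_d) if d not in matched_d]
    track_only = [t for t in range(n_t) if t not in matched_t]
    return matches, new_objects, track_only
```

Unmatched detections would start new object tracks, while unmatched tracking boxes extend their existing tracks.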
ASE Segmentation
• Alternate shrinking and expansion (ASE)
• Over-segment each frame into superpixels
• Dichotomize each superpixel within and near the box into
either foreground or background class
ASE Segmentation
• Over-segmentation
• Obtain superpixels using UCM
• Preliminary classification
• Exploit overlap ratio between the box and each superpixel
• Refine preliminary foreground regions
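A preliminary classification by overlap ratio might look like the sketch below. Defining the ratio as the fraction of a superpixel's pixels that fall inside the box, and the threshold `min_overlap`, are assumptions made for illustration; the slides do not give either.

```python
def preliminary_foreground(superpixels, box, min_overlap=0.5):
    """Label superpixels that lie mostly inside the box as foreground.

    superpixels: dict {id: set of (x, y) pixel coordinates}
    box: (x1, y1, x2, y2) with x2 and y2 exclusive
    Returns the list of foreground superpixel ids.
    """
    x1, y1, x2, y2 = box
    foreground = []
    for sp_id, pixels in superpixels.items():
        inside = sum(1 for (x, y) in pixels if x1 <= x < x2 and y1 <= y < y2)
        if inside / len(pixels) >= min_overlap:
            foreground.append(sp_id)
    return foreground
```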
ASE Segmentation
• Intra-frame refinement
• Constrain foreground regions to have intense edge strengths
• Boundary cost
$C_{\mathrm{bnd}}\big(F_i^{(t)}\big) = -\sum_{\mathbf{x}\in\partial F_i^{(t)}} U^{(t)}(\mathbf{x})$
• Shrink foreground regions by removing superpixels to minimize the boundary cost in a greedy manner
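Greedy shrinking against the boundary cost can be sketched as follows. The boundary is taken to be the region pixels with at least one 4-neighbor outside the region, and the edge strength is a hypothetical dictionary keyed by pixel coordinate; both choices are assumptions, and at least one superpixel is always kept.

```python
def boundary_pixels(region):
    """Pixels of the region that have at least one 4-neighbor outside it."""
    return {
        (x, y)
        for (x, y) in region
        if any(n not in region for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)))
    }


def boundary_cost(region, edge_strength):
    """Negative sum of edge strengths along the region boundary."""
    return -sum(edge_strength.get(p, 0.0) for p in boundary_pixels(region))


def shrink(foreground_ids, superpixels, edge_strength):
    """Greedily remove the superpixel whose removal lowers the boundary cost the most."""
    fg = set(foreground_ids)

    def cost(ids):
        region = set().union(*(superpixels[i] for i in ids))
        return boundary_cost(region, edge_strength)

    current = cost(fg)
    while len(fg) > 1:
        best_id, best_cost = None, current
        for i in fg:
            c = cost(fg - {i})
            if c < best_cost:
                best_id, best_cost = i, c
        if best_id is None:  # no removal improves the cost
            break
        fg.remove(best_id)
        current = best_cost
    return fg
```

In the toy case of a 3x3 block with strong edges around it and an attached 2x3 block with no edges, removing the attached block exposes one more strong-edge pixel and lowers the cost, so only the 3x3 block remains.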
ASE Segmentation
• Inter-frame refinement
• Constrain the refined region to be similar to the segmentation results in previous frames
• Cost function
$C_{\mathrm{inter}}\big(F_i^{(t)}, \mathcal{B}_i^{(t)}\big) = \alpha \cdot C_{\mathrm{tmp}}\big(F_i^{(t)}\big) + C_{\mathrm{seg}}\big(F_i^{(t)}, \mathcal{B}_i^{(t)}\big) + C_{\mathrm{bnd}}\big(F_i^{(t)}\big)$
• Expand foreground regions by augmenting superpixels
• Perform shrinking in a similar way
ASE Segmentation
Experimental Results
• YouTube-Objects dataset
• Contains 126 videos for 10 object classes
• Performance measure
• Intersection over union (IoU)
[34] Y.-H. Tsai, G. Zhong, and M.-H. Yang, “Semantic co-segmentation in videos,” ECCV, 2016.
[42] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia, “Semantic object segmentation via detection in weakly labeled video,” CVPR, 2015.
Experimental results
• Qualitative results
Q&A
• Thank you
