Unconstrained 2D to
                 Stereoscopic 3D Image and
                 Video Conversion using
                 Semi-Automatic Energy
                 Minimization Techniques
                      Raymond Phan, Richard Rzeszutek
                      and Dimitrios Androutsos
                      Dept. of Electrical & Computer Engineering
                      Ryerson University – Toronto, Ontario, Canada
Thursday, October 24th, 2012
Outline of Presentation
• Introduction & Motivation
• Conversion Framework for Images
  – Random Walks
  – Graph Cuts / Depth Priors
• Conversion Framework for Video
  – Keyframes to Label
  – Label Tracking
• Results – Images and Videos
                                © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Introduction
• Creating stereoscopic content from single-
  view footage / 2D to 3D Conversion
  – Huge surge of popularity: Converting legacy
    content is very appealing
  – Current accepted method is quite accurate, but
    very labor intensive and difficult
  – Known as rotoscoping: Manual operators extract
    objects in a single frame to create left-right views
  – Much research in 2D to 3D conversion performed
    to alleviate difficulty, minimize time & cost
                                          © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Introduction – (2)
• Goal of 2D to 3D Conversion: Depth Map
  – B & W image showing depth at each point
  – Depth maps are the main tool for conversion
• Ultimate Goal: Automatic Conversion
  – Most methods concentrate here
  – Problem: Errors cannot be easily corrected
  – May require extensive pre/post-processing
• Solution? Semi-Automatic
  – Happy medium between auto & manual
                                       © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Motivation
• Semi-auto: Some user effort, rest is automatic
  – User marks objects/regions in the image on what is
    close/far from camera (dark/light intensities or colors)
  – For video: Mark several keyframes
     • Allow for label propagation from first frame minimizing user effort
  – Using above info, goal is to solve for the other depths in
    the entire image, or entire video
  – Results: Single depth map or a sequence of them
• How do we solve?
  – Using mixture of two semi-automatic segmentation
    algorithms: Random Walks & Graph Cuts
                                                  © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images
• Random Walks: Energy Minimization Scheme
  – Starting from a user-defined label, what is the likelihood
    of a random walker visiting all unlabeled pixels in image
  – Goal: Classify every pixel to belong to one of K labels
  – Pixel gets the label generating the highest likelihood
• Modify Random Walks to create depth maps
  – Likelihoods are probabilities: Spans the set [0,1]
  – User-defined depths and solved depths spans same set
  – Goal is to solve for one label: The depth!
                                           © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (2)
   – Use Scale-Space Random Walks in our framework
      • Pyramidal sampling scheme, with Random Walks applied to
        each resolution  Merged via Geometric Mean
   – User chooses values from [0,1] and brushes over image
   – 0: Dark intensity / color, 1: Light intensity / color
   – Resulting solved probabilities are directly used as depths
• Is this valid?
   – Yes! Psych. study done at Tel Aviv Uni. in Israel
   – As long as the user is perceptually consistent in marking

                                               © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (3)
• Random Walks does have its issues though
  – Allows for internal depth variation, but for weak borders,
    results in a “bleeding effect”
  – Regions of one depth may leak into regions of another




                                           © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (4)
• Internal depth variation = Good!
  – Minimizes cardboard cutout effect
  – RW generates depths not originally defined by user
  – But, we need to respect object boundaries
• Idea: Combine Random Walks with Graph Cuts
  – Graph Cuts = Hard Segmentation
     • Only creates result with depths/labels provided by user
  – GC solves the MAP-MRF problem with user labels
  – Consider image as a weighted connected graph
     • Solution is to solve the max-flow/min-cut of this graph
                                                   © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (5)
• NB: Making depth maps = Segmentation Problem
  – Specifically, a multi-label segmentation problem
  – But! Graph Cuts: Binary segmentation problem (FG/BG)
  – Graph Cuts also has an integer labeling, not from [0,1]
• Must modify above to become multi-label
  – Each unique user-defined label is given an integer label
  – Binary segmentation is performed for each label
     • FG: Label in question, BG: All of the other labels
  – Rest of the pixels are those we wish to label
                                                    © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (6)
  – For each label, the maximum flow values are stored
  – Label assigned to a pixel is the one with the highest flow
• Problems?
  – Object boundaries respected, but no depth variation




                                           © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (7)
• But! We can make use of this
  –   Merge Random Walks & Graph Cuts together
  –   Create a depth prior: An initial depth map estimate
  –   This is essentially Graph Cuts!
  –   We merge by feeding depth prior as additional information into RW
• Before we merge…
  – Depth maps for RW and GC must be compatible with each other
       • RW has depths of [0,1], GC has integer labels
  – Map the user-defined labels from RW of set [0,1] to an integer set
  – Perform Graph Cuts on using this integer set, and map the integer
    set back to the set of [0,1]  Use a lookup table to do this

                                                         © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Images – (8)
• Summary of Method:
  – Place user-defined strokes on image & create depth prior
  – Feed depth prior with same strokes into RW and solve
  – To modify depth map, append strokes & re-run algorithm




                                         © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video
• Essentially the same as images, but we need to:
  – Mark more than one image / keyframe
     • Result will be a sequence of depth maps
  – Assume no abrupt changes in video
     • If there are, separate manually, or use shot detection
• Also must be aware of memory constraints
  – Intuitive to process each frame individually
     • Fits well with memory, and can compute depth maps in parallel
     • However, this breaks temporal relationship  Flickering
  – Ideal to process all frames simultaneously in memory
     • But this will exhaust all available memory!
                                                     © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (2)
• How do we solve?
  – Use block processing to preserve temporal coherency
  – Process blocks of frames without exhausting memory
  – Overlapping frames within blocks are left unused
     • Each block is independent of the others
• Back to labeling: How many frames do we label?
  – We allow the user the option of manually choosing
    which ones to label
  – However, labeling only a small set of frames will result in
    depth artifacts
                                                 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (3)
• Example – From Sintel Sequence (http://www.sintel.org)
   – Shows what happens with 3 frames labeled & when all are labeled




  Depth
  maps:
  3 frames
  labeled
  only



                                                 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (4)

  Depth
  maps:
  All frames
  labeled




• Labeling all frames is better
   – Not doing this results in depth artifacts
   – For frames having no labels, moving points quickly “fade” in depth
   – Labeling all frames is better, but can be very time consuming!
                                                  © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (5)
• Labeling all frames is better – Part II
  – Instead of manually labeling all frames, label first frame
    and use a tracking algorithm to propagate the labels
  – Adjust depths of labels when object depths change
• Label Tracking?
  –   Would be very taxing to track all points in a stroke
  –   Decompose a stroke at a particular depth into N points
  –   Track each of these points separately
  –   Reconstruct the stroke using spline interpolation
                                           © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (6)
• Tracker used: Tracking-Learning-Detection (TLD)
  – Long term tracker for unconstrained video by Kalal et al.
  – Simply draw a bounding box around object in first frame
  – After, trajectory is determined for the rest of the frames,
    accounting for size and illumination changes
• How do we use this?
  – For each point in each decomposed stroke, surround a
    bounding box and track the region
  – Reconstruct each stroke using the tracked points

                                           © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conversion Framework – Video – (7)
• Modify TLD to account for object depth changes
  – Let S = {s0, s1, …, si, …, sM-1} represent the scales of
    initial bounding box drawn for stroke point
  – Create a mapping function correlating depth with
    scale  Smaller the bounding box, farther the depth
  – We know: s0/sM-1  farthest/closest user-defined
    depth d0/dmax, & scale 1.0 is the depth of stroke, du
  – Assume parabolic relationship D(x) = ax2 + bx + c, &
    solve:                    x = the bounding box scale

                                         © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results - Images
Single Frame from Avatar Trailer




                                    © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results – Images – (2)
Shots of downtown Boston




                                © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results – Videos – (1)
Big Buck Bunny – User-defined frame #1 + tracked frames

        Frame #1




                                         © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results – Videos – (2)
Big Buck Bunny – Depth Maps

        Frame #1




                                © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results – Videos – (3)
Shell-Ferrari-Partizan – User-defined frame #1 + tracked frames

Frame
#1




                                              © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Results – Videos – (4)
Shell-Ferrari-Partizan – Depth Maps

Frame
#1




                                      © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
Conclusions
• Made a semi-auto method for 2D-3D conversion
  –   Auto: Needs error correction & pre/post processing
  –   Manual: Time-consuming and expensive
  –   Happy medium between the two
  –   Allows user to correct errors instantly and re-run fast
• Works for both images and video
  – Merged two segmentation algorithms together
       • Combine merits of both methods together for better accuracy
  – Video: Modified a robust tracking algorithm to track
    user-defined labels as well as dynamically adjust depths
                                                 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org

Unconstrained 2D to Stereoscopic 3D Image and Video Conversion using Semi-Automatic Energy Minimization Techniques

  • 1.
    Unconstrained 2D to Stereoscopic 3D Image and Video Conversion using Semi-Automatic Energy Minimization Techniques Raymond Phan, Richard Rzeszutek and Dimitrios Androutsos Dept. of Electrical & Computer Engineering Ryerson University – Toronto, Ontario, Canada Thursday, October 24th, 2012
  • 2.
    Outline of Presentation •Introduction & Motivation • Conversion Framework for Images – Random Walks – Graph Cuts / Depth Priors • Conversion Framework for Video – Keyframes to Label – Label Tracking • Results – Images and Videos © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 3.
    Introduction • Creating stereoscopiccontent from single- view footage / 2D to 3D Conversion – Huge surge of popularity: Converting legacy content is very appealing – Current accepted method is quite accurate, but very labor intensive and difficult – Known as rotoscoping: Manual operators extract objects in a single frame to create left-right views – Much research in 2D to 3D conversion performed to alleviate difficulty, minimize time & cost © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 4.
    Introduction – (2) •Goal of 2D to 3D Conversion: Depth Map – B & W image showing depth at each point – Depth maps are the main tool for conversion • Ultimate Goal: Automatic Conversion – Most methods concentrate here – Problem: Errors cannot be easily corrected – May require extensive pre/post-processing • Solution? Semi-Automatic – Happy medium between auto & manual © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 5.
    Motivation • Semi-auto: Someuser effort, rest is automatic – User marks objects/regions in the image on what is close/far from camera (dark/light intensities or colors) – For video: Mark several keyframes • Allow for label propagation from first frame minimizing user effort – Using above info, goal is to solve for the other depths in the entire image, or entire video – Results: Single depth map or a sequence of them • How do we solve? – Using mixture of two semi-automatic segmentation algorithms: Random Walks & Graph Cuts © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 6.
    Conversion Framework –Images • Random Walks: Energy Minimization Scheme – Starting from a user-defined label, what is the likelihood of a random walker visiting all unlabeled pixels in image – Goal: Classify every pixel to belong to one of K labels – Pixel gets the label generating the highest likelihood • Modify Random Walks to create depth maps – Likelihoods are probabilities: Spans the set [0,1] – User-defined depths and solved depths spans same set – Goal is to solve for one label: The depth! © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 7.
    Conversion Framework –Images – (2) – Use Scale-Space Random Walks in our framework • Pyramidal sampling scheme, with Random Walks applied to each resolution  Merged via Geometric Mean – User chooses values from [0,1] and brushes over image – 0: Dark intensity / color, 1: Light intensity / color – Resulting solved probabilities are directly used as depths • Is this valid? – Yes! Psych. study done at Tel Aviv Uni. in Israel – As long as the user is perceptually consistent in marking © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 8.
    Conversion Framework –Images – (3) • Random Walks does have its issues though – Allows for internal depth variation, but for weak borders, results in a “bleeding effect” – Regions of one depth may leak into regions of another © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 9.
    Conversion Framework –Images – (4) • Internal depth variation = Good! – Minimizes cardboard cutout effect – RW generates depths not originally defined by user – But, we need to respect object boundaries • Idea: Combine Random Walks with Graph Cuts – Graph Cuts = Hard Segmentation • Only creates result with depths/labels provided by user – GC solves the MAP-MRF problem with user labels – Consider image as a weighted connected graph • Solution is to solve the max-flow/min-cut of this graph © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 10.
    Conversion Framework –Images – (5) • NB: Making depth maps = Segmentation Problem – Specifically, a multi-label segmentation problem – But! Graph Cuts: Binary segmentation problem (FG/BG) – Graph Cuts also has an integer labeling, not from [0,1] • Must modify above to become multi-label – Each unique user-defined label is given an integer label – Binary segmentation is performed for each label • FG: Label in question, BG: All of the other labels – Rest of the pixels are those we wish to label © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 11.
    Conversion Framework –Images – (6) – For each label, the maximum flow values are stored – Label assigned to a pixel is the one with the highest flow • Problems? – Object boundaries respected, but no depth variation © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 12.
    Conversion Framework –Images – (7) • But! We can make use of this – Merge Random Walks & Graph Cuts together – Create a depth prior: An initial depth map estimate – This is essentially Graph Cuts! – We merge by feeding depth prior as additional information into RW • Before we merge… – Depth maps for RW and GC must be compatible with each other • RW has depths of [0,1], GC has integer labels – Map the user-defined labels from RW of set [0,1] to an integer set – Perform Graph Cuts on using this integer set, and map the integer set back to the set of [0,1]  Use a lookup table to do this © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 13.
    Conversion Framework –Images – (8) • Summary of Method: – Place user-defined strokes on image & create depth prior – Feed depth prior with same strokes into RW and solve – To modify depth map, append strokes & re-run algorithm © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 14.
    Conversion Framework –Video • Essentially the same as images, but we need to: – Mark more than one image / keyframe • Result will be a sequence of depth maps – Assume no abrupt changes in video • If there are, separate manually, or use shot detection • Also must be aware of memory constraints – Intuitive to process each frame individually • Fits well with memory, and can compute depth maps in parallel • However, this breaks temporal relationship  Flickering – Ideal to process all frames simultaneously in memory • But this will exhaust all available memory! © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 15.
    Conversion Framework –Video – (2) • How do we solve? – Use block processing to preserve temporal coherency – Process blocks of frames without exhausting memory – Overlapping frames within blocks are left unused • Each block is independent of the others • Back to labeling: How many frames do we label? – We allow the user the option of manually choosing which ones to label – However, labeling only a small set of frames will result in depth artifacts © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 16.
    Conversion Framework –Video – (3) • Example – From Sintel Sequence (http://www.sintel.org) – Shows what happens with 3 frames labeled & when all are labeled Depth maps: 3 frames labeled only © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 17.
    Conversion Framework –Video – (4) Depth maps: All frames labeled • Labeling all frames is better – Not doing this results in depth artifacts – For frames having no labels, moving points quickly “fade” in depth – Labeling all frames is better, but can be very time consuming! © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 18.
    Conversion Framework –Video – (5) • Labeling all frames is better – Part II – Instead of manually labeling all frames, label first frame and use a tracking algorithm to propagate the labels – Adjust depths of labels when object depths change • Label Tracking? – Would be very taxing to track all points in a stroke – Decompose a stroke at a particular depth into N points – Track each of these points separately – Reconstruct the stroke using spline interpolation © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 19.
    Conversion Framework –Video – (6) • Tracker used: Tracking-Learning-Detection (TLD) – Long term tracker for unconstrained video by Kalal et al. – Simply draw a bounding box around object in first frame – After, trajectory is determined for the rest of the frames, accounting for size and illumination changes • How do we use this? – For each point in each decomposed stroke, surround a bounding box and track the region – Reconstruct each stroke using the tracked points © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 20.
    Conversion Framework –Video – (7) • Modify TLD to account for object depth changes – Let S = {s0, s1, …, si, …, sM-1} represent the scales of initial bounding box drawn for stroke point – Create a mapping function correlating depth with scale  Smaller the bounding box, farther the depth – We know: s0/sM-1  farthest/closest user-defined depth d0/dmax, & scale 1.0 is the depth of stroke, du – Assume parabolic relationship D(x) = ax2 + bx + c, & solve:  x = the bounding box scale © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 21.
    Results - Images SingleFrame from Avatar Trailer © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 22.
    Results – Images– (2) Shots of downtown Boston © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 23.
    Results – Videos– (1) Big Buck Bunny – User-defined frame #1 + tracked frames Frame #1 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 24.
    Results – Videos– (2) Big Buck Bunny – Depth Maps Frame #1 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 25.
    Results – Videos– (3) Shell-Ferrari-Partizan – User-defined frame #1 + tracked frames Frame #1 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 26.
    Results – Videos– (4) Shell-Ferrari-Partizan – Depth Maps Frame #1 © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org
  • 27.
    Conclusions • Made asemi-auto method for 2D-3D conversion – Auto: Needs error correction & pre/post processing – Manual: Time-consuming and expensive – Happy medium between the two – Allows user to correct errors instantly and re-run fast • Works for both images and video – Merged two segmentation algorithms together • Combine merits of both methods together for better accuracy – Video: Modified a robust tracking algorithm to track user-defined labels as well as dynamically adjust depths © 2012 SMPT E · T he 2012 Annual Technical C onference & Exhibition · www.smpte2012.org