ICCV 2011 Presentation
1. MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES
   Alex Flint, David Murray, and Ian Reid
   University of Oxford
2. SEMANTICS IN GEOMETRIC MODELS
   1. Motivation
   2. Prior work
   3. The indoor Manhattan representation
   4. Probabilistic model and inference
   5. Results and conclusion
3. MOTIVATION
   Single View Computer Vision vs. Multiple View Geometry
   [Figures: single-view scene understanding examples (semantic labels such as sky, tree, water, rock, human, sand, beach; scene categorisation scores for classroom, dining room, locker room, hospital room, video store) alongside multiple-view geometric reconstruction]
4. MOTIVATION
   The multiple view setting is increasingly relevant:
   • Powerful mobile devices with cameras
   • Bandwidth no longer constrains video on the internet
   • Depth sensing cameras becoming increasingly prevalent
   Structure-from-motion does not immediately solve:
   • Scene categorisation
   • Object recognition
   • Many scene understanding tasks
5. MOTIVATION
   We seek a representation that:
   • leads naturally to semantic-level scene understanding tasks;
   • integrates both photometric and geometric data;
   • is suitable for both monocular and multiple-view scenarios.
   The indoor Manhattan representation (Lee et al., 2009):
   • Parallel floor and ceiling planes
   • Walls terminate at vertical boundaries
   • A sub-class of Manhattan scenes
   Lee, Hebert, and Kanade, "Geometric reasoning for single image structure recovery", CVPR 2009
6. [Example image with overlaid questions]
   • Where would a person stand?
   • Where would doors be found?
   • What is the direction of gravity?
   • Is this an office or a house?
   • How wide (in absolute units)?
7. Goal is to ignore clutter
8. PRIOR WORK
   [Figures: target image, depth map, depth normal map, and mesh from Manhattan World Stereo; semantic labels output by the system of Posner et al.; object-centric map of Vasudevan et al.]
   • Kosecka and Zhang, "Video Compass", ECCV 2002
   • Furukawa, Curless, Seitz, and Szeliski, "Manhattan World Stereo", CVPR 2009
   • Posner, Schroeter, and Newman, "Online generation of scene descriptions in urban environments", RAS 2008
   • Vasudevan, Gachter, Nguyen, and Siegwart, "Cognitive maps for mobile robots: an object-based approach", RAS 2007
   • Bao and Savarese, "Semantic Structure From Motion", CVPR 2011
9. PRIOR WORK
   [Figures: excerpts from single-view 3D reconstruction papers, including a corridor reconstructed by Delage et al. and Make3D oversegmentation and textured 3D models]
   • Delage, Lee, and Ng, "A dynamic Bayesian network for autonomous 3d reconstruction from a single indoor image", CVPR 2006
   • Hoiem, Efros, and Hebert, "Geometric context from a single image", ICCV 2005
   • Saxena, Sun, and Ng, "Make3D: Learning 3D scene structure from a single still image", PAMI 2008
   • Lee, Hebert, and Kanade, "Geometric reasoning for single image structure recovery", CVPR 2009
10. PROBLEM STATEMENT
   Given:
   • K views of a scene
   • Camera poses from structure-from-motion
   • Point cloud
   Recover an indoor Manhattan model.
11. PRE-PROCESSING AND PARAMETRIZATION
   Pre-processing:
   1. Detect vanishing points
   2. Estimate the Manhattan homology
   3. Vertically rectify images
   Structure recovery: after rectification, the vertical seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. An indoor Manhattan model is then fully described by the boundary rows {y_x}, leading to the parametrization
       M = {y_x},  x = 1, ..., N_x
   Posterior on models (likelihood plus prior):
       log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i)
   In Flint et al. (ECCV 2010) we described an exact dynamic programming solution for problems of this form.
   [1] Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010
12. PRELIMINARIES
   Camera intrinsics: if unknown, we construct the camera matrix from the detected vanishing points, choosing the camera centre at the image centre and a focal length and aspect ratio such that the detected vanishing points are mutually orthogonal.
   Identifying the floor and ceiling planes: an indoor Manhattan scene has exactly one floor and one ceiling plane, both with normal direction v_v. The mapping H_c→f between the image locations of ceiling points and the image locations of the floor points vertically below them (see figure) is a planar homology with axis h = v_l × v_r and vertex v_v, and can be recovered from any pair of corresponding floor/ceiling points (x_f, x_c):
       H_c→f = I + μ (v_v h^T) / (v_v · h)
   where μ = {v_v, x_c ; x_f, (x_c × x_f) × h} is the characteristic cross ratio of H_c→f.
   Since no such pair (x_f, x_c) is available a priori, we recover H_c→f with RANSAC. First we sample one point x_c from the region above the horizon in the Canny edge map, then a second point x_f, collinear with the first and v_v, from the region below the horizon. We compute the hypothesis H_c→f as above and score it by the number of edge pixels it maps onto other edge pixels (according to the Canny edge map). After a fixed number of iterations we return the hypothesis with the greatest score.
   Many images contain either no view of the floor or no view of the ceiling. In such cases H_c→f is unimportant since there are no corresponding points in the image: if the best RANSAC score falls below a threshold k_t, we set μ to a large value that transfers all pixels outside the image bounds, so H_c→f has no impact on the estimated model.
   Given the label y_x in column x, the orientation of every pixel in that column is recovered as follows:
   1. Compute y'_x = H_c→f [x, y_x, 1]^T
   2. Pixels between y_x and y'_x are vertical; the others are horizontal.
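   For concreteness, a minimal numpy sketch of this construction. The helper names (cross_ratio, manhattan_homology, wall_rows) and the coordinate-selection trick inside cross_ratio are ours, not the paper's; the formulas are the ones on this slide.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross ratio {a, b; c, d} of four collinear homogeneous 3-vectors."""
    pts = np.stack([a, b, c, d])
    j, k = np.argsort(pts.std(axis=0))[-2:]       # two coordinates that actually vary
    det = lambda p, q: p[j] * q[k] - p[k] * q[j]  # 2x2 determinant (scale-invariant)
    return (det(a, c) * det(b, d)) / (det(a, d) * det(b, c))

def manhattan_homology(v_v, v_l, v_r, x_c, x_f):
    """H_{c->f} = I + mu * (v_v h^T)/(v_v . h) from one ceiling/floor pair."""
    h = np.cross(v_l, v_r)                  # homology axis: the horizon line
    i = np.cross(np.cross(x_c, x_f), h)     # line through the pair meets the axis
    mu = cross_ratio(v_v, x_c, x_f, i)      # characteristic cross ratio
    return np.eye(3) + mu * np.outer(v_v, h) / np.dot(v_v, h)

def wall_rows(H_cf, x, y_x):
    """Rows spanned by wall in column x: between y_x and y'_x = H_{c->f}[x, y_x, 1]^T."""
    p = H_cf @ np.array([x, float(y_x), 1.0])
    return sorted((float(y_x), p[1] / p[2]))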
13. MODEL
   An indoor Manhattan model is a sequence of walls and corners:
       M = {c_1, (r_1, a_1), ..., c_{k-1}, (r_{k-1}, a_{k-1}), c_k}
   We seek the MAP model, assuming the feature sets are conditionally independent given M:
       M̂ = argmax_M P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M)
   Taking logarithms:
       log P(M | X) = log P(X_mono | M) + log P(X_stereo | M) + log P(X_3D | M) + log P(M)
   which decomposes into per-column payoffs and per-corner penalties:
       log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i)
14. PRIOR
   Corners are penalised according to their type (concave, convex, or occluding):
       ψ(M, i) = log λ_1  if c_i is a concave corner
                 log λ_2  if c_i is a convex corner
                 log λ_3  if c_i is an occluding corner
   With n_1, n_2, n_3 counting the corners of each type:
       P(M) = (1/Z) λ_1^{n_1} λ_2^{n_2} λ_3^{n_3}
15. LIKELIHOOD FOR PHOTOMETRIC FEATURES
   The model M deterministically assigns each pixel one of the three Manhattan orientations a ∈ {1, 2, 3} (shown as red, green, and blue regions in the figure). We assume a linear likelihood on per-pixel features Φ:
       P(Φ | a) = w_a^T Φ
   The log-posterior on M is then a sum of per-pixel terms plus the prior:
       log P(M | Φ) = n_1 log λ_1 + n_2 log λ_2 + n_3 log λ_3 + Σ_i log P(Φ_i | a_i*) + const
   where a_i* is the orientation predicted by M at pixel i, and the constant absorbs the normalizing denominators (which make no difference to the optimization). The likelihood part decomposes into the column-wise payoff
       π_mono(x, y_x) = Σ_i log P(Φ_i | a_i*)
   with the sum over the pixels of column x.
16. LIKELIHOOD FOR PHOTOCONSISTENCY FEATURES
   [Figure: base frame 0 and auxiliary frame i related by reprojection]
   Given auxiliary frames I_1, ..., I_K with known camera poses, each pixel p of the base frame is reprojected into frame k via the model surface it lies on, and compared with a photo-consistency measure PC:
       log P(I^{1:K} | M) = Σ_{p∈I_0} Σ_{k=1}^{K} PC(p, reproj_k(p, M))
   This decomposes column-wise into the payoff
       π_stereo(x, y_x) = Σ_{y=1}^{N_y} Σ_{k=1}^{K} PC(p, reproj_k(p, y_x)),   p = (x, y)
   This is equivalent to the canonical stereo formulation, subject to the indoor Manhattan assumption.
17. LIKELIHOOD FOR POINT CLOUD FEATURES
   [Figure: depth measurements d_i may be generated by a surface in the model (t_i = ON) or by an object inside or outside the environment (t_i = IN or OUT respectively), e.g. clutter, or points seen through a window.]
   The per-measurement likelihoods are:
       P(d | p, M, IN)  = α  if 0 < d < r(p; M),     0 otherwise
       P(d | p, M, OUT) = β  if r(p; M) < d < N_d,   0 otherwise
       P(d | p, M, ON)  = N(d ; r(p; M), σ)
   where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at pixel p.
18. LIKELIHOOD FOR POINT CLOUD FEATURES
   Marginalising over the hidden variable t gives P(d | p, M). Let D denote all depth measurements, P denote all pixels, and D_x contain the indices of the depth measurements in column x. Then
       log P(M | D, P) = log P(M) + Σ_x Σ_{i∈D_x} log P(d_i | p_i, y_x)
   which we write in payoff form as
       π_3D(x, y_x) = Σ_{i∈D_x} log P(d_i | p_i, y_x)
   and the penalty function ψ remains as before.
19. COMBINING FEATURES
   We combine the feature sets by assuming conditional independence given M:
       P(M | X_mono, X_stereo, X_3D) ∝ P(M) P(X_mono | M) P(X_stereo | M) P(X_3D | M)
   Taking logarithms leads to a summation over payoffs:
       π_joint(x) = π_mono(x) + π_stereo(x) + π_3D(x)
   No approximations are made other than conditional independence.
   Resolving the floor and ceiling planes: if C is the camera matrix for any frame and v_v is the vertical vanishing point in that frame, then n = C^{-1} v_v is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance ε of the plane (ε = 0.1% of the diameter of the point cloud in our experiments), and take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. This simple heuristic worked without failure on our training set.
20. INFERENCE
   The simple rectification procedure of [1] lets us assume that vertical lines in the world appear vertical in the image, so the model is fully described by the column values {y_x}.
   MAP inference:
       M̂ = argmax_M P(M | X)
   Since log P(M | X) = Σ_x π(x, y_x) + Σ_i ψ(M, i), this reduces to an optimisation over the payoff matrix π.
21. INFERENCE: RECURSIVE SUB-PROBLEM FORMULATION
   Sub-problem (figure 9): what is the optimal model up to column x?
   Let f_out(x, y, a), 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, a ∈ {1, 2}, be the maximum payoff over all indoor Manhattan models M spanning columns [1, x] such that (i) M contains a floor/wall intersection at (x, y), and (ii) the wall that intersects column x has orientation a. Then f_out can be computed by recursive evaluation of the recurrence relations
       f_out(x, y, a)  = max_{a'∈{1,2}} max{ f_up(x, y-1, a') + ψ, f_down(x, y+1, a') + ψ, f_in(x, y, a') + ψ }
       f_up(x, y, a)   = max{ f_in(x, y, a), f_up(x, y-1, a) }
       f_down(x, y, a) = max{ f_in(x, y, a), f_down(x, y+1, a) }
       f_in(x, y, a)   = max_{x'<x} f_out(x', y', a) + Δ,   Δ = Σ_{i=x'}^{x} π(i, y_i)
   where ψ is the corner penalty of the prior. Here f_in, f_up, and f_down are treated simply as notational placeholders; for their interpretations in terms of sub-problems see [1]. Finally, the base cases are
       f_out(0, y, a)    = 0    ∀ y, a
       f_up(x, 0, a)     = -∞   ∀ x, a
       f_down(x, N_y, a) = -∞   ∀ x, a
   [1] Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010
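   For intuition, a heavily simplified dynamic program in the same spirit. It collapses the four f-functions into a single table by pretending each wall keeps a constant boundary row (real walls follow lines through the horizontal vanishing points, which the exact recurrences above handle); see [1] for the exact algorithm. Everything here is an illustrative assumption.

```python
import numpy as np

def simple_manhattan_dp(pi, corner_penalty):
    """Toy column-wise DP: pi[x, y] is the joint payoff pi_joint for placing
    the wall/floor boundary at row y in column x; changing rows between
    columns (starting a new wall) costs corner_penalty.
    Returns (best score, boundary row per column).
    """
    W, H = pi.shape
    f = pi[0].copy()                  # best score at column 0 with boundary y
    back = np.zeros((W, H), dtype=int)
    for x in range(1, W):
        best_prev = float(f.max())    # best place to end the previous wall
        arg_prev = int(f.argmax())
        for y in range(H):
            stay = f[y]               # continue the current wall at row y
            switch = best_prev - corner_penalty
            back[x, y] = y if stay >= switch else arg_prev
            f[y] = max(stay, switch) + pi[x, y]
    ys = [int(f.argmax())]            # trace back the optimal boundary
    for x in range(W - 1, 0, -1):
        ys.append(int(back[x, ys[-1]]))
    return float(f.max()), ys[::-1]
```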
22. RESULTS
   Input:
   • 3 frames sampled at 1 second intervals
   • Camera poses from SLAM
   • Point cloud (approx. 100 points)
   Dataset:
   • 204 triplets from 10 video sequences
   • Image dimensions 640 x 480
   • Manually annotated ground truth
23. RESULTS
   [Figures: qualitative reconstructions on example frames]
24. RESULTS
   Algorithm              Mean depth error (%)   Labeling error (%)
   Our approach (full)    14.5                   24.5
   Stereo only            17.4                   30.5
   3D only [1]            15.2                   28.9
   Monocular only         24.8                   30.8
   Brostow et al. [2]     --                     39.4
   Lee et al. [3]         79.8                   54.5

   [1] Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010
   [2] Brostow, Shotton, Fauqueur, and Cipolla, "Segmentation and recognition using structure from motion point clouds", ECCV 2008
   [3] Lee, Hebert, and Kanade, "Geometric reasoning for single image structure recovery", CVPR 2009
25. RESULTS: TIMING
   Monocular features   160 ms
   Stereo features      730 ms
   3D features            9 ms
   Inference            102 ms
   Mean processing time per instance: 997 ms
26. RESULTS
   [Figures: sparse texture; non-Manhattan scenes]
27. RESULTS
   [Figures: poor lighting conditions]
28. RESULTS
   [Figures: clutter]
29. RESULTS
   [Figures: failure cases]
30. RESULTS
   [Figures: further failure cases]
31. SUMMARY
   • We wish to leverage multiple-view geometry for scene understanding.
   • Indoor Manhattan models are a simple and meaningful model family.
   • We have presented a probabilistic model for monocular, stereo, and point cloud features.
   • A fast and exact inference algorithm exists.
   • Results show state-of-the-art performance.