# Semantic Mapping of Road Scenes

Semantic mapping of road scenes, PhD thesis. The main aim of the thesis is to investigate and propose solutions to the scene understanding problem of finding 'what' objects are present in the world and 'where' are they located.

Published in: Science

### Semantic Mapping of Road Scenes

1. 1. PhD Thesis Defence • Semantic Mapping of Road Scenes • Sunando Sengupta, Oxford Brookes University • Supervisors: Prof. Philip Torr and Prof. David Duce • 16/06/2014
2. 2. Outline • Introduction • The Labelling Problem • Dense Semantic Map (chap. 3) • Dense 3D Semantic Modelling (chap. 4) • Mesh Based Inference (chap. 5) • Hierarchical CRF on an Octree Graph (chap. 6) • Conclusion
3. 3. Objective • The holy grail of computer vision: what objects are present in the scene, and where are they located? • Biological vision performs these two activities through human visual perception. • Computers (or humans through them) try to solve the same problem through an information-processing route: gather sensor data (images, GPS, IMU, …), represent them in a map, and recognise objects in the map. Representation and recognition can happen simultaneously or sequentially. • This thesis examines this problem and proposes solutions towards addressing it. Chap 1, Sec 1.2
4. 4. Objective - Visually • Input: image of a street scene with a person cleaning, some cars in the background, and buildings on the horizon. • Goal: place the appropriate objects at the right distance from the camera and at the correct size. Chap 1, Sec 1.2 Image courtesy: Antonio Torralba, http://6.869.csail.mit.edu/fa13/
5. 5. Why it is important 5  Numerous applications from robotics, entertainment, engineering, medical…  Self driving cars  Engineering  Robots for manipulation  Humanoids  Assistive vision for impaired  Entertainment  Aim for a vision based system to produce a semantically consistent scene from visual inputs Chap 1, Sec 1.2
6. 6. Essentially a hard problem • Large variation in image formation. • Scene variation: varying scene type and geometry. • Object-level variation: large number of object classes; individual object location and orientation; object shape and appearance; depth/occlusions. • Illumination. • Shadows. • Motion blur. Chap 1, Sec 1.2
7. 7. Thesis - Contributions • This thesis provides solutions for large-scale outdoor urban semantic mapping. • Large-scale dense overhead semantic mapping. • Semantics from local images are fused to form a global ground plane map. • First attempt to generate such a map. • ~15 km of semantic mapping. • One of the first large-scale semantic maps. • Presented as an oral at IEEE IROS 2012. Chap 1, Sec 1.3
8. 8. Thesis - Contributions 8  Dense semantic reconstruction  Dense 3D semantic reconstruction from kms of stereo images.  Online sequential volumetric reconstruction to accommodate arbitrarily long road scenes.  Presented as oral in IEEE ICRA 2013.  Mesh based inference for scene labelling  Improved labelling accuracy and consistency.  Depth sensitive classifier fusion.  25x faster in inference time (than image labelling).  Presented as poster in CVPR 2013. Chap 1, Sec 1.3
9. 9. Thesis - Contributions 9  Hierarchical CRF on an Octree Graph  Unified framework to determine free and occupied regions in a scene along with object class labels.  Robust PN potential over octree volumes  Datasets (available online)  Yotta labelled dataset: multiview street images (urban, rural, highway) containing 8000+ images, with object class labellings  Kitti Labelled dataset: Object class labelling for publicly available KITTI dataset Chap 1, Sec 1.3
10. 10. Publications 10  Related to Thesis  S. Sengupta, P. Sturgess, L. Ladicky, P. H. S. Torr: Automatic dense visual semantic mapping from street- level imagery. IEEE/RSJ IROS 2012 (Chapter 3 )  S. Sengupta, E. Greveson, A. Shahrokni, P. H.S. Torr: Urban 3D Semantic Modelling Using Stereo Vision, IEEE ICRA, 2013 (Chapter 4 )  S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H.S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes, IEEE CVPR, 2013. ( *Joint first authors, Chapter 5.)  S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H.S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. SUNw: Scene Understanding Workshop. Held in conjunction with CVPR , 2013. (*Joint first authors, Invited paper )  Datasets  Yotta Labeled road scene dataset.  KITTI object labelling. (Datasets available at http://www.robots.ox.ac.uk/~tvg/projects )  Other publications  Z. Zhang, P. Sturgess, S. Sengupta, N. Crook, P. H.S. Torr: Efficient discriminative learning of parametric nearest neighbor classifiers, IEEE CVPR, 2012  L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. IJCV 2012 (Invited paper)  L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr : Joint Optimisation for Object Class Segmentation and Dense Stereo Reconstruction. BMVC 2010 (BMVA Best science paper ) Chap 1, Sec 1.4
11. 11. The labelling problem • Multiple computer vision tasks can be modelled as a labelling problem. • Assign each site in a discrete set of sites a label from the label set. • E.g. each pixel is associated with an object class label. Chap 2, Sec 2.1
12. 12. 12 What are the Labels  Discrete or continuous  Discrete  Image pixels assigned to object classes like Cars, humans, buildings, pavement, trees etc.  Foreground/background labels  Indoor/outdoor labels…  Continuous range  Depth: Pixels can take a set of disparity labels  Optical flow Chap 2, Sec 2.1
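13. 13. CRF Framework • A set of random variables $X = \{x_1, x_2, \ldots, x_N\}$, one per pixel, each taking a value from the label set. • The aim is to associate every random variable with a label. • The conditional probability of the labelling $\mathbf{x}$ given the data $D$ is $P(\mathbf{x} \mid D) = \frac{1}{Z} \exp(-E(\mathbf{x}))$, where $Z$ is the normalising constant. • The Gibbs energy is given as $E(\mathbf{x}) = -\log P(\mathbf{x} \mid D) - \log Z$. • The MAP labelling $\mathbf{x}^*$ of the random field is defined by $\mathbf{x}^* = \arg\max_{\mathbf{x}} P(\mathbf{x} \mid D) = \arg\min_{\mathbf{x}} E(\mathbf{x})$. Chap 2, Sec 2.2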
14. 14. CRF modelling for image labelling • The pixel labelling problem can be formulated as a pairwise/higher-order CRF problem whose energy is $E(\mathbf{x}) = \sum_{i \in V} \psi_i(x_i) + \sum_{i \in V, j \in N_i} \psi_{ij}(x_i, x_j) + \sum_{c \in C} \psi_c(\mathbf{x}_c)$, where C is the set of higher-order pixel groups. • The image is represented as a graph G = {V, E}. • V is the set of all nodes of the graph. • N_i is the neighbourhood of node i. • The unary potential measures the cost of assigning a particular label to a pixel; it is generated from the response of a boosted classifier over a region around each pixel. Chap 2, Sec 2.2
15. 15. CRF modelling for image labelling • The pairwise (smoothness) term depends on inter-pixel observations and should be discontinuity-preserving across object boundaries. • It takes the Potts form: $\psi_{ij}(x_i, x_j) = 0$ if $x_i = x_j$ and $g(i, j)$ otherwise, where $g(i, j)$ is a contrast-sensitive edge cost. • Higher-order potentials are defined on groups of pixels that are conditionally dependent on each other: Robust P^N and hierarchical P^N models [1]. • The final labelling is obtained by minimising the energy E. Chap 2, Sec 2.2 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
16. 16. Quite hard • The energy minimisation is hard: there is a large number of interconnected random variables. • Possible solutions such as simulated annealing and ICM exist, but they are slow. • Approximate algorithms exist for certain energy functions in the multi-label problem. • Move-making algorithms [1]: α-expansion: for each label α, every random variable either retains its existing label or changes to α, solved using graph cuts. αβ-swap: considers a pair of labels at each iteration, such that variables currently labelled α or β may swap between the two labels, solved through graph cuts. Chap 2, Sec 2.2 [1] Y. Boykov et al. Fast Approximate Energy Minimization via Graph Cuts, ICCV 99.
17. 17. Stereo • Early attempts to explain depth perception began in the Renaissance. • Essentially, the images subtended at the left and right eyes can be used to obtain a disparity/depth map. [Figures: stereo sketch by Jacopo Chimenti da Empoli, Italy, around 1600 AD; Leonardo da Vinci, optical studies on binocular vision] Chap 2, Sec 2.3
18. 18. Depth from Sequence of Images • Structure from motion for sparse 3D reconstruction [1] • Visual hull/silhouette-based volume carving [2] • Elevation/height/2.5D maps [3] • TSDF/voxel-based fusion [4] Chap 2, Sec 2.3 [1] S. Agarwal et al. Building Rome in a day. Commun. ACM, 2011. [2] Y. Furukawa et al. Carved visual hulls for image-based modeling. IJCV, 2009. [3] F. Erbs et al. Stixmentation - probabilistic stixel based traffic scene labeling. BMVC 2012. [4] R. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.
19. 19. Dense Semantic Mapping • Generate an overhead view of an urban region. • Every pixel in the map view is associated with an object class label: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post. Chap 3, Sec 3.1
20. 20.  Street images captured inexpensively from vehicle with multiple mounted camera[1]. [1] Yotta. DCL, “Yotta dcl case studies,” Available: http://www.yottadcl.com/surveys/case-studies/ 20 Dense Semantic Mapping
21. 21. Semantic Mapping Framework • The semantic mapping framework comprises two stages, starting from street-level image acquisition. Chap 3, Sec 3.3
22. 22. Semantic Mapping Framework • Stage 1: semantic image segmentation at street level.
23. 23. Semantic Mapping Framework • Stage 2: ground plane labelling at a global level. • Pipeline: street-level image acquisition, image segmentation, ground plane labelling. • First attempt to do overhead mapping from street-level images.
24. 24. Street-level Image Segmentation • Label every pixel in the image with an object class label. • Input: raw image → automatic labeller → output: labelled image. • Object class labels: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post. Chap 3, Sec 3.3.1
25. 25. Street-level Image Segmentation 25  CRF based image labeller  Each pixel is a node in a grid graph G = (V,E).  Each node is a random variable x taking a label from label set. CRF construction Final SegmentationInput Image
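26. 26. Semantic Image Segmentation - CRF • Total energy: $E(\mathbf{x}) = \underbrace{\sum_{i \in V} \psi_i(x_i)}_{E_{pix}} + \underbrace{\sum_{i \in V, j \in N_i} \psi_{ij}(x_i, x_j)}_{E_{pair}} + \underbrace{\sum_{c \in C} \psi_c(\mathbf{x}_c)}_{E_{region}}$ • Optimal labelling given as $\mathbf{x}^* = \arg\min_{\mathbf{x}} E(\mathbf{x})$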
27. 27. Semantic Image Segmentation - CRF • Total energy E = Epix + Epair + Eregion • Epix models each individual pixel's cost of taking a label (example for a pixel x: Car 0.2, Road 0.3). • Computed via the dense boosting approach, a multi-feature variant of TextonBoost [1]. [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
28. 28. Semantic Image Segmentation - CRF • Total energy E = Epix + Epair + Eregion • Epair models each pixel's neighbourhood interactions. • Encourages label consistency in adjacent pixels. • Sensitive to edges in images. • Contrast-sensitive Potts model: the cost for neighbouring pixels xi and xj is 0 when both take the same label (Car/Car or Road/Road) and g(i,j) when the labels differ (Car/Road). [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
29. 29.  Total energy E = Epix + Epair + Eregion  Eregion - Model behaviour of a group of pixels.  Classify a region  Encourages all the pixels in a region to take the same label.  Group of pixels given by multiple meanshift segmentations Semantic Image Segmentation - CRF 29 [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009.
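A minimal sketch of how the pixel and pairwise terms of this energy could be evaluated for a candidate labelling on a 4-connected grid. The function name `crf_energy`, the parameters `theta` and `beta`, and the exponential form chosen for the contrast-sensitive cost g(i,j) are illustrative assumptions, not taken from the thesis; the region term Eregion is left out for brevity.

```python
import numpy as np

def crf_energy(unary, labels, image, theta=1.0, beta=1.0):
    """Evaluate E_pix + E_pair for one labelling (E_region omitted).

    unary:  (H, W, L) per-pixel label costs
    labels: (H, W) integer labelling
    image:  (H, W) grayscale intensities used for the contrast term
    """
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    e_pix = unary[rows, cols, labels].sum()

    e_pair = 0.0
    # Right and down neighbours cover every edge of a 4-connected grid once.
    for dy, dx in ((0, 1), (1, 0)):
        a, b = labels[: h - dy, : w - dx], labels[dy:, dx:]
        ia, ib = image[: h - dy, : w - dx], image[dy:, dx:]
        # Assumed contrast-sensitive cost: cheap to cut across strong edges.
        g = theta * np.exp(-beta * (ia - ib) ** 2)
        # Potts: pay g(i, j) only where the two labels differ.
        e_pair += (g * (a != b)).sum()
    return e_pix + e_pair
```

An optimiser such as alpha-expansion would search for the labelling that minimises this value; the sketch only scores a given labelling.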
30. 30. Semantic Image Segmentation - CRF • Energy minimisation using the alpha-expansion algorithm [1]. [Figure: input image and road expansion] [1] Y. Boykov et al. Fast Approximate Energy Minimization via Graph Cuts. ICCV 99.
31. 31. Semantic Image Segmentation - CRF • Solved using the alpha-expansion algorithm [1]. [Figure: input image and building expansion]
32. 32. Semantic Image Segmentation - CRF • Solved using the alpha-expansion algorithm [1]. [Figure: input image and sky expansion]
33. 33. Semantic Image Segmentation - CRF • Solved using the alpha-expansion algorithm [1]. [Figure: input image and pavement expansion]
34. 34. Semantic Image Segmentation - CRF • Solved using the alpha-expansion algorithm [1]. [Figure: input image and final solution]
35. 35. Ground Plane Labelling • Combine many labellings from street-level imagery. • Input: street-level labellings → automatic labeller → output: labelled ground plane.
36. 36. Ground Plane CRF • A CRF defined over the ground plane Z. • Each ground plane pixel (zi) is a random variable taking a label from the label set. • Energy for the ground plane CRF is $E^g(Z) = E^g_{pix} + E^g_{pair}$. Chap 3, Sec 3.3.2
37. 37. Ground Plane Pixel Cost • We assume a flat world.
38. 38. Ground Plane Pixel Cost • A ground plane region is estimated. [Figure: homography from image labels (Road, Pavement, Post/Pole) to the ground plane]
39. 39. Ground Plane Pixel Cost • Each point in the image projects to a unique point on the ground plane, creating a homography.
40. 40. Ground Plane Pixel Cost • The image labelling is mapped to the ground plane via the homography, filling the ground plane pixel histograms.
41. 41. Ground Plane Pixel Cost • Labels projected from many views are combined in a histogram. • The normalised histogram gives the naïve probability of the ground plane pixel taking a label.
42. 42. Ground Plane Pixel Cost (continued) Chap 3, Sec 3.3.2
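The projection-and-histogram step described on these slides can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the function names, the convention that `H` maps homogeneous image pixels (u, v, 1) directly to ground-plane cell coordinates, and the use of one cell per unit are all hypothetical.

```python
import numpy as np

def accumulate_ground_histogram(hist, label_image, H):
    """Vote each labelled image pixel into the ground-plane cell it maps to.

    hist:        (Gh, Gw, L) float array of label counts, updated in place
    label_image: (h, w) integer label per pixel
    H:           assumed 3x3 homography from image pixels to ground cells
    """
    h, w = label_image.shape
    v, u = np.indices((h, w))
    pts = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])  # homogeneous pixels
    g = H @ pts
    valid = np.abs(g[2]) > 1e-9  # drop points mapped to infinity
    gx = np.floor(g[0, valid] / g[2, valid]).astype(int)
    gy = np.floor(g[1, valid] / g[2, valid]).astype(int)
    labs = label_image.ravel()[valid]
    inside = (gx >= 0) & (gx < hist.shape[1]) & (gy >= 0) & (gy < hist.shape[0])
    # np.add.at accumulates correctly when several pixels hit the same cell.
    np.add.at(hist, (gy[inside], gx[inside], labs[inside]), 1.0)
    return hist

def naive_probability(hist):
    """Normalise each cell's histogram; cells with no votes stay all-zero."""
    total = hist.sum(axis=2, keepdims=True)
    return np.divide(hist, total, out=np.zeros_like(hist), where=total > 0)
```

Calling `accumulate_ground_histogram` once per view and then `naive_probability` gives the per-cell label distribution from which a unary cost could be derived.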
43. 43. Ground Plane Labelling • A histogram is built for every ground plane pixel, giving $E^g_{pix}$. • A pairwise cost ($E^g_{pair}$) is added to induce smoothness. • Contrast-sensitive Potts model.
44. 44. Ground Plane Labelling • Final CRF solution obtained using alpha expansion. [Figure: initial map, all void]
45. 45. Ground Plane Labelling • Final CRF solution obtained using alpha expansion. [Figure: road expansion]
46. 46. Ground Plane Labelling • Final CRF solution obtained using alpha expansion. [Figure: building expansion]
47. 47. Ground Plane Labelling • Final CRF solution obtained using alpha expansion. [Figure: pavement expansion]
48. 48. Ground Plane Labelling • Final CRF solution obtained using alpha expansion. [Figure: final solution]
49. 49. Experiments - Dataset • Subset of the images captured by the van. • ~15 km of track, 8000 images from each camera. • Pixel-level labelled ground truth images. Dataset available [1]. • 13 object categories: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post. • Training: 44 images; testing: 42 images. [1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticMap/index.php Chap 3, Sec 3.4.1
50. 50. SIS Results  Input Images, output of our image level CRF, ground truths.  Used Automatic Labelling environment[1] [1] The Automatic Labelling Environment, L Ladicky, PHS Torr. Code available http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm BuildingRoadTreeVegetation FenceSignage SkyPavement Car Pedestrian Bollard Shop Sign Post 50 Input Semantic segmentation Ground Truth
51. 51. Semantic Map Results 51 Semantic map of Pembroke city Chap 3, Sec 3.4.2
52. 52. Ground Plane Map Evaluation • We back-project the ground plane map into the image domain and evaluate the results. • Global pixel accuracy of 83%. [Figure panels: street images, back-projected map results, ground truth]
53. 53. Results - video 53
54. 54. Chapter Summary  Presented a method to generate overhead view semantic mapping.  Experiments on large tracks (~15km) which can be scaled up to country wide mapping  Dataset available[1].  However a flat world assumption does not represent the 3D scene properly – our aim is to perform a semantic metric reconstruction of the world. [1] http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticMap/index.php 54
55. 55. Urban 3D Semantic Modelling Using Stereo Vision 55 [1] Input Stereo image Sequence Dense 3D Semantic Model  Given a sequence of stereo images we generate a dense 3D, semantic model Chap 4, Sec 4.1
56. 56. Pipeline –Semantic Reconstruction 56  Stereo images Chap 4, Sec 4.3
57. 57. Pipeline –Semantic Reconstruction 57  Stereo images  Camera pose estimation and individual depth map generation
58. 58. Pipeline –Semantic Reconstruction 58  Surface reconstruction
59. 59. Pipeline –Semantic Reconstruction 59  Semantic labelling of street view images
60. 60. Pipeline –Semantic Reconstruction 60  Semantic model generation
61. 61. Camera Estimation 61  Feature tracking using left-right pair and consecutive frames Chap 4, Sec 4.3.1
62. 62. Camera Estimation  Use the feature tracks to estimate camera poses.  Use bundle adjustment [a]Andreas Geiger et. Al. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite CVPR 2012 62
63. 63. Bundle Results 63  Bundler results after 10, 20, 30 and 40 frames
64. 64. Sparse Reconstructions 64  But our target is to obtain a large scale dense 3D world representation.
65. 65. Depth-Map Estimation • Semi-global block matching [1] for disparity estimation. • Per-pixel depth computed as $z = B \times f / d$, where B is the baseline, f the focal length and d the pixel disparity. [1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
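As a worked illustration of z = B × f / d, the sketch below converts a disparity map to depth. The function name and the choice to mark non-positive disparities as infinitely far are assumptions made for this example; the baseline and focal length in the usage note are made-up values, not the calibration used in the thesis.

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px):
    """Per-pixel depth z = B * f / d, in metres.

    Pixels with zero or negative disparity carry no depth estimate and are
    returned as infinity (an assumed convention for this sketch).
    """
    d = np.asarray(disparity, dtype=float)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = baseline_m * focal_px / d[valid]
    return depth
```

For example, with a hypothetical 0.5 m baseline and 700 px focal length, a disparity of 35 px corresponds to a depth of 0.5 × 700 / 35 = 10 m.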
66. 66. Depth Fusion • Depth estimates are fused using the camera poses. • They are fused into a truncated signed distance function (TSDF) volumetric representation [1]. • The surface mesh is generated through the marching tetrahedra algorithm. [1] Brian Curless and Marc Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 96. Chap 4, Sec 4.3.2
67. 67. Depth fusion using TSDF Volume [1] • The entire space is divided into grids of voxels. • For each voxel, compute the truncated signed distance: positive and increasing when the voxel lies in free space, negative when it lies behind the surface, and zero when it lies on the surface. • Performed for all depth maps. [1] Brian Curless and Marc Levoy, A Volumetric Method for Building Complex Models from Range Images. SIGGRAPH 96.
68. 68. TSDF Volume [Figure: camera ray through the TSDF volume, with voxel values -.8, -.4, .1, .5, 1, 1, 1 changing sign at the actual surface]
69. 69. TSDF Volume [Figure: 8×8 grid of TSDF voxel values between -1 and 1; the sign change in each row traces the actual surface seen by the camera]
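A minimal one-dimensional sketch of TSDF fusion along a single camera ray, using the sign convention above (positive in free space, negative behind the surface). It follows the running weighted average of Curless and Levoy with unit weights; the function name, the normalisation to [-1, 1] and the decision to update every voxel on the ray are simplifying assumptions for illustration.

```python
import numpy as np

def fuse_depth(tsdf, weight, voxel_dist, depth, trunc):
    """Fuse one depth measurement into the voxels along a ray (in place).

    tsdf, weight: per-voxel running TSDF value and accumulated weight
    voxel_dist:   distance of each voxel centre from the camera
    depth:        measured surface depth along this ray
    trunc:        truncation distance, used to scale values into [-1, 1]
    """
    # Positive in front of the surface (free space), negative behind it.
    sdf = np.clip((depth - voxel_dist) / trunc, -1.0, 1.0)
    new_weight = weight + 1.0
    # Running average over all measurements seen so far.
    tsdf[:] = (tsdf * weight + sdf) / new_weight
    weight[:] = new_weight
    return tsdf, weight
```

Repeating the call for every depth map averages out noise, which is why fusing more depth maps gives a smoother zero crossing and hence a smoother surface.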
70. 70. Fusing multiple depth maps 70  Increased number of depth maps results in smooth surface generation Chap 4, Sec 4.3.2
71. 71. Incremental Volume Update  Road scenes are generally described through arbitrarily long image sequence.  3x3x1 volume of voxel grids initialised 71 Vehicle path ~1km
72. 72. Incremental Volume Update • Long sequences need to be mapped. • A 3x3x1 volume of voxel grids is initialised. • Volume is added incrementally as the vehicle moves out of the region. • This allows arbitrarily long sequences to be mapped, which is important for outdoor scenes. [Figure: vehicle path, ~1 km]
73. 73. Large scale dense reconstruction 73  Textured reconstruction.
74. 74. Semantic Model Generation • We use the conditional random field (CRF) framework [1]. • Each pixel is a node in a grid graph G = (V,E) with a random variable x taking a label from the label set. • Total energy E = Epix + Epair + Eregion • Epix models each individual pixel's cost of taking a label (example for a pixel x: Fence 0.2, Road 0.3). • Pipeline: input image → CRF construction → image segmentation. [1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for object class image segmentation,” in ICCV, 2009. Chap 4, Sec 4.4.1
75. 75. Semantic Image Segmentation • Epair models each pixel's neighbourhood interactions. • Encourages label consistency in adjacent pixels and is sensitive to edges. • Contrast-sensitive Potts model: the cost for neighbouring pixels xi and xj is 0 when both take the same label (Fence/Fence or Road/Road) and g(i,j) when the labels differ (Fence/Road). • Both colour and depth images are used. • Eregion models the behaviour of a group of pixels. • Groupings are obtained through superpixels.
76. 76. Semantic Image Segmentation - Results  Input Images, output of our image level CRF, ground truths. 76
77. 77. Mesh Face Labelling  A histogram of labels is built for each mesh face (Zf ), by projecting the points from the face into labelled images.  Majority label is considered as the label of the face. Chap 4, Sec 4.4.2 77
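The majority-vote rule for mesh faces can be sketched as below. The input format (a list of observed labels per face, gathered beforehand by projecting face points into the labelled images) and the use of -1 for faces that were never observed are assumptions made for this illustration.

```python
import numpy as np

def label_faces(face_votes, num_labels):
    """Return the majority label of each face, or -1 if it has no votes.

    face_votes: one sequence of integer labels per mesh face, i.e. the labels
                seen at the image pixels that the face projects to.
    """
    labels = []
    for votes in face_votes:
        # Histogram of labels observed for this face.
        hist = np.bincount(np.asarray(votes, dtype=int), minlength=num_labels)
        labels.append(int(hist.argmax()) if hist.sum() > 0 else -1)
    return labels
```

Ties are broken towards the lowest label index, which is simply the behaviour of `argmax` and not a choice documented in the slides.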
78. 78. Semantic Model Top: Left – Surface reconstruction, Right – Semantic model Bottom: Left - input image, Right- object label set 78
79. 79. Evaluation  KITTI Object Labelled Datasets: Manually labelled images for object class training (available for download). [1]  The Model is projected back using the estimated camera poses to create labelled images.  The points in the model far away from the camera are ignored in the projection. [1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticUrbanModelling/index.php Chap 4, Sec 4.5 79
80. 80. Evaluation  Metrics  Recall = tp/(tp+fn)  Intersection vs Union = tp/(tp+fn+fp) 80
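The two metrics can be computed per class from predicted and ground truth label arrays, as in this short sketch. The function name and the use of NaN for classes that never occur are assumptions for illustration.

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes):
    """Return (recall, intersection-vs-union) lists with one entry per class."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    recall, iou = [], []
    for c in range(num_classes):
        tp = int(np.sum((pred == c) & (gt == c)))  # true positives
        fn = int(np.sum((pred != c) & (gt == c)))  # false negatives
        fp = int(np.sum((pred == c) & (gt != c)))  # false positives
        recall.append(tp / (tp + fn) if tp + fn else float("nan"))
        iou.append(tp / (tp + fn + fp) if tp + fn + fp else float("nan"))
    return recall, iou
```

Intersection vs union also penalises false positives, so it is never higher than recall for the same class.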
81. 81. Video
82. 82. Long Sequence 82  1km dense reconstruction overlaid on a google map. Path of the vehicle.
83. 83. Chapter Conclusion • Large-scale dense semantic reconstruction. • Sequential volume update to accommodate long sequences. • Labelled dataset released. • Limitation: labelling is performed at image level, which results in semantic inconsistency, redundant labelling and a slow overall inference process. • Object layout in the scene helps in labelling.
84. 84. Chapter 5 - Mesh Based Scene Labelling • Motivation • Redundancy: individual street-level image labelling processes 0.5 million pixels per image, so a scene of 100-150 images is ~75 million pixels, which is slow. • Inconsistency in labelling. • Utilising structure through mesh connectivity. • Solution: perform labelling on the mesh. Chap 5, Sec 5.1
85. 85. Mesh labelling Framework 85  Depth maps fused into mesh.  Every mesh location associated with set of image pixels across a set of images.  Obtain a combined appearance score from these pixels through a depth sensitive fusion of scores.  Define CRF on mesh and perform inference on the structure. Mesh based labelling framework
86. 86. CRF over Scene Mesh 86  We use conditional random field framework (CRF) defined over the mesh locations. • Each mesh vertex is a node in a graph G = (V,E), where E is defined according to mesh neighbourhood. • Each node is a random variable x taking a label from label set. Chap 5, Sec 5.3
87. 87. Unary Score • Each mesh vertex xi is registered to a set of image pixels drawn from K images. • The class-wise classifier scores of these pixels are combined by a fusion function f to give the unary score of the vertex. • f can be ‘max’, ‘average’ or ‘weighted’. • ‘weighted’: weigh the class scores inversely by the 3D distance of the pixel from the respective camera centre. Chap 5, Sec 5.3.1
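The three fusion options can be sketched as follows. The function name, the array layout and, in particular, the normalisation of the inverse-distance weights by their sum are assumptions made for this example; the slide only states that scores are weighted inversely by distance.

```python
import numpy as np

def fuse_scores(scores, distances, mode="weighted"):
    """Combine per-pixel class scores registered to one mesh vertex.

    scores:    (K, L) array, one row of L class scores per registered pixel
    distances: (K,) 3D distance of each pixel's point from its camera centre
    mode:      'max', 'average' or 'weighted'
    """
    scores = np.asarray(scores, dtype=float)
    if mode == "max":
        return scores.max(axis=0)
    if mode == "average":
        return scores.mean(axis=0)
    if mode == "weighted":
        # Nearer observations get larger weights; weights are normalised
        # so that the fused scores stay on the same scale (an assumption).
        w = 1.0 / np.asarray(distances, dtype=float)
        return (w[:, None] * scores).sum(axis=0) / w.sum()
    raise ValueError(f"unknown fusion mode: {mode}")
```

With scores [[0.2, 0.8], [0.6, 0.4]] seen from 1 m and 3 m, the weighted fusion gives [0.3, 0.7], so the closer view dominates.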
88. 88. Pairwise • The pairwise term is defined on the mesh connectivity. • It takes the Potts form, with a cost that depends on Zi and Zj, the 3D locations of mesh vertices i and j. • Thus mesh locations close to each other are encouraged to take the same label.
89. 89. Experiments and results 89  Mesh segmentation with the corresponding images of the scene Chap 5, Sec 5.4
90. 90. Results - video 90
91. 91. Evaluation 91  Created ground truth mesh for evaluation [1]. [1] http://www.robots.ox.ac.uk/~tvg/projects/
92. 92. Observations 92  Improved accuracy for mesh based inference over image based labelling and projecting the labels  The pairwise connection respecting mesh connectivity improves labelling Ground Truth Unary only Unary + Pair Image
93. 93. Timing performance • Labelling over the mesh improves performance in the inference stage. • Scene of 150 images of resolution 1281x376 ≈ 75 million pixels. • Mesh of 704K vertices and 1.27 million faces. • 25x speedup in inference at our operating point. • Further speedup is possible by computing the classifier response only for pixels registered to the mesh.
94. 94. Inference Time with varying mesh size 94  Mesh created for the same scene with finer granularity.
95. 95.  Note –ground truth mesh generated for each granularity  Varying mesh granularity makes smaller sized mesh face and has effect on pairwise cost Accuracy with varying mesh granularity 95
96. 96. Scene editing 96  Labelling in 3D structure can help to categorize the 3D regions.  Some active scene editing ,e.g. vehicle moving on the road. Chap 5, Sec 5.4
97. 97. Scene edit - dynamic 97
98. 98. Chapter Conclusions • Presented a mesh-based inference method for scene labelling. • Inference on the mesh provides a more accurate and faster approach to scene labelling. • Presented a classifier score combination method which improves accuracy. • Up to 25x faster in the inference stage for outdoor scenes. • Applications: scene editing can be performed once the scene is labelled. • However, the mesh representation is limiting for various robotic tasks, which we try to overcome in the next chapter.
99. 99. Chapter 6 - Hierarchical CRF on an Octree Graph • Computer vision: scene recognition has been studied extensively. • Robotics: efficient and accurate 3D scene representations exist for various robotic tasks, but little work addresses understanding semantics. • Aim: bring the two together by performing recognition in an efficient representation, and present a method which performs recognition and occupancy inference jointly; uses hierarchical constraints to perform scene labelling; and uses an efficient 3D representation for determining occupied, free and unknown areas. Chap 6, Sec 6.1
100. 100. Good 3D representation 100  Why  Needed for further processing tasks  Robotics domain – mapping, grasping/manipulation, navigation  Graphics domain – efficient rendering over graphics processing unit and visualization  What  Should map accurately  Occupied: Objects present in the world,  Free: required for collision avoidance, path planning.  Unmapped: unknown areas in the scene need to be avoided.  Efficiency: Any 3D volume requires to be identified as free/occupied/unmapped efficiently.
101. 101. Existing 3D representations • Point clouds storing 3D measurements from sensors: cannot map free and unknown areas. • Mesh: same limitations as point clouds. • Stixels/height maps/2.5D: one height value per cell in a 2D grid, but free area is not accurately mapped. • Fixed-size grid of voxels: voxels are not indexed, which makes it inefficient. • Octree-based volumetric representation: introduced more than three decades ago, represents 3D space accurately, with efficient indexing of the volume.
102. 102. Octomap - representation • Octree representation • Every voxel/volume is divided into 8 sub-volumes, allowing fast indexing of voxels. • Advantageous in comparison to point clouds, surface maps and elevation/2.5D representations. • Used widely across computer science. • Hardware friendly (CPU, GPU, FPGA). • OctoMap [a] proposed in 2013 • Probabilistic representation of occupied, free and unknown regions. • Based on an octree-based 3D representation. • Demonstrated to map large areas through fusion of depth estimates. [a] Armin Hornung et al., OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
103. 103. Multi-resolution approaches in computer vision • Multi-resolution approaches are used for recognition, classification and detection. • Information at the level of pixels, pairs of pixels or groups of pixels is combined. • The Robust P^N model [1] penalises label inconsistency over a group of pixels. • Groupings are determined through unsupervised image segmentation. • Here we extend the multi-resolution image-based classification approach to a 3D volume indexed through an octree. [1] P. Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency.
104. 104. Semantic Octree - framework 104  Input stereo images Chap 6, Sec 6.3
105. 105. Semantic Octree - framework 105  Generate point clouds and class hypothesis for every pixel Chap 6, Sec 6.3
106. 106. Semantic Octree - framework 106  Fuse into an octree through estimated camera  Octree – each volume subdivided in 8 sub-volumes  Leaf- nodes (xi) are the smallest sized voxels  Any internal node (xc) gives a natural grouping of 3D space Chap 6, Sec 6.3
107. 107.  Perform inference over 3D voxels to give labelled scene. Semantic Octree - framework 107 Chap 6, Sec 6.3
108. 108. CRF graph on Octree voxels  Octree divides the space into subvolumes indexed through tree with nodes  τint : Internal nodes in the tree (xc)  τleaf : leaf level voxels (xi)  Random variable for every leaf voxel  Every internal node is associated with a set of leaf voxels resulting in a clique  Label set defined as  Final energy : 108 Chap 6, Sec 6.3
109. 109. Unary score for leaf voxels • Octree volume update • All voxels are initially set to unknown, with occupancy probability P(xi) = 0.5 and log odds $L(x_i) = \log \frac{P(x_i)}{1 - P(x_i)} = 0$. • For each 3D point (obtained from stereo pairs), the voxels' log odds are updated in a ray-casting manner. • Log odds are updated for all 3D points of every stereo pair. • The final occupancy probability is obtained as $P(x_i) = 1 - \frac{1}{1 + \exp(L(x_i))}$. Chap 6, Sec 6.3.1
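A minimal sketch of a log-odds occupancy update along one cast ray, in the style of OctoMap. The dictionary-based map, the function names, and the hit/miss probabilities of 0.7 and 0.4 are illustrative assumptions; the voxels traversed by the ray are assumed to be computed elsewhere.

```python
import math

def log_odds(p):
    """Log odds of a probability; 0.5 maps to 0 (unknown)."""
    return math.log(p / (1.0 - p))

def probability(l):
    """Inverse of log_odds: recover the occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

def update_ray(log_odds_map, ray_voxels, p_hit=0.7, p_miss=0.4):
    """Update voxel log odds for one ray from the camera to a 3D point.

    log_odds_map: dict from voxel key to log odds; missing keys are unknown
    ray_voxels:   voxel keys traversed by the ray, the last one holding the point
    """
    *free_voxels, end_voxel = ray_voxels
    for v in free_voxels:
        # The ray passed through these voxels, so they are more likely free.
        log_odds_map[v] = log_odds_map.get(v, 0.0) + log_odds(p_miss)
    # The ray ended here, so this voxel is more likely occupied.
    log_odds_map[end_voxel] = log_odds_map.get(end_voxel, 0.0) + log_odds(p_hit)
    return log_odds_map
```

Because updates are additive in log-odds space, repeated observations of the same voxel from many stereo pairs accumulate with one addition each.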
110. 110. Unary score for leaf voxels • Each occupied voxel xi is associated with a set of 3D points. • The classifier scores of the corresponding image pixels are combined together. • The unary is computed from the combined score and the initial occupancy P(xi). • Thus every voxel initially estimated as occupied has a low cost for the free label, and vice versa. Chap 6, Sec 6.3.1
111. 111. Hierarchical tree potential • Robust P^N potential applied over hierarchical groupings of voxels. • Penalises label inconsistency within each grouping of voxels. • Takes the Robust P^N form, with the maximum cost truncated to γmax. • Groupings of voxels correspond to internal nodes in the octree. Chap 6, Sec 6.3.2
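For illustration, the sketch below evaluates a Robust P^N potential for one group of voxels, following the general form of Kohli et al.: the cost grows linearly with the number of voxels that disagree with the best label and is truncated at γmax. The parameter names `gamma`, `gamma_max` and `q`, and the values in the usage note, are assumptions; the exact parameterisation used in the thesis is not given on the slide.

```python
import numpy as np

def robust_pn(labels, num_labels, gamma, gamma_max, q):
    """Robust P^N cost of one clique (e.g. the leaves under an octree node).

    labels:    integer label of each voxel in the group
    gamma:     (num_labels,) cost when the whole group takes that label
    gamma_max: truncation value, the largest cost the potential can take
    q:         number of disagreeing voxels at which the cost saturates
    """
    labels = np.asarray(labels, dtype=int)
    gamma = np.asarray(gamma, dtype=float)
    counts = np.bincount(labels, minlength=num_labels)
    # Per-label slope: cost rises linearly from gamma[k] to gamma_max.
    theta = (gamma_max - gamma) / q
    costs = (labels.size - counts) * theta + gamma
    return float(min(costs.min(), gamma_max))
```

With `gamma = [0, 0]`, `gamma_max = 1` and `q = 2`, a group of four voxels costs 0 when fully consistent, 0.5 with one dissenting voxel, and the truncated maximum of 1 when split evenly.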
112. 112. Experiments • Octree defined with 16 levels. • Smallest voxel resolution: (8 × 8 × 8) cm³. • Maximum mapped volume: $(2^{16} \times 8\ \text{cm})^3 \approx (5.243\ \text{km})^3$. • Hierarchical groupings of voxels corresponding to internal nodes at levels 13-15 are considered.
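The arithmetic behind the mapped extent, and the way an octree indexes a leaf voxel, can be sketched as follows. Both function names and the bit-packing order of the child index are assumptions made for this example, not the layout of any particular octree library.

```python
def octree_extent_m(levels, leaf_cm):
    """Side length in metres covered by an octree with the given depth."""
    return (2 ** levels) * leaf_cm / 100.0

def leaf_path(point_m, levels, leaf_m):
    """Child index (0-7) chosen at each level, from root to leaf.

    point_m is assumed to lie inside [0, extent)^3. At every level one bit
    of each integer voxel coordinate selects one of the 8 sub-volumes.
    """
    ix, iy, iz = (int(c // leaf_m) for c in point_m)
    path = []
    for level in reversed(range(levels)):
        child = ((ix >> level) & 1) | (((iy >> level) & 1) << 1) | (((iz >> level) & 1) << 2)
        path.append(child)
    return path
```

With 16 levels and 8 cm leaves, `octree_extent_m(16, 8)` gives 5242.88 m per side, which is the ~5.243 km figure quoted above; reaching any leaf takes exactly 16 child lookups.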
113. 113. Results • Hierarchical grouping during inference vs leaf-level voxel labelling (much sparser). Chap 6, Sec 6.4
114. 114. Results • Quantitative evaluation: performed by projecting into the image domain. • Observations • Small objects tend to get decimated due to octree quantisation, hence reduced accuracy. • A mesh-based representation is better at representing surfaces. • Non-uniform grouping of volumes (k-d tree) can be used to improve results.
115. 115. Occupancy mapping 115  Grouping of voxels hierarchically increases the occupied volume reducing the sparsity
116. 116. Chapter Conclusion 116  A method to infer jointly object class labels and occupancy mapping proposed  Efficient representation of 3D space for further operations like navigation and manipulation  Octree poses a quantization error which can be approached through grouping of volumes through k-d tree
117. 117. Thesis - Conclusions 117  This thesis covered the aspects of scene understanding and proposed solutions for dense semantic mapping and reconstruction  Chapter 3 – Large scale Dense semantic mapping  Overhead semantic view of an urban region  Experiments to generate ~15km map  One of the first large scale semantic map  Presented as oral in IEEE IROS 2012 Chap 7, Sec 7.1
118. 118. Thesis - Conclusions 118  Chapter 4 – Dense semantic reconstruction  Dense semantic reconstruction from kms of stereo images.  Online volumetric reconstruction to accommodate arbitrarily long road scenes.  Presented as oral in IEEE ICRA 2013  Chapter 5 – Mesh based inference for scene labelling  Improved labelling accuracy (pairwise connections respect mesh connectivity) and consistency.  Depth sensitive classifier fusion.  25x faster in inference time  Presented as poster in CVPR 2013
119. 119. Conclusions 119  Chapter 6 – Hierarchical CRF on an Octree Graph  Unified framework to determine 3D volume occupancy and with object class labels in the scene.  Efficient representation  Robust PN potential over octree volumes  Datasets (available publicly)  Yotta labelled dataset: multiview street images (urban, rural, highway) containing 8000+ images, with object class labellings  Kitti Labelled dataset: Object class labelling for publicly available KITTI dataset
120. 120. Way forward 120  Transfer learning – so many datasets with so many labellings. Should aim to learn from multiple source and apply in test cases.  Life long learning – an agent needs to identify the object irrespective of changes in environment  Exploit High level attributes  Need to investigate for an end-to-end real-time pipeline for dense recognition, reconstruction  Exploit scene dynamics – DVS (dynamic vision systems) give only modified pixels through efficient sensors. Chap 7, sec 7.2
121. 121. Thank you 121  Acknowledgements  Supervisors: Philip Torr and David Duce  Thesis Examiners: Gabriel Brostow and Nigel Crook  Collaborators: Paul Sturgess, Lubor Ladicky, Ali Shahrokni, Eric Greeveson, Julien Valentin, Ziming Zhang, Johnathan Warrell, Chris Russell, Yalin Bastanlar, William Clocksin, Vibhav Vineet, Mike Sapi.
122. 122. References • L. Ladicky et al. Associative hierarchical CRFs for object class image segmentation. ICCV 2009, PAMI 2013. • P. Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency. IJCV 2009. • P. Sturgess et al. Combining Appearance and Structure from Motion Features for Road Scene Understanding. BMVC 2009. • L. Ladicky et al. Joint optimisation for object class segmentation and dense stereo reconstruction. BMVC 2010, IJCV 2012. • R. A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.