Vincent Lepetit - Real-time computer vision
 


  • [Comment] If we follow directly m' = Hm, the straightforward derivation gives: u' = u''/w'' = (H11 u + H12 v + H13)/w'', v' = v''/w'' = (H21 u + H22 v + H23)/w'', with w'' = H31 u + H32 v + H33. This leads to the linear system on slide 16, where the [H31 H32 H33] terms appear with a negative sign; equivalently, this sign can be absorbed into the third row of H, i.e. (H31, H32, H33).
    Presentation Transcript

    • Real-Time Computer Vision. Microsoft Computer Vision School. Vincent Lepetit - CVLab - EPFL (Lausanne, Switzerland) 1
    • demo 2
    • applications ... 3
    • How the demo works (including Randomized Trees); more recent work. 4
    • Background: 3D world to 2D images (projection matrix, internal parameters, external parameters, homography, ...); robust estimation (non-linear least-squares, RANSAC, robust estimators, ...); feature point matching (affine region detectors, SIFT, ...). 5
    • From the 3D World to a 2D Image. What is the relation between the 3D coordinates of a point M and its correspondent m in the image captured by the camera? 6
    • Perspective Projection. The image formation is modeled as a perspective projection, which is realistic for standard cameras: the rays passing through a 3D point M and its correspondent m in the image all intersect at a single point C, the camera center. 7
    • Expressing M in the Camera Coordinate System. Step 1: express the coordinates of M in the camera coordinate system as Mcam. This transformation corresponds to a Euclidean displacement (a rotation plus a translation): Mcam = R M + T, where R is a 3x3 rotation matrix and T is a 3-vector. 8
    • Homogeneous Coordinates. Replace M = (X, Y, Z)^T by the homogeneous 4-vector M~ = (X, Y, Z, 1)^T: just add a 1 as the fourth coordinate. The Euclidean displacement can now be expressed as a linear transformation instead of an affine one: Mcam = R M + T becomes M~cam = (R | T) M~, where (R | T) is a 3x4 matrix. 9
    • Projection. Computation of the coordinates of m in the image plane from Mcam (expressed in the camera coordinate system): simply use Thales' theorem: mX / f = X / Z, so mX = f X / Z (and similarly mY = f Y / Z). 10
    • From Projection to Image Coordinates. Coordinates of m in pixels? With ku and kv the pixel scale factors along the image axes and (u0, v0) the principal point: mX = f X / Z, mY = f Y / Z, and mu = u0 + ku mX, mv = v0 + kv mY. 11
    • Putting the perspective projection and the transformation into pixel coordinates together: mX = f X / Z, mY = f Y / Z, mu = u0 + ku mX, mv = v0 + kv mY. In matrix form: (u, v, w)^T = K (X, Y, Z)^T with K = [ku f, 0, u0; 0, kv f, v0; 0, 0, 1]. (u, v, w)^T defines m in homogeneous coordinates: mu = u / w = u0 + ku f X / Z, and mv = v / w = v0 + kv f Y / Z. 12
    • The Full Transformation. The two transformations are chained to form the full transformation from a 3D point in the world coordinate system to its projection in the image: (u, v, w)^T = K (R | T) (X, Y, Z, 1)^T = P (X, Y, Z, 1)^T. The product of the internal calibration matrix K and the external calibration matrix (R | T) is a 3x4 matrix P called the "projection matrix". The projection matrix is defined up to a scale factor. 13
    • The Full Transformation. R, T, and the products ku f and kv f can be extracted from the projection matrix P = K (R | T). 14
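As a concrete illustration of the chain above, here is a minimal sketch (not from the slides) that projects a 3D world point to pixel coordinates with Eigen; the names K, R, T, M follow the conventions of the previous slides.

#include <Eigen/Dense>

// Project a 3D point M (world coordinates) into the image:
// step 1 applies the external parameters (R | T), step 2 the internal calibration K.
Eigen::Vector2d project(const Eigen::Matrix3d& K,   // internal calibration matrix
                        const Eigen::Matrix3d& R,   // rotation, world to camera
                        const Eigen::Vector3d& T,   // translation
                        const Eigen::Vector3d& M)   // 3D point in world coordinates
{
  Eigen::Vector3d Mcam = R * M + T;     // M expressed in the camera coordinate system
  Eigen::Vector3d m = K * Mcam;         // (u, v, w)^T, homogeneous, defined up to scale
  return Eigen::Vector2d(m(0) / m(2),   // mu = u / w
                         m(1) / m(2));  // mv = v / w
}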
    • Homography. For points lying on a plane, take Z = 0 in the object coordinate system: m' = P M = [P1 P2 P3 P4] (X, Y, 0, 1)^T = [P1 P2 P4] (X, Y, 1)^T = H3x3 m, where H3x3 is the 3x3 homography matrix. 15
    • Computing a Projection Matrix or a Homography from Point Correspondences by solving a linear system. For m' = Hm, with m = [u, v, 1]^T and m' = [u', v', 1]^T, each correspondence gives two linear equations in the entries of H: [u v 1 0 0 0 -uu' -vu' -u'] h = 0 and [0 0 0 u v 1 -uv' -vv' -v'] h = 0, with h = [H11 H12 H13 H21 H22 H23 H31 H32 H33]^T. 16
    • Computing a Projection Matrix or a Homography from Point Correspondences with a non-linear optimization. Non-linear least-squares minimization: minimization of a physical, meaningful error (the reprojection error, in pixels): min over R,T of Σ_i dist(P_{R,T} M_i, m'_i)^2, or min over R,T of Σ_i dist(H_{R,T} m_i, m'_i)^2. Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient). 17
    • A Look at the Reprojection Error. 1D camera under 2D translation; true camera position at (0, 0); reprojection error for 100 "3D points" taken at random in [400;1000]x[-500;+500]. 18
    • Gaussian Noise on the Projections. White cross: true camera position; black cross: global minimum of the objective function. In that case, the global minimum of the objective function is close to the true camera pose. 19
    • What if there are Outliers? One of the measured projections (m2) is an incorrect measure, an outlier. 20
    • Gaussian Noise on the Projections + 20% outliers. White cross: true camera position; black cross: global minimum of the objective function. The global minimum is now far from the true camera pose. 21
    • What Happened? Bayesian interpretation: argmin over R,T of Σ_i dist(P_{R,T} M_i, m'_i)^2 = argmax over R,T of Π_i N(m'_i; P_{R,T} M_i, σI). The error on the 2D point locations m'_i is assumed to have a Gaussian (Normal) distribution with identical covariance matrices σI, and to be independent; this assumption is violated when m'_i is an outlier. 22
    • Robust Estimation. Idea: replace the Normal distribution by a more suitable distribution, or equivalently replace the least-squares estimator by a "robust estimator" or "M-estimator": argmin over R,T of Σ_i dist(P_{R,T} M_i, m'_i)^2 becomes argmin over R,T of Σ_i ρ(dist(P_{R,T} M_i, m'_i)). 23
    • Example of an M-estimator: the Tukey Estimator. ρ(x) = (c^2/6) (1 - (1 - (x/c)^2)^3) if |x| <= c, and ρ(x) = c^2/6 if |x| > c. The Tukey estimator assumes the measures follow a distribution that is a mixture of: a Normal distribution for the inliers, and a uniform distribution for the outliers. 24
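A minimal sketch (not from the slides) of the Tukey ρ function written above; c is the threshold separating inliers from outliers.

#include <cmath>

// Tukey estimator: quadratic-like near 0, constant (bounded) beyond c,
// so outliers stop pulling the estimate.
double tukey_rho(double x, double c)
{
  if (std::fabs(x) > c)
    return c * c / 6.0;
  double t = 1.0 - (x / c) * (x / c);
  return (c * c / 6.0) * (1.0 - t * t * t);
}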
    • (Figure) Mixture of a Normal distribution (inliers) and a uniform distribution (outliers); taking -log(.) of each term gives, respectively, the least-squares cost and the Tukey estimator. 25
    • Gaussian Noise on the Projections + 20% outliers + Tukey estimator. White cross: true camera position; black cross: global minimum of the objective function. The global minimum is very close to the true camera pose. BUT: there are local minima, and the objective function is flat where all the correspondences are considered outliers. 26
    • Gaussian Noise on the Projections + 50% outliers + Tukey estimator. Even more local minima; numerical optimization can get trapped in a local minimum. 27
    • RANSAC 28
    • How to Optimize? Idea: sample the space of solutions (the camera pose space here). 29
    • How to Optimize? Idea: sample the space of solutions, then run numerical optimization from the best sampled pose. Problem: exhaustive regular sampling is too expensive in 6 dimensions. Can we do a smarter sampling? 30
    • RANSAC: RANdom SAmple Consensus. Line fitting: the "throwing out the worst residual" heuristic can fail (example from the original paper [Fischler81]): the final least-squares solution is pulled away from the ideal line by the outlier. 31
    • RANSAC. As before, we could do a regular sampling, but it would not be optimal. 32
    • Idea: generate hypotheses from subsets of the measurements. If a subset contains no gross errors, the estimated parameters (the hypothesis) are close to the true ones. Take several subsets at random, retain the best one. 33
    • The quality of a hypothesis is evaluated by the number of measures that lie "close enough" to the predicted line. We need to choose a threshold T to decide if a measure is "close enough". RANSAC returns the best hypothesis, i.e. the hypothesis with the largest number of inliers: score = Σ_i [1 if dist(m_i, line(p)) <= T, 0 if dist(m_i, line(p)) > T]. 34
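A minimal RANSAC sketch (not from the slides) for this line-fitting example: hypotheses are lines through two points drawn at random, scored by the inlier count with threshold T.

#include <cmath>
#include <cstdlib>
#include <vector>

struct Pt { double x, y; };
struct Line { double a, b, c; };   // a*x + b*y + c = 0, with a^2 + b^2 = 1

// Line through two points, normalized so that |a*x + b*y + c| is the point-line distance.
Line lineThrough(const Pt& p, const Pt& q)
{
  double a = q.y - p.y, b = p.x - q.x;
  double n = std::sqrt(a * a + b * b);
  return { a / n, b / n, -(a * p.x + b * p.y) / n };
}

Line ransacLine(const std::vector<Pt>& pts, double T, int iterations)
{
  Line best = { 1, 0, 0 };
  int bestInliers = -1;
  int n = static_cast<int>(pts.size());
  for (int it = 0; it < iterations; ++it) {
    // 1. Hypothesis from a random minimal subset (2 points for a line).
    int i = std::rand() % n, j = std::rand() % n;
    if (i == j) continue;
    Line l = lineThrough(pts[i], pts[j]);
    // 2. Quality = number of measures lying closer than T to the predicted line.
    int inliers = 0;
    for (const Pt& m : pts)
      if (std::fabs(l.a * m.x + l.b * m.y + l.c) <= T) ++inliers;
    // 3. Retain the hypothesis with the largest number of inliers.
    if (inliers > bestInliers) { bestInliers = inliers; best = l; }
  }
  return best;   // to be refined afterwards, e.g. by a robust least-squares fit
}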
    • RANSAC for Homographies. To apply RANSAC to homography estimation, we need a way to compute a homography from a subset of measurements: each correspondence gives the two linear equations of slide 16, so four correspondences are enough to solve for H (up to scale). Since RANSAC only provides a solution estimated with a limited number of data, it must be followed by a robust minimization to refine the solution. 35
    • How to Get the Correspondences? Extract feature points / keypoints / regions (Harris corner detector, extrema of the Laplacian, affine region detectors, ...); standard approach: match them based on Euclidean distances between descriptors such as SIFT, SURF, ... 36
    • Affine Region Detectors: Hessian-Affine detector, MSER detector. 37
    • Affine Normalization. Warp by M1^(1/2) and M2^(1/2); we still have to correct for the orientation! 38
    • Select Canonical Orientation. Create a histogram of local gradient directions (over [0, 2π]) computed over the image patch; each gradient contributes its norm, weighted by its distance to the patch center; assign the canonical orientation at the peak of the smoothed histogram. 39
    • Select Canonical Orientation 40
    • Description Vector. 41
    • SIFT Description Vector. Made of local histograms of gradients. In practice: 8 orientations x 4 x 4 histograms = a 128-dimensional vector, normalised to be robust to light changes. 42
    • Matching Regions. 43
    • Matching: Approximate Nearest Neighbour. Best-Bin-First: approximate nearest-neighbour search in a k-d tree. 44
    • Keypoint Matching. The standard approach is a particular case of classification: pre-processing (to make the actual classification easier) followed by nearest-neighbor classification, i.e. a search in the database. Idea: let's try another classification method! 45
    • One Class per Keypoint: the set of the keypoint's possible appearances under various perspective, lighting, noise, ... 46
    • Training phase: patches of class 1, class 2, ... are used to train the classifier. Run-time: the classifier assigns an incoming patch to one of the classes. 47
    • Which Classifier? We want a classifier that: can handle many classes; is very fast; has reasonable recognition performance (a very high recognition rate is not a necessary requirement). 48
    • Which Classifier ?• Randomized Trees [Amit & Geman, 1997];• Random forests [Breiman, 2001]. 49
    • An (Ideal) Single Tree: each internal node applies a binary test, each leaf gives a class number. 50
    • How to Build the Tree? Given the training set, which binary test should be placed at each node? 51
    • The binary test at a node with training set S is found by minimizing the entropy after the test: argmin over tests of (|Sleft| / |S|) Entropy(Sleft) + (|Sright| / |S|) Entropy(Sright), where Sleft and Sright are the subsets of S sent to each child by the test. 52
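A minimal sketch (not from the slides) of this criterion: the weighted entropy of the class histograms of Sleft and Sright produced by a candidate test; the test with the lowest score is kept.

#include <cmath>
#include <vector>

// Entropy of a class histogram (number of training samples per class).
double entropy(const std::vector<int>& counts)
{
  double total = 0.0, h = 0.0;
  for (int c : counts) total += c;
  if (total == 0.0) return 0.0;
  for (int c : counts)
    if (c > 0) { double p = c / total; h -= p * std::log2(p); }
  return h;
}

// Weighted entropy after a candidate test that splits S into Sleft / Sright.
double splitScore(const std::vector<int>& left, const std::vector<int>& right)
{
  double nl = 0.0, nr = 0.0;
  for (int c : left)  nl += c;
  for (int c : right) nr += c;
  double n = nl + nr;
  return (nl / n) * entropy(left) + (nr / n) * entropy(right);
}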
    • Problem: building the tree this way quickly runs out of training samples for the deeper tests. 53
    • Idea: Use Several Sub-Optimal Trees. Each tree is trained with a random subset of the training set. 54
    • Idea: Use Several Sub-Optimal Trees. The leaves contain the probabilities over the classes, computed from the training set. 55
    • Classification with Several Sub-Optimal Trees. The test sample is dropped into each tree, and the probabilities in the leaves it reaches are averaged: for 3 trees, (1/3)(P1 + P2 + P3). 56
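A minimal sketch (not from the slides) of this classification step; the Tree type and its leafDistribution method are hypothetical stand-ins for dropping the sample down a tree and reading the class probabilities stored in the reached leaf.

#include <vector>

struct Patch { /* pixel data of the test sample */ };

struct Tree {
  // Hypothetical: drop the patch down the tree and return the class
  // distribution stored in the leaf it reaches.
  std::vector<double> leafDistribution(const Patch& p) const;
};

// Average the leaf distributions over the trees and return the most likely class.
int classify(const std::vector<Tree>& forest, const Patch& p, int numClasses)
{
  std::vector<double> avg(numClasses, 0.0);
  for (const Tree& t : forest) {
    std::vector<double> d = t.leafDistribution(p);
    for (int c = 0; c < numClasses; ++c) avg[c] += d[c];
  }
  int best = 0;
  for (int c = 1; c < numClasses; ++c)
    if (avg[c] > avg[best]) best = c;   // dividing by the number of trees does not change the argmax
  return best;
}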
    • Visual Interpretation. Each tree partitions the space in a different way and computes the probability of each class for each cell of the partition. 57
    • Visual Interpretation. Combining the trees gives a fine partition with a better estimate of the class probabilities. 58
    • For Patches. Possible tests: compare the intensities of two pixels around the keypoint after Gaussian smoothing: fi = 1 if Ĩ(m + dm_{i,1}) <= Ĩ(m + dm_{i,2}), 0 otherwise, where Ĩ is the image after Gaussian smoothing. Very efficient to compute; invariant to light change by any increasing (monotonic) function. 59
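A minimal sketch (not from the slides) of one such test on a smoothed image stored row by row; the offsets dm1 and dm2 are fixed once per test, relative to the keypoint position (u, v).

#include <cstdint>

struct Offset { int du, dv; };   // displacement relative to the keypoint

// Compare two smoothed intensities around the keypoint; the result is invariant
// to any monotonically increasing change of the intensities.
int binaryTest(const uint8_t* smoothed, int width,
               int u, int v, Offset dm1, Offset dm2)
{
  uint8_t i1 = smoothed[(v + dm1.dv) * width + (u + dm1.du)];
  uint8_t i2 = smoothed[(v + dm2.dv) * width + (u + dm2.du)];
  return (i1 <= i2) ? 1 : 0;
}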
    • Results 60
    • Randomized Trees (and Random Ferns) applied to image patches are becoming a powerful tool for Computer Vision. 61
    • [Shotton et al, CVPR'11] Used to infer body parts in the Kinect body tracking system. The tests rely on the depth map. 62
    • Tests in [Shotton et al, CVPR'11]. Classes are the body parts; the goal is to label each pixel with the label of the part it belongs to. Tests compare the depth of two pixels around the considered pixel. The displacements are normalized by the depth of the considered pixel for invariance: fi(m) = 1 if depth(m + dm1 / depth(m)) <= depth(m + dm2 / depth(m)), 0 otherwise. 63
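A minimal sketch (not from the paper's code) of this depth-normalized test, assuming the depth map is stored row by row in a vector; dividing the offsets by depth(m) makes the probed positions roughly invariant to the distance of the person to the camera.

#include <vector>

struct DepthOffset { float du, dv; };

int depthTest(const std::vector<float>& depth, int width,
              int u, int v, DepthOffset dm1, DepthOffset dm2)
{
  float d = depth[v * width + u];                    // depth of the considered pixel
  int u1 = u + static_cast<int>(dm1.du / d), v1 = v + static_cast<int>(dm1.dv / d);
  int u2 = u + static_cast<int>(dm2.du / d), v2 = v + static_cast<int>(dm2.dv / d);
  return (depth[v1 * width + u1] <= depth[v2 * width + u2]) ? 1 : 0;
}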
    • 3D Pose Estimation. Mean-Shift is used to find the joint locations from the body parts. 64
    • Training. Most of the training data is synthetic: "Training 3 trees to depth 20 from 1 million images takes about 1 day on a 1000 core cluster" [Shotton et al, CVPR'11]. 65
    • A Subtree. Average of the patches that reach this node. 66
    • [Gall and Lempitsky, CVPR'09; Barinova et al, CVPR'10] Hough Forests for Object Detection: Random Forests are used to make each patch vote for the object centroid; the tests compare the output of filters and histograms-of-gradients between 2 pixels; the leaves contain the displacement toward the object center. (Figure: each patch votes for the object centroid; the votes from all patches are accumulated to give the final detection.) 67
    • Tests used in [Gall and Lempitsky, CVPR'09]: fi(m) = 1 if channel_i(m + dm1) < channel_i(m + dm2) + τ, 0 otherwise. Channels: the 3 color channels, absolute values of the first and second derivatives of the image, and 9 channels from HoG (Histograms-of-Gradients). 68
    • [Bosch et al, ICCV'07] Image Classification using Random Forests and Ferns. Uses a sliding window to detect objects. Much faster than SVMs, with similar recognition performance. 69
    • [Bosch et al, ICCV'07] Tests: fi(m) = 1 if n^T x_m + b <= 0, 0 otherwise; n and b are a random vector and scalar; x_m is a vector computed from a Pyramidal Histogram-of-Gradients. 70
    • [Kalal et al, CVPR'10] TLD (aka Predator), for Track, Learn, Detect: Random Ferns are used to speed up detection; trained online: the distributions in the leaves are updated online, using the incoming images. 71
    • [Kalal et al, CVPR'10] Tests: 2-bit binary patterns. Trained online: the distributions in the leaves are updated online, using the incoming images. 72
    • Random Ferns: A Simplified Tree-Like Classifier. 73
    • For Keypoint Recognition, We Can Use Random Tests! Comparison of the recognition rates for 200 keypoints (recognition rate as a function of the number of trees): tests selected by minimizing entropy versus tests with random locations. 74
    • We can use random tests. For a small number of classes, we can try several tests and retain the best one according to some criterion. 75
    • We can use random tests. For a small number of classes, we can try several tests and retain the best one according to some criterion. When the number of classes is large, any test does a decent job. 76
    • Why it is Interesting: building the trees takes no time (we still have to estimate the posterior probabilities); it allows incremental learning; it simplifies the classifier structure. 77
    • The Tree Structure is not Needed 78
    • The Tree Structure is not Needed f1 f2 f3 79
    • The Tree Structure is not Needed. With a fixed sequence of tests f1, f2, f3, the distributions can be expressed simply, indexed by the results of the pixel comparisons (0 or 1) and the class label. 80
    • We are looking for argmax_i P(C = c_i | patch). If the patch can be represented by a set of image features { f_i }: P(C = c_i | patch) = P(C = c_i | f_1, f_2, ..., f_N), which is proportional to P(f_1, f_2, ..., f_N | C = c_i) P(C = c_i), but a complete representation of the joint distribution is infeasible. Naive Bayes ignores the correlations: Π_k P(f_k | C = c_i). Compromise (Ferns): group the features into small sets and model the joint distribution only within each group: Π_m P(F_m | C = c_i). 81
    • Training (slides 82-88): each training patch of a class is dropped through the ferns; the binary outcomes of the tests (e.g. 0 1 1 0 1 0) form a leaf index, and the counter of that leaf for the patch's class is incremented.
    • Training Results. Normalize so that the probabilities sum to 1 over all the leaves (binary codes 000, 001, ..., 111): Σ_leaves P(f_1, f_2, ..., f_n | C = c_i) = 1. 89-90
    • Recognition 91
    • Normalization. Normalize: Σ over the leaves 000, 001, ..., 111 of P(f_1, f_2, ..., f_n | C = c_i) = 1. 92
    • Subtlety with Normalization. p(leaf, class) = Number of samples(leaf, class) / Number of samples(class) is too selective: Number of samples(leaf, class) can be 0 simply because the training set is finite. Instead we use: p(leaf, class) = (Number of samples(leaf, class) + Nregularization) / (Number of samples(class) + Number of leaves × Nregularization). This can be done by simply initializing the counters to Nregularization instead of 0. 93
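A minimal training sketch (not from the slides) for one fern: counters are initialized to Nregularization, incremented with the leaf index reached by each training sample, then normalized per class exactly as in the regularized formula above. The fernIndex helper is a hypothetical stand-in for concatenating the S binary test results.

#include <utility>
#include <vector>

// Leaf index of a patch in one fern: concatenate the S binary test results.
int fernIndex(const std::vector<int>& testResults)
{
  int index = 0;
  for (int bit : testResults) index = (index << 1) | bit;
  return index;
}

// counts[c][leaf]: per-class distribution over the leaves of one fern.
// samples: (class, leaf index) pairs collected from the training patches.
std::vector<std::vector<double>>
trainFern(int numClasses, int numLeaves, double Nreg,
          const std::vector<std::pair<int, int>>& samples)
{
  std::vector<std::vector<double>> counts(numClasses,
                                          std::vector<double>(numLeaves, Nreg));
  for (const auto& s : samples) counts[s.first][s.second] += 1.0;

  // Normalization: each class distribution sums to 1, i.e.
  // p(leaf, class) = (N(leaf, class) + Nreg) / (N(class) + numLeaves * Nreg).
  for (auto& perClass : counts) {
    double total = 0.0;
    for (double v : perClass) total += v;
    for (double& v : perClass) v /= total;
  }
  return counts;
}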
    • Influence of Nregularization (plot: recognition rate as a function of Nregularization, on a log scale, with the regularized formula for p(leaf, class) given above). 94
    • Implementation of Feature Point Recognition with Ferns:
      // H: number of classes; M: number of ferns; S: number of tests per fern;
      // K: pointer to the keypoint location in the smoothed image; D: pixel offsets, two per test;
      // PF: table of per-fern, per-leaf class scores; P: output scores, one per class.
      for (int i = 0; i < H; i++) P[i] = 0.;
      for (int k = 0; k < M; k++) {
        int index = 0;
        int* d = D + k * 2 * S;                // offsets of the k-th fern's tests
        for (int j = 0; j < S; j++) {
          index <<= 1;                         // append the result of the test as one bit
          if (*(K + d[0]) < *(K + d[1]))
            index++;
          d += 2;
        }
        const float* p = PF + k * shift2 + index * shift1;  // leaf reached in the k-th fern
        for (int i = 0; i < H; i++) P[i] += p[i];           // combine the scores over the ferns
      }
      • Very simple to implement; no need for orientation, perspective, or light correction. 95
    • Ferns versus SIFT (scatter plot: number of inliers for SIFT versus number of inliers for Ferns; each point corresponds to an image from a 1000-frame sequence). Ferns are much faster, sometimes more accurate, but SIFT does not need training. 96
    • Randomized Trees vs Ferns. Different combination strategies: average (RT) / product (Ferns). (Plot: recognition rate as a function of the number of structures, for Ferns and RT with random tests, each combined by product or by average.) Ferns are more discriminant but more sensitive to outliers. 97
    • Randomized Trees vs Ferns. Influence of the number of classes (plot: recognition rate for Ferns with product and Ferns with average). 98
    • Memory and Computation Time. Recognition time grows linearly with the number of Trees/Ferns and the number of classes. Recognition time grows linearly with the logarithm of the depth of Trees/Ferns. Memory grows linearly with the number of Trees/Ferns and the number of classes. Memory grows exponentially with the depth of Trees/Ferns. Increasing the depth may result in overfitting. Increasing the number of Trees/Ferns (usually) improves recognition. 99
    • Influence of the Number of Ferns (plot: recognition rate as a function of the number of structures, for Ferns/RT with product/average). Increasing the number of Ferns/Trees improves the recognition rate, but increases the computation time and memory. 100
    • Number of Ferns / Number of Leaves / Memory / Computation Time (plots: recognition rate, number of Ferns, and computation time as a function of the Fern size). 101
    • Conclusions on Randomized Trees and Ferns: simple to implement, Ferns even simpler; both very fast, but dumb: they need a lot of training examples to learn; they use a lot of memory to store the posterior distributions in the leaves. 102
    • We now have correspondences between a reference image of the object and the input image. Some correspondences are correct, some are not. We can estimate the homography between the 2 images by applying RANSAC on subsets of 4 correspondences. 103
    • Computing a Homography from Point Correspondences by solving a linear system. m' = Hm, with m = [u, v, 1]^T and m' = [k u', k v', k]^T (homogeneous coordinates, defined up to scale). Each correspondence gives two linear equations: [u v 1 0 0 0 -uu' -vu' -u'] h = 0 and [0 0 0 u v 1 -uv' -vv' -v'] h = 0, with h = [H11 H12 H13 H21 H22 H23 H31 H32 H33]^T. 104
    • Computing a Homography from Point Correspondences by solving a linear system. m' = Hm, with m = [u, v, 1]^T and m' = [k u', k v', k]^T; dividing by the third coordinate: u' = (H11 u + H12 v + H13) / (H31 u + H32 v + H33) and v' = (H21 u + H22 v + H23) / (H31 u + H32 v + H33). 105
    • Computing a Homography from Point Correspondences by solving a linear system. Multiplying out the two ratios gives, for each correspondence, [u v 1 0 0 0 -uu' -vu' -u'] X = 0 and [0 0 0 u v 1 -uv' -vv' -v'] X = 0. Using four correspondences: B X = 0_8, with X = [H11, H12, H13, H21, H22, H23, H31, H32, H33]^T. 106
    • How to Solve this Linear System? B X = 0_8, with X = [H11, H12, H13, H21, H22, H23, H31, H32, H33]^T. X lies in the null space of B; in practice, take the eigenvector of B^T B corresponding to the smallest eigenvalue (equivalently, the right singular vector of B associated with the smallest singular value). 107
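A minimal sketch (not from the slides) of this step with Eigen: build B from the correspondences using the two rows given above, then take the right singular vector associated with the smallest singular value.

#include <Eigen/Dense>
#include <vector>

struct Corr { double u, v, up, vp; };   // m = (u, v), m' = (up, vp)

// Homography from (at least) 4 correspondences by the linear (DLT) method.
Eigen::Matrix3d homographyDLT(const std::vector<Corr>& corrs)
{
  Eigen::MatrixXd B(2 * static_cast<int>(corrs.size()), 9);
  for (int i = 0; i < static_cast<int>(corrs.size()); ++i) {
    const Corr& c = corrs[i];
    B.row(2 * i)     << c.u, c.v, 1, 0, 0, 0, -c.u * c.up, -c.v * c.up, -c.up;
    B.row(2 * i + 1) << 0, 0, 0, c.u, c.v, 1, -c.u * c.vp, -c.v * c.vp, -c.vp;
  }
  // X is the singular vector of B associated with the smallest singular value
  // (last column of V); like H, it is defined up to a scale factor.
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(B, Eigen::ComputeFullV);
  Eigen::VectorXd h = svd.matrixV().col(8);
  Eigen::Matrix3d H;
  H << h(0), h(1), h(2),
       h(3), h(4), h(5),
       h(6), h(7), h(8);
  return H;
}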
    • Computing a Homography from Point Correspondences with a non-linear optimization. Non-linear least-squares minimization: minimization of a physical, meaningful error (the reprojection error, in pixels): min over R,T of Σ_i dist(H_{R,T} m_i, m'_i)^2. Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient). 108
    • Numerical Optimization. Start from an initial guess p0. p0 can be taken randomly but should be as close as possible to the global minimum: the pose computed at time t-1; the pose predicted from the pose computed at time t-1 and a motion model; ... 109
    • Numerical Optimization. General methods: gradient descent / steepest descent; conjugate gradient; ... Non-linear least-squares optimization: Gauss-Newton; Levenberg-Marquardt; ... 110
    • Numerical Optimization. We want to find p that minimizes E(p) = Σ_i dist(H_{R(p),T(p)} m_i, m'_i)^2 = ||f(p) - b||^2, where: p is a vector of parameters that define the camera pose (translation vector + parameters of the rotation matrix); b is a vector made of the measurements (here the m'_i); f is the function that relates the camera pose to these measurements: f(p) = [u(H_{R(p),T(p)} m_1), v(H_{R(p),T(p)} m_1), ...]^T and b = [u(m'_1), v(m'_1), ...]^T. 111
    • Gradient Descent / Steepest Descent. p_{i+1} = p_i - λ ∇E(p_i). Since E(p_i) = ||f(p_i) - b||^2 = (f(p_i) - b)^T (f(p_i) - b), we have ∇E(p_i) = 2 J^T (f(p_i) - b), with J the Jacobian matrix of f computed at p_i. Weaknesses: how to choose λ? It needs a lot of iterations in long and narrow valleys. 112
    • The Gauss-Newton and Levenberg-Marquardt Algorithms. But first, the linear least-squares case: E(p) = ||f(p) - b||^2. If the function f is linear, i.e. f(p) = A p, then p can be estimated as p = A+ b, where A+ is the pseudo-inverse of A: A+ = (A^T A)^(-1) A^T. 113
    • Non-Linear Least-Squares: the Gauss-Newton Algorithm. Iteration steps: p_{i+1} = p_i + Δ_i, where Δ_i is chosen to minimize the residual ||f(p_{i+1}) - b||^2. It is computed by approximating f to the first order (f(p_i + Δ) ≈ f(p_i) + J Δ): Δ_i = argmin_Δ ||f(p_i + Δ) - b||^2 ≈ argmin_Δ ||f(p_i) + J Δ - b||^2 = argmin_Δ ||ε_i + J Δ||^2, where ε_i = f(p_i) - b denotes the residual at iteration i. Δ_i is the solution of the system J Δ = -ε_i in the least-squares sense: Δ_i = -J+ ε_i, where J+ is the pseudo-inverse of J. 114
    • Non-Linear Least-Squares: the Levenberg-Marquardt Algorithm. In the Gauss-Newton algorithm: Δ_i = -(J^T J)^(-1) J^T ε_i. In the Levenberg-Marquardt algorithm: Δ_i = -(J^T J + λI)^(-1) J^T ε_i. Levenberg-Marquardt algorithm: 0. Initialize λ with a small value: λ = 0.001. 1. Compute Δ_i and E(p_i + Δ_i). 2. If E(p_i + Δ_i) > E(p_i): λ ← 10 λ and go back to 1 [happens when the linear approximation of f is too coarse]. 3. If E(p_i + Δ_i) < E(p_i): λ ← λ / 10, p_{i+1} ← p_i + Δ_i and go back to 1. Once converged, set λ ← 0 and continue up to convergence. 115
    • Non-Linear Least-Squares: the Levenberg-Marquardt Algorithm. Δ_i = -(J^T J + λI)^(-1) J^T ε_i. When λ is small, LM behaves similarly to the Gauss-Newton algorithm. When λ becomes large, LM behaves similarly to a steepest descent, which guarantees convergence. 116
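A minimal sketch (not from the slides) of the λ schedule above with Eigen; f and jacobian are assumed to be callables returning f(p) and its Jacobian at p, and b is the measurement vector.

#include <Eigen/Dense>
#include <functional>

Eigen::VectorXd levenbergMarquardt(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)>& jacobian,
    const Eigen::VectorXd& b, Eigen::VectorXd p, int maxIterations)
{
  double lambda = 0.001;                              // 0. start with a small lambda
  const int n = static_cast<int>(p.size());
  for (int it = 0; it < maxIterations; ++it) {
    Eigen::VectorXd eps = f(p) - b;                   // residual at the current estimate
    Eigen::MatrixXd J = jacobian(p);
    // 1. LM step: delta = -(J^T J + lambda I)^-1 J^T eps.
    Eigen::MatrixXd A = J.transpose() * J + lambda * Eigen::MatrixXd::Identity(n, n);
    Eigen::VectorXd delta = A.ldlt().solve(-J.transpose() * eps);
    // 2. / 3. Accept or reject the step depending on the new error.
    double oldError = eps.squaredNorm();
    double newError = (f(p + delta) - b).squaredNorm();
    if (newError > oldError) {
      lambda *= 10.0;           // linear approximation too coarse: be more conservative
    } else {
      lambda /= 10.0;           // good step: accept it and behave more like Gauss-Newton
      p += delta;
    }
  }
  return p;
}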
    • Another Way to Refine the Pose: Template Matching 117
    • Global region tracking by minimizing cross-correlation: useful for objects difficult to model using local features; accurate. 118
    • Lucas-Kanade Algorithm. min over p of Σ_j (W(I, p)[m_j] - T[m_j])^2, where W(I, p) is the input image I warped with parameters p and T is the template. Gauss-Newton step: Δ_i = J+ · ε_{p,I}, where J+ is the pseudo-inverse of the Jacobian of W(I, p) evaluated at p and the m_j, and ε_{p,I} = (..., T[m_j] - W(I, p)[m_j], ...)^T. 119
    • Lucas-Kanade Algorithm. Computing J and J+ is computationally expensive. 120
    • Inverse Compositional Algorithm [Baker et al., IJCV'03]. p_i = p_{i-1} + dp_i, with dp_i = J+_{p=0} ε_{p=0,I}. J_{p=0} is a constant matrix and can therefore be precomputed! 121
    • ESM (Efficient Second-order Method). (1) I = T + J_{p=0} dp + dp^T H_{p=0} dp [second-order Taylor expansion]. (2) J_{p=dp} = J_{p=0} + 2 dp^T H_{p=0} [derivative of (1) with respect to p]. (3) dp^T H_{p=0} = ½ (J_{p=dp} - J_{p=0}) [from Equation (2)]. (4) I = T + [J_{p=0} + ½ (J_{p=dp} - J_{p=0})] dp [by injecting (3) into (1)]. (5) dp = [½ (J_{p=0} + J_{p=dp})]+ (I - T) [from Equation (4)]. Like Gauss-Newton, but J_{p=0} is replaced by ½ (J_{p=0} + J_{p=dp}). J_{p=dp} and a pseudo-inverse must be computed at each iteration, but far fewer iterations are needed. 122
    • BRIEF [ECCV'10]: a very fast feature point descriptor. 123
    • Remark. Moving legacy code to new CPUs does not result in a speed-up anymore; we should consider the features of new platforms: parallelism (multi-cores, GPU), locality, ... 124
    • (Figure) After Gaussian smoothing, a set of pairwise intensity comparisons produces a binary string (1 1 0 ... 0 1): the BRIEF descriptor. 125
    • (Figure) Alternatively, the smoothing can be done using integral images when building the BRIEF descriptor. 126
    • Integral Images. IntegralImage(u, v) = Σ_{i=1..u} Σ_{j=1..v} Image(i, j). 127
    • How to Use Integral Images: the sum of the intensities over a rectangular region is obtained from the integral image values at its four corners (two subtractions and one addition). 128
    • [Viola & Jones, IJCV'01]: features computed in constant time. 129
    • Computing Integral Images. IntegralImage[u][v] = IntegralImage[u][v-1] + LineBuffer[u] + Image[u][v]. 130
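A minimal sketch (not from the slides) of the computation and of the constant-time box sum; it uses a running row sum, which is a common equivalent of the LineBuffer formulation of the slide.

#include <cstdint>
#include <vector>

// integral[v][u] = sum of image[j][i] for j < v and i < u (one extra row/column of zeros).
std::vector<std::vector<long long>>
computeIntegralImage(const std::vector<std::vector<uint8_t>>& image)
{
  size_t h = image.size(), w = image[0].size();
  std::vector<std::vector<long long>> integral(h + 1, std::vector<long long>(w + 1, 0));
  for (size_t v = 0; v < h; ++v) {
    long long rowSum = 0;                    // running sum of the current row
    for (size_t u = 0; u < w; ++u) {
      rowSum += image[v][u];
      integral[v + 1][u + 1] = integral[v][u + 1] + rowSum;
    }
  }
  return integral;
}

// Sum of the intensities in the rectangle [u0, u1) x [v0, v1): four lookups,
// independent of the rectangle size.
long long boxSum(const std::vector<std::vector<long long>>& integral,
                 size_t u0, size_t v0, size_t u1, size_t v1)
{
  return integral[v1][u1] - integral[v0][u1] - integral[v1][u0] + integral[v0][u0];
}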
    • Evaluation 131
    • Evaluation 132
    • Computation Speed. For BRIEF, most of the time is spent in Gaussian smoothing. 133
    • Matching Speed. distance(BRIEF descriptor 1, BRIEF descriptor 2) = Hamming distance(BRIEF descriptor 1, BRIEF descriptor 2) = number of bits set to 1 in (BRIEF descriptor 1 xor BRIEF descriptor 2) = popcount(BRIEF descriptor 1 xor BRIEF descriptor 2). 10- to 15-fold speed increase on Intel's Bloomfield (SSE 4.2) and AMD's Phenom (SSE4a). 134
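A minimal sketch (not from the slides) of this matching step for a 256-bit BRIEF descriptor stored as four 64-bit words; __builtin_popcountll is the GCC/Clang builtin that maps to the POPCNT instruction on the CPUs mentioned above.

#include <cstdint>

struct BriefDescriptor { uint64_t bits[4]; };   // 256-bit binary descriptor

// Hamming distance: number of bits set to 1 in the XOR of the two descriptors.
int hammingDistance(const BriefDescriptor& a, const BriefDescriptor& b)
{
  int d = 0;
  for (int i = 0; i < 4; ++i)
    d += __builtin_popcountll(a.bits[i] ^ b.bits[i]);
  return d;
}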
    • Matching Speed 135
    • Picking the Locations of the tests (two slides): uniform distribution; Gaussian distribution; Gaussian distribution for location and length; uniform distribution; census transform on polar coordinate locations. 136-137
    • Rotation and Scale Invariance 138
    • Rotation and Scale Invariance. Duplicate the descriptors: 18 rotations x 3 scales. 139
    • Code released under the GPL on the CVLab website. 140
    • DOT [CVPR'10]: a dense descriptor for object detection. Joint work with Stefan Hinterstoisser (TU Munich). 141
    • Object detection with a sliding window and template matching: template matching with an efficient representation of the images and the templates. 142
    • 143
    • 144
    • Initial Similarity Measure 145
    • Making the Similarity Measure Robust to Small Motions 146
    • Downsampling 147
    • Ignoring the Dependencies between the Regions... 148
    • Lists of Dominant Orientations 149
    • Fast Computation with Bitwise Operations (e.g. 00010000, 00001100). 150
    • Code available under the LGPL license at http://campar.in.tum.de/personal/hinterst/index/ 151
    • New Method, LINE [PAMI, under revision]. 152
    • Initial Similarity Measure (previous measure): E_Steger(I, O, c) = Σ_r |cos(orientation(O, r) - orientation(I, c + r))|. 153
    • Making the Similarity Measure Robust to Small Motions. E_Steger(I, O, c) = Σ_r |cos(orientation(O, r) - orientation(I, c + r))| becomes E(I, O, c) = Σ_r max over t in region(c + r) of |cos(orientation(O, r) - orientation(I, t))|. 154
    • Avoiding Recomputing the max Operator. 1. Spread the gradients. 155
    • 2. Precompute response maps. Because we consider only a discrete set of gradient directions and we do not consider the gradient norms, we can precompute a response for each region in the image and each gradient direction in the template. 156
    • Optimized Version. 1. The sets of orientations in the image regions are encoded with a binary representation (e.g. 11010). 157
    • Optimized Version. 2. The binary representation is used as an index into lookup tables with the precomputed responses for each gradient direction in the template. 158
    • Avoiding Cache Misses. The response maps are re-arranged into linear memories. 159
    • Using the Linear Memories The similarity measure can be computed for all the image locations by summing linear memories, shifted by an offset that depends on the template. 160
    • Advantage of Linearizing the Memory (plot: speed-up factor). 161
    • DOT [CVPR'10] versus LINE. 162
    • LINE-MOD [Hinterstoisser et al, ICCV'11]. Extension to the Kinect: the templates combine the image and the depth map. 163
    • thanks! 164