Lecture 5

  1. Computer Vision: Least Squares Minimization
     IIT Kharagpur
     Computer Science and Engineering, Indian Institute of Technology Kharagpur.
  2. Solution of linear equations
     Consider a system of equations of the form Ax = b, where A is an m × n matrix.
     - If m < n there are more unknowns than equations. In this case there is not a unique solution, but rather a vector space of solutions.
     - If m = n there is a unique solution as long as A is invertible.
     - If m > n there are more equations than unknowns, and in general the system has no solution.
  3. Least-squares solution: full-rank case
     Consider the case m > n and assume that A has rank n. We seek a vector x that comes closest to providing a solution to the system Ax = b, i.e. an x such that ||Ax − b|| is minimized. Such an x is known as the least-squares solution to the over-determined system.
     Write the SVD A = UDV^T. We seek x that minimizes ||Ax − b|| = ||UDV^T x − b||. Because orthogonal transforms preserve the norm, ||UDV^T x − b|| = ||DV^T x − U^T b||. Writing y = V^T x and b' = U^T b, the problem becomes one of minimizing ||Dy − b'||, where D is a diagonal matrix.
  4. In components, Dy is the vector (d_1 y_1, d_2 y_2, ..., d_n y_n, 0, ..., 0)^T, while b' = (b'_1, ..., b'_n, b'_{n+1}, ..., b'_m)^T; the last m − n rows of D are zero.
     The nearest that Dy can approach b' is therefore the vector (b'_1, b'_2, ..., b'_n, 0, ..., 0)^T.
     This is achieved by setting y_i = b'_i / d_i for i = 1, ..., n.
     The assumption rank A = n ensures that d_i ≠ 0.
     Finally, x is retrieved from x = Vy.
  5. Algorithm: least squares
     Objective: find the least-squares solution to the m × n set of equations Ax = b, where m > n and rank A = n.
     Algorithm:
     (i) Find the SVD A = UDV^T.
     (ii) Set b' = U^T b.
     (iii) Find the vector y defined by y_i = b'_i / d_i, where d_i is the i-th diagonal entry of D.
     (iv) The solution is x = Vy.
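
A minimal numpy sketch of the algorithm above. The matrix A and vector b below are made-up example data, not part of the slides:

```python
import numpy as np

# SVD-based least squares for an over-determined system Ax = b.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # m = 4 equations, n = 2 unknowns, rank 2
b = np.array([6.1, 7.9, 10.2, 11.8])

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(d) V^T
b_prime = U.T @ b                                  # step (ii): b' = U^T b
y = b_prime / d                                    # step (iii): y_i = b'_i / d_i
x = Vt.T @ y                                       # step (iv): x = V y

print(x)        # agrees with np.linalg.lstsq(A, b, rcond=None)[0]
```
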
  6. Pseudo-inverse
     Given a square diagonal matrix D, its pseudo-inverse is the diagonal matrix D^+ with entries
         D^+_ii = 0 if D_ii = 0, and D^+_ii = D_ii^-1 otherwise.
     For an m × n matrix A with m ≥ n, let the SVD be A = UDV^T. The pseudo-inverse of A is A^+ = VD^+U^T.
     The least-squares solution to an m × n system of equations Ax = b of rank n is given by x = A^+ b. In the case of a rank-deficient system, x = A^+ b is the solution that minimizes ||x||.
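
A sketch of the pseudo-inverse construction A^+ = VD^+U^T. The small tolerance standing in for the exact test D_ii = 0, and the example matrix, are assumptions for illustration:

```python
import numpy as np

# Pseudo-inverse via the SVD: invert non-zero singular values, leave zero ones at zero.
def pseudo_inverse(A, tol=1e-12):
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])
    return Vt.T @ np.diag(d_plus) @ U.T            # A+ = V D+ U^T

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])                         # rank-deficient 3 x 2 example
print(np.allclose(pseudo_inverse(A), np.linalg.pinv(A)))   # True
```
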
  7. Linear least squares using normal equations
     Consider a system of equations of the form Ax = b, where A is an m × n matrix with m > n.
     - In general, no solution x will exist for this set of equations. Consequently, the task is to find the vector x that minimizes the norm ||Ax − b||.
     - As the vector x varies over all values, the product Ax varies over the complete column space of A, i.e. the subspace of R^m spanned by the columns of A.
     - The task is therefore to find the vector in the column space of A that is closest to b.
  8. Linear least squares using normal equations
     Let x be the solution to this problem, so that Ax is the closest point to b. The difference Ax − b must then be orthogonal to the column space of A, i.e. perpendicular to each of the columns of A. Hence
         A^T (Ax − b) = 0
         (A^T A) x = A^T b
     The solution is given as x = (A^T A)^-1 A^T b = A^+ b, with A^+ = (A^T A)^-1 A^T.
     The pseudo-inverse of A computed via the SVD is A^+ = VD^+U^T.
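
A sketch of the normal-equations route for the same kind of over-determined system (the data are again a made-up example):

```python
import numpy as np

# Normal equations: (A^T A) x = A^T b.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([6.1, 7.9, 10.2, 11.8])

x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)        # matches the SVD-based solution when A has full rank
```
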
  9. Least-squares solution of homogeneous equations
     We now solve a set of equations of the form Ax = 0.
     - x has a homogeneous representation: if x is a solution, then kx is also a solution for any scalar k.
     - A reasonable constraint is therefore to seek a solution with ||x|| = 1.
     - In general such a set of equations will not have an exact solution, so the problem is to find x that minimizes ||Ax|| subject to ||x|| = 1.
  10. Least-squares solution of homogeneous equations
     Let A = UDV^T. We need to minimize ||UDV^T x||.
     - Note that ||UDV^T x|| = ||DV^T x||, so we need to minimize ||DV^T x||.
     - Note that ||x|| = ||V^T x||, so the condition becomes ||V^T x|| = 1.
     - Let y = V^T x; we then minimize ||Dy|| subject to ||y|| = 1.
     - Since D is a diagonal matrix with its entries in descending order, the solution is y = (0, 0, ..., 0, 1)^T.
     - Since y = V^T x, the solution x = Vy is simply the last column of V.
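
A short sketch of the homogeneous case. With numpy's convention A = U diag(d) Vt, the last column of V is the last row of Vt; the random 10 × 4 system is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))        # hypothetical over-determined system Ax = 0

U, d, Vt = np.linalg.svd(A)
x = Vt[-1]                              # column of V for the smallest singular value

print(np.linalg.norm(x))                # 1.0 by construction
print(np.linalg.norm(A @ x))            # the minimum achievable ||Ax|| over ||x|| = 1
```
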
  11. Iterative estimation techniques
     X = f(P), where X is a measurement vector in R^N and P is a parameter vector in R^M.
     We seek the vector P satisfying X = f(P) − ε for which ||ε|| is minimized.
     The linear least-squares problem is exactly of this type, with the function f being the linear function f(P) = AP.
  12. Iterative estimation methods
     If the function f is not linear, we use iterative estimation techniques.
  13. Iterative estimation methods
     We start with an initial estimate P_0 and proceed to refine the estimate under the assumption that the function f is locally linear.
     Let ε_0 = f(P_0) − X.
     We assume that the function is approximated at P_0 by f(P_0 + ∆) = f(P_0) + J∆, where J is the linear mapping represented by the Jacobian matrix J = ∂f/∂P.
  14. Iterative estimation methods
     We seek a point f(P_1), with P_1 = P_0 + ∆, which minimizes
         f(P_1) − X = f(P_0) + J∆ − X = ε_0 + J∆.
     Thus it is required to minimize ||ε_0 + J∆|| over ∆, which is a linear minimization problem. The vector ∆ is obtained by solving the normal equations
         J^T J ∆ = −J^T ε_0,   i.e.   ∆ = −J^+ ε_0.
  15. Iterative estimation methods
     The solution vector P is obtained by starting with an estimate P_0 and computing successive approximations according to the formula
         P_{i+1} = P_i + ∆_i,
     where ∆_i is the solution to the linear least-squares problem J∆ = −ε_i. The matrix J is the Jacobian ∂f/∂P evaluated at P_i, and ε_i = f(P_i) − X.
     The algorithm converges to a least-squares solution P. Convergence can take place to a local minimum, or there may be no convergence at all.
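
A sketch of the update P_{i+1} = P_i + ∆_i described above, using a hypothetical exponential model f and a numerical Jacobian; both the model and the data are assumptions made for illustration:

```python
import numpy as np

def f(P):
    a, b = P
    t = np.linspace(0.0, 1.0, 20)
    return a * np.exp(b * t)            # simple non-linear measurement model

def numerical_jacobian(f, P, h=1e-6):
    f0 = f(P)
    J = np.zeros((f0.size, P.size))
    for j in range(P.size):
        Ph = P.copy()
        Ph[j] += h
        J[:, j] = (f(Ph) - f0) / h      # forward-difference approximation of df/dP_j
    return J

# Synthetic measurements from "true" parameters (2.0, -1.5) plus small noise.
X = f(np.array([2.0, -1.5])) + 0.01 * np.random.default_rng(1).standard_normal(20)

P = np.array([1.0, -1.0])               # initial estimate P0
for _ in range(20):
    eps = f(P) - X
    J = numerical_jacobian(f, P)
    delta, *_ = np.linalg.lstsq(J, -eps, rcond=None)   # solves J^T J delta = -J^T eps
    P = P + delta
    if np.linalg.norm(delta) < 1e-10:
        break

print(P)                                # converges close to the true (2.0, -1.5)
```
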
  16. Newton's method
     We now consider finding minima of functions of many variables. Consider an arbitrary scalar-valued function g(P), where P is a vector. The optimization problem is simply to minimize g(P) over all values of P.
     Expanding g(P) about P_0 in a Taylor series gives
         g(P_0 + ∆) = g + g_P ∆ + ∆^T g_PP ∆ / 2 + ...
     where g_P denotes the derivative of g(P) with respect to P, and g_PP the derivative of g_P with respect to P.
  17. Newton's method
     Differentiating the Taylor series with respect to ∆ and setting the result to zero gives
         g_P + g_PP ∆ = 0,   i.e.   ∆ = −g_PP^-1 g_P.
     Hessian matrix: g_PP is the matrix of second derivatives, the Hessian of g; its (i, j)-th entry is ∂²g/∂p_i ∂p_j, where p_i and p_j are the i-th and j-th parameters. The vector g_P is the gradient of g.
     The method of Newton iteration consists in starting with an initial value of the parameters, P_0, and iteratively computing parameter increments ∆ until convergence occurs.
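
A sketch of Newton iteration on a hypothetical quadratic cost with hand-coded gradient and Hessian; the function g below is an assumption chosen so that a single Newton step reaches the minimum:

```python
import numpy as np

def g(P):
    x, y = P
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2 + x * y

def grad(P):                            # g_P, the gradient of g
    x, y = P
    return np.array([2.0 * (x - 1.0) + y, 20.0 * (y + 2.0) + x])

def hess(P):                            # g_PP, constant for a quadratic g
    return np.array([[2.0, 1.0],
                     [1.0, 20.0]])

P = np.array([5.0, 5.0])                # initial value P0
for _ in range(10):
    delta = np.linalg.solve(hess(P), -grad(P))   # g_PP Delta = -g_P
    P = P + delta
    if np.linalg.norm(delta) < 1e-12:
        break

print(P, g(P))                          # for a quadratic cost, one step reaches the minimum
```
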
  18. Gauss-Newton method
     Consider the special case in which g is the squared norm of an error function:
         g(P) = (1/2) ||ε(P)||² = (1/2) ε(P)^T ε(P),   with   ε(P) = f(P) − X.
     Here ε(P) is the error function, a vector-valued function of the parameter P.
     The gradient is g_P = ε_P^T ε, where ε_P = ∂ε/∂P = ∂f/∂P = f_P.
     We know that f_P = J, so ε_P = J and hence g_P = J^T ε.
  19. Gauss-Newton method
     Consider the second derivative. From g_P = ε_P^T ε it follows that
         g_PP = ε_P^T ε_P + ε_PP^T ε.
     Since ε_P = f_P, and assuming that f(P) is linear, ε_PP vanishes, so
         g_PP = ε_P^T ε_P = J^T J.
     We have thus obtained an approximation of the second derivative g_PP. Using Newton's equation g_PP ∆ = −g_P we get
         J^T J ∆ = −J^T ε.
     This is the Gauss-Newton method, in which the Hessian of the function g(P) is approximated by g_PP = J^T J.
  20. Gradient descent
     The gradient of g(P) is g_P = ε_P^T ε. The negative gradient −g_P = −ε_P^T ε defines the direction of most rapid decrease of the cost function.
     Gradient descent is a strategy for minimizing g in which we move iteratively in the negative gradient direction, taking small steps in the direction of descent:
         ∆ = −g_P / λ,   where λ controls the length of the step.
     Recall that in Newton's method the step is ∆ = −g_PP^-1 g_P; gradient descent therefore corresponds to approximating the Hessian g_PP by the scalar matrix λI.
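
For comparison, a sketch of the plain gradient-descent step ∆ = −g_P / λ on a hypothetical quadratic cost; the cost function and the value of λ are arbitrary choices for illustration:

```python
import numpy as np

def grad(P):                             # gradient of (x-1)^2 + 10(y+2)^2
    x, y = P
    return np.array([2.0 * (x - 1.0), 20.0 * (y + 2.0)])

P = np.array([5.0, 5.0])
lam = 25.0                               # 1/lam is the step length
for _ in range(500):
    P = P - grad(P) / lam

print(P)    # creeps towards (1, -2); many iterations are needed, illustrating slow convergence
```
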
  21. Gradient descent
     Gradient descent by itself is not a very good minimization strategy; it is typically characterized by slow convergence due to zig-zagging.
     However, gradient descent can be quite useful in conjunction with Gauss-Newton iteration as a way of getting out of tight corners.
     The Levenberg-Marquardt method is essentially a Gauss-Newton method that transitions smoothly to gradient descent when the Gauss-Newton updates fail.
  22. Summary
     g(P) is an arbitrary scalar-valued function, with g(P) = ε(P)^T ε(P) / 2.
     - Newton's method: g_PP ∆ = −g_P, where g_PP = ε_P^T ε_P + ε_PP^T ε and g_P = ε_P^T ε.
     - Gauss-Newton: ε_P^T ε_P ∆ = −ε_P^T ε. The Hessian is approximated as ε_P^T ε_P; the cost function is approximated as quadratic near the minimum.
     - Gradient descent: λ∆ = −ε_P^T ε = −g_P. The Hessian is replaced by λI.
  23. Levenberg-Marquardt (LM) iteration
     This is a slight variation of the Gauss-Newton iteration method. The normal equations are replaced by the augmented normal equations:
         J^T J ∆ = −J^T ε   −→   (J^T J + λI) ∆ = −J^T ε.
     The value of λ varies from iteration to iteration. A typical initial value of λ is 10^-3 times the average of the diagonal elements of J^T J.
  24. Levenberg-Marquardt (LM) iteration
     - If the value of ∆ obtained by solving the augmented normal equations leads to a reduction of the error, the increment is accepted and λ is divided by a factor (typically 10) before the next iteration.
     - If the value of ∆ leads to an increased error, λ is multiplied by the same factor and the augmented normal equations are solved again. This process continues until a value of ∆ is found that gives rise to a decreased error.
     The process of repeatedly solving the augmented normal equations for different values of λ until an acceptable ∆ is found constitutes one iteration of the LM algorithm.
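
A sketch of the LM loop just described, reusing a hypothetical exponential model and a numerical Jacobian as in the earlier sketch; the factor of 10 and the initial λ follow the slides, while the iteration caps are assumptions added to keep the sketch safe:

```python
import numpy as np

def f(P):
    a, b = P
    t = np.linspace(0.0, 1.0, 20)
    return a * np.exp(b * t)

def jac(P, h=1e-6):
    f0 = f(P)
    return np.column_stack([(f(P + h * e) - f0) / h for e in np.eye(P.size)])

X = f(np.array([2.0, -1.5]))                         # synthetic "measurements"
P = np.array([1.0, -1.0])                            # initial estimate P0
lam = 1e-3 * np.mean(np.diag(jac(P).T @ jac(P)))     # typical initial lambda

for _ in range(30):                                  # outer LM iterations
    eps = f(P) - X
    J = jac(P)
    accepted = False
    for _ in range(20):                              # retries within one LM iteration
        delta = np.linalg.solve(J.T @ J + lam * np.eye(len(P)), -J.T @ eps)
        if np.linalg.norm(f(P + delta) - X) < np.linalg.norm(eps):
            P, lam, accepted = P + delta, lam / 10.0, True   # accept step, decrease lambda
            break
        lam *= 10.0                                  # reject step, increase lambda, retry
    if not accepted:                                 # no improving step found: stop
        break

print(P)                                             # ends up near the true (2.0, -1.5)
```
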
  25. Robust cost functions
     Squared error (convex): [plots of the PDF and the attenuation function]
  26. Robust cost functions
     Blake-Zisserman (non-convex): [plots of the PDF and the attenuation function]
     Corrupted Gaussian (non-convex): [plots of the PDF and the attenuation function]
  27. Robust cost functions
     Cauchy (non-convex): [plots of the PDF and the attenuation function]
     L1 cost (convex): [plots of the PDF and the attenuation function]
  28. Robust cost functions
     Huber (convex): [plots of the PDF and the attenuation function]
     Pseudo-Huber (convex): [plots of the PDF and the attenuation function]
  29. Squared-error cost function
     C(δ) = δ²,   PDF = exp(−C(δ)).
     - Its main drawback is that it is not robust to outliers in the measurements. Because of the rapid growth of the quadratic curve, distant outliers exert an excessive influence and can draw the cost minimum well away from the desired value.
     - The squared-error cost function is generally very susceptible to outliers and may be regarded as unusable as long as outliers are present. If outliers have been thoroughly eradicated, using for instance RANSAC, then it may be used.
  30. Non-convex cost functions
     The Blake-Zisserman, corrupted Gaussian and Cauchy cost functions seek to mitigate the deleterious effect of outliers by giving them diminished weight.
     As is seen in the plots of the first two of these, once the error exceeds a certain threshold it is classified as an outlier and the cost remains substantially constant.
     The Cauchy cost function also seeks to de-emphasize the cost of outliers, but does so more gradually.
  31. Asymptotically linear cost functions
     The L1 cost function measures the absolute value of the error. Its main effect is to give outliers less weight than the squared error does.
     This cost function acts to find the median of a set of data: for a set of real-valued data {a_i} and the cost function C(x) = Σ_i |x − a_i|, the minimum of C is at the median of the set {a_i}.
     For higher-dimensional data a_i ∈ R^n, the minimum of the cost function C(x) = Σ_i ||x − a_i|| has similar stability properties with regard to outliers.
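
A quick numerical check of the median claim, using made-up data containing one outlier:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 100.0])            # data with an outlier at 100
xs = np.linspace(-10.0, 110.0, 100001)               # grid of candidate minimizers

l1_cost = np.abs(xs[:, None] - a[None, :]).sum(axis=1)       # sum_i |x - a_i|
l2_cost = ((xs[:, None] - a[None, :]) ** 2).sum(axis=1)      # sum_i (x - a_i)^2

print(xs[np.argmin(l1_cost)], np.median(a))   # both ~3: the outlier barely moves the L1 minimum
print(xs[np.argmin(l2_cost)], np.mean(a))     # both ~22: the outlier drags the mean far away
```
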
  32. Huber cost function
     The Huber cost function takes the form of a quadratic for small values of the error δ and becomes linear for values of δ beyond a given threshold.
     It retains the outlier stability of the L1 cost function, while for inliers it reflects the property that the squared-error cost function gives the maximum likelihood estimate.
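
A sketch of one common form of the Huber cost; the threshold k = 1 and the exact quadratic/linear blend used here are assumptions, since the slide does not fix a particular parameterisation:

```python
import numpy as np

# Quadratic for |delta| <= k, linear beyond the threshold k (continuous at |delta| = k).
def huber(delta, k=1.0):
    delta = np.asarray(delta, dtype=float)
    quadratic = 0.5 * delta ** 2
    linear = k * (np.abs(delta) - 0.5 * k)
    return np.where(np.abs(delta) <= k, quadratic, linear)

print(huber([0.1, 0.5, 1.0, 5.0, 50.0]))
# small errors are penalised like squared error, large ones only linearly
```
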
  33. Non-convex cost functions
     The non-convex cost functions, though generally having a stable minimum that is not much affected by outliers, have the significant disadvantage of possessing local minima, which can make convergence to the global minimum chancy.
     The estimate is not strongly attracted to the minimum from outside its immediate neighbourhood. Thus they are not useful unless (or until) the estimate is close to the final correct value.
  34. Maximum likelihood method
     Maximum likelihood is the procedure of finding the value of one or more parameters of a given statistical model that makes the likelihood of the observed data a maximum. The maximum likelihood estimate for a parameter μ is denoted μ̂.
     For n independent samples from a Gaussian with mean μ and standard deviation σ,
         f(x_1, x_2, ..., x_n | μ, σ) = Π_{i=1}^{n} (1 / (σ √(2π))) exp(−(x_i − μ)² / (2σ²))
                                      = (2π)^{-n/2} σ^{-n} exp(−Σ_i (x_i − μ)² / (2σ²)).
     Taking the logarithm,
         log f = −(n/2) log(2π) − n log σ − Σ_i (x_i − μ)² / (2σ²).
  35. To maximize the log likelihood:
         ∂(log f)/∂μ = Σ_i (x_i − μ) / σ² = 0,   giving   μ̂ = Σ_i x_i / n.
     Similarly,
         ∂(log f)/∂σ = −n/σ + Σ_i (x_i − μ)² / σ³ = 0,   giving   σ̂² = Σ_i (x_i − μ̂)² / n.
     Minimizing the least-squares cost function therefore gives a result equivalent to the maximum likelihood estimate under a Gaussian noise assumption.
     In general, the maximum likelihood estimate of the parameter vector θ is given as
         θ̂_ML = arg max_θ p(x | θ).
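
A small numerical check of the closed-form estimates above, on synthetic Gaussian samples (the true mean and standard deviation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)      # hypothetical samples

mu_hat = x.sum() / x.size                            # mu_hat = sum(x_i) / n
sigma_hat = np.sqrt(((x - mu_hat) ** 2).sum() / x.size)   # sigma_hat^2 = sum((x_i - mu_hat)^2) / n

print(mu_hat, sigma_hat)                             # close to the true 5.0 and 2.0
```
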
