  • 1. Algorithms booklet, December 10, 2012
  • 2. Copyright © 2012 by Simon Prince. The latest version of this document can be downloaded from http://www.computervisionmodels.com.
  • 3. Algorithms booklet

This document accompanies the book "Computer vision: models, learning, and inference" by Simon J.D. Prince. It contains concise descriptions of almost all of the models and algorithms in the book. The goal is to provide sufficient information to implement a naive version of each method. This information was published separately from the main book because (i) it would have impeded the clarity of the main text and (ii) on-line publishing means that I can update the text periodically and eliminate any mistakes.

In the main, this document uses the same notation as the main book (see Appendix A for a summary). In addition, we also use the following conventions:

• When two matrices are concatenated horizontally, we write C = [A, B].
• When two matrices are concatenated vertically, we write C = [A; B].
• The function argmin_x f[x] returns the value of the argument x that minimizes f[x]. If x is discrete then this should be done by exhaustive search. If x is continuous, then it should be done by gradient descent, and I usually supply the gradient and Hessian of the function to help with this.
• The function δ[x] for discrete x returns 1 when the argument x is 0 and returns 0 otherwise.
• The function diag[A] returns a column vector containing the elements on the diagonal of matrix A.
• The function zeros[I, J] creates an I × J matrix that is full of zeros.

As a final note, I should point out that this document has not yet been checked very carefully. I'm looking for volunteers to help me with this. There are two main ways you can help. First, please mail me at s.prince@cs.ucl.ac.uk if you manage to successfully implement one of these methods. That way I can be sure that the description is sufficient. Secondly, please also mail me if you have problems getting any of these methods to work. It's possible that I can help, and it will help me to identify ambiguities and errors in the descriptions.

Simon Prince
  • 5. List of Algorithms

4.1 Maximum likelihood learning for normal distribution
4.2 MAP learning for normal distribution with conjugate prior
4.3 Bayesian approach to normal distribution
4.4 Maximum likelihood learning for categorical distribution
4.5 MAP learning for categorical distribution with conjugate prior
4.6 Bayesian approach to categorical distribution
6.1 Basic generative classifier
7.1 Maximum likelihood learning for mixtures of Gaussians
7.2 Maximum likelihood learning for t-distribution
7.3 Maximum likelihood learning for factor analyzer
8.1 Maximum likelihood learning for linear regression
8.2 Bayesian formulation of linear regression
8.3 Gaussian process regression
8.4 Sparse linear regression
8.5 Dual formulation of linear regression
8.6 Dual Gaussian process regression
8.7 Relevance vector regression
9.1 Cost and derivatives for MAP logistic regression
9.2 Bayesian logistic regression
9.3 Cost and derivatives for MAP dual logistic regression
9.4 Dual Bayesian logistic regression
9.5 Relevance vector classification
9.6 Incremental logistic regression
9.7 Logitboost
9.8 Cost function, derivative and Hessian for multi-class logistic regression
9.9 Multiclass classification tree
10.1 Gibbs' sampling from undirected model
10.2 Contrastive divergence learning of undirected model
11.1 Dynamic programming in chain
11.2 Dynamic programming in tree
11.3 Forward backward algorithm
11.4 Sum product: distribute
11.4b Sum product: collate and compute marginal distributions
12.1 Binary graph cuts
12.2 Reparameterization for binary graph cut
  • 6. List of Algorithms (continued)

12.3 Multilabel graph cuts
12.4 Alpha expansion algorithm (main loop)
12.4b Alpha expansion (expand)
13.1 Principal components analysis (dual)
13.2 K-means algorithm
14.1 ML learning of extrinsic parameters
14.2 ML learning of intrinsic parameters
14.3 Inferring 3D world position
15.1 Maximum likelihood learning of Euclidean transformation
15.2 Maximum likelihood learning of similarity transformation
15.3 Maximum likelihood learning of affine transformation
15.4 Maximum likelihood learning of projective transformation
15.5 Maximum likelihood inference for transformation models
15.6 ML learning of extrinsic parameters (planar scene)
15.7 ML learning of intrinsic parameters (planar scene)
15.8 Robust ML learning of homography
15.9 Robust sequential learning of homographies
15.10 PEaRL learning of homographies
16.1 Extracting relative camera position from point matches
16.2 Eight point algorithm for fundamental matrix
16.3 Robust ML fitting of fundamental matrix
16.4 Planar rectification
17.1 Generalized Procrustes analysis
17.2 ML learning of PPCA model
18.1 Maximum likelihood learning for identity subspace model
18.2 Maximum likelihood learning for PLDA model
18.3 Maximum likelihood learning for asymmetric bilinear model
18.4 Style translation with asymmetric bilinear model
19.1 The Kalman filter
19.2 Fixed interval Kalman smoother
19.3 The extended Kalman filter
19.4 The iterated extended Kalman filter
19.5 The unscented Kalman filter
19.6 The condensation algorithm
20.1 Learn bag of words model
20.2 Learn latent Dirichlet allocation model
20.2b MCMC Sampling for LDA
  • 7. Fitting probability distributions

Algorithm 4.1: Maximum likelihood learning of normal distribution

The univariate normal distribution is a probability density model suitable for describing continuous data x in one dimension. It has pdf

    Pr(x) = (1/√(2πσ²)) exp[−0.5(x − µ)²/σ²],

where the parameter µ denotes the mean and σ² denotes the variance.

Algorithm 4.1: Maximum likelihood learning for normal distribution
    Input: Training data {x_i}_{i=1}^I
    Output: Maximum likelihood estimates of parameters θ = {µ, σ²}
    begin
        // Set mean parameter
        µ = Σ_{i=1}^I x_i / I
        // Set variance
        σ² = Σ_{i=1}^I (x_i − µ)² / I
    end

Algorithm 4.2: MAP learning of univariate normal parameters

The conjugate prior to the normal distribution is the normal-scaled inverse gamma, which has pdf

    Pr(µ, σ²) = (√γ/(σ√(2π))) (β^α/Γ(α)) (1/σ²)^(α+1) exp[−(2β + γ(δ − µ)²)/(2σ²)],

with hyperparameters α, β, γ > 0 and δ ∈ (−∞, ∞).

Algorithm 4.2: MAP learning for normal distribution with conjugate prior
    Input: Training data {x_i}_{i=1}^I, hyperparameters α, β, γ, δ
    Output: MAP estimates of parameters θ = {µ, σ²}
    begin
        // Set mean parameter
        µ = (Σ_{i=1}^I x_i + γδ)/(I + γ)
        // Set variance
        σ² = (Σ_{i=1}^I (x_i − µ)² + 2β + γ(δ − µ)²)/(I + 3 + 2α)
    end
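As a rough NumPy sketch of Algorithms 4.1 and 4.2 (the function and variable names are mine, not the booklet's):

```python
import numpy as np

def fit_normal_ml(x):
    """Algorithm 4.1: maximum likelihood mean and variance of 1D data."""
    mu = np.mean(x)
    var = np.mean((x - mu) ** 2)   # ML variance divides by I, not I-1
    return mu, var

def fit_normal_map(x, alpha, beta, gamma, delta):
    """Algorithm 4.2: MAP estimates under a normal-scaled inverse gamma prior."""
    I = len(x)
    mu = (np.sum(x) + gamma * delta) / (I + gamma)
    var = (np.sum((x - mu) ** 2) + 2 * beta + gamma * (delta - mu) ** 2) \
          / (I + 3 + 2 * alpha)
    return mu, var
```

With a large dataset, the MAP estimate is dominated by the data and is close to the ML estimate.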
  • 8. Fitting probability distributions

Algorithm 4.3: Bayesian approach to univariate normal distribution

In the Bayesian approach to fitting the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a normal inverse gamma distribution over the mean and variance parameters. The predictive distribution for a new datum is computed by integrating the predictions for a given set of parameters weighted by the probability of those parameters being present.

Algorithm 4.3: Bayesian approach to normal distribution
    Input: Training data {x_i}_{i=1}^I, hyperparameters α, β, γ, δ, test data x*
    Output: Posterior parameters {α̃, β̃, γ̃, δ̃}, predictive distribution Pr(x*|x_1...I)
    begin
        // Compute normal inverse gamma posterior over normal parameters
        α̃ = α + I/2
        β̃ = Σ_i x_i²/2 + β + γδ²/2 − (γδ + Σ_i x_i)²/(2γ + 2I)
        γ̃ = γ + I
        δ̃ = (γδ + Σ_i x_i)/(γ + I)
        // Compute intermediate parameters
        ᾰ = α̃ + 1/2
        β̆ = x*²/2 + β̃ + γ̃δ̃²/2 − (γ̃δ̃ + x*)²/(2γ̃ + 2)
        γ̆ = γ̃ + 1
        // Evaluate new datapoint under predictive distribution
        Pr(x*|x_1...I) = (√γ̃ β̃^α̃ Γ[ᾰ]) / (√(2π) √γ̆ β̆^ᾰ Γ[α̃])
    end

Algorithm 4.4: ML learning of categorical parameters

The categorical distribution is a probability model suitable for describing discrete multi-valued data x ∈ {1, 2, ..., K}. It has

    Pr(x = k) = λ_k,

where the parameter λ_k denotes the probability of observing category k.

Algorithm 4.4: Maximum likelihood learning for categorical distribution
    Input: Multi-valued training data {x_i}_{i=1}^I
    Output: ML estimates of categorical parameters θ = {λ_1 ... λ_K}
    begin
        for k = 1 to K do
            λ_k = Σ_{i=1}^I δ[x_i − k] / I
        end
    end
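A sketch of the Algorithm 4.3 predictive density, computed in log space for numerical safety (names are mine; the booklet works directly with gamma functions):

```python
import numpy as np
from math import lgamma, log, sqrt, pi, exp

def normal_bayes_predictive(x, x_star, alpha, beta, gamma, delta):
    """Algorithm 4.3: predictive density Pr(x*|x_1..I) for the univariate normal."""
    I, s = len(x), np.sum(x)
    # posterior normal inverse gamma hyperparameters
    alpha_t = alpha + I / 2
    beta_t = np.sum(x**2)/2 + beta + gamma*delta**2/2 \
             - (gamma*delta + s)**2 / (2*gamma + 2*I)
    gamma_t = gamma + I
    delta_t = (gamma*delta + s) / (gamma + I)
    # intermediate parameters incorporating the test point
    alpha_b = alpha_t + 0.5
    beta_b = x_star**2/2 + beta_t + gamma_t*delta_t**2/2 \
             - (gamma_t*delta_t + x_star)**2 / (2*gamma_t + 2)
    gamma_b = gamma_t + 1
    # ratio of normalizers, evaluated as exp of a log expression
    logp = 0.5*log(gamma_t) + alpha_t*log(beta_t) + lgamma(alpha_b) \
           - 0.5*log(2*pi) - 0.5*log(gamma_b) - alpha_b*log(beta_b) - lgamma(alpha_t)
    return exp(logp)
```

A useful sanity check is that the predictive density integrates to one over the real line.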
  • 9. Fitting probability distributions

Algorithm 4.5: MAP learning of categorical parameters

For MAP learning of the categorical parameters, we need to define a prior, and to this end we choose the Dirichlet distribution:

    Pr(λ_1 ... λ_K) = (Γ[Σ_{k=1}^K α_k] / Π_{k=1}^K Γ[α_k]) Π_{k=1}^K λ_k^(α_k − 1),

where Γ[•] is the gamma function and {α_k}_{k=1}^K are hyperparameters.

Algorithm 4.5: MAP learning for categorical distribution with conjugate prior
    Input: Categorical training data {x_i}_{i=1}^I, hyperparameters {α_k}_{k=1}^K
    Output: MAP estimates of parameters θ = {λ_k}_{k=1}^K
    begin
        for k = 1 to K do
            N_k = Σ_{i=1}^I δ[x_i − k]
            λ_k = (N_k − 1 + α_k)/(I − K + Σ_{k=1}^K α_k)
        end
    end

Algorithm 4.6: Bayesian approach to categorical distribution

In the Bayesian approach to fitting the categorical distribution we again use a Dirichlet prior. In the learning stage we compute a probability distribution over the K categorical parameters, which is also a Dirichlet distribution. The predictive distribution for a new datum is based on a weighted sum of the predictions for all possible parameter values, where the weights used are based on the Dirichlet distribution computed in the learning stage.

Algorithm 4.6: Bayesian approach to categorical distribution
    Input: Categorical training data {x_i}_{i=1}^I, hyperparameters {α_k}_{k=1}^K
    Output: Posterior parameters {α̃_k}_{k=1}^K, predictive distribution Pr(x*|x_1...I)
    begin
        // Compute categorical posterior over λ
        for k = 1 to K do
            α̃_k = α_k + Σ_{i=1}^I δ[x_i − k]
        end
        // Evaluate new datapoint under predictive distribution
        for k = 1 to K do
            Pr(x* = k|x_1...I) = α̃_k / (Σ_{m=1}^K α̃_m)
        end
    end
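The two categorical fits above can be sketched as follows (function names are mine; labels are assumed to take values 1..K as in the booklet):

```python
import numpy as np

def categorical_map(x, alphas):
    """Algorithm 4.5: MAP category probabilities under a Dirichlet prior."""
    K, I = len(alphas), len(x)
    counts = np.array([np.sum(x == k) for k in range(1, K + 1)])
    return (counts - 1 + alphas) / (I - K + np.sum(alphas))

def categorical_bayes(x, alphas):
    """Algorithm 4.6: Dirichlet posterior parameters and predictive distribution."""
    K = len(alphas)
    counts = np.array([np.sum(x == k) for k in range(1, K + 1)])
    alphas_post = alphas + counts
    return alphas_post, alphas_post / np.sum(alphas_post)
```

Note that with a uniform prior α_k = 1 the MAP estimate reduces to the ML estimate N_k/I.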
  • 10. Learning and inference in vision

Algorithm 6.1: Basic generative classifier

Consider the situation where we wish to assign a label w ∈ {1, 2, ..., K} based on an observed multivariate measurement vector x. We model the class-conditional density functions as normal distributions so that

    Pr(x_i|w_i = k) = Norm_{x_i}[µ_k, Σ_k],

with prior probabilities over the world state defined by Pr(w_i) = Cat_{w_i}[λ].

In the learning phase, we fit the parameters µ_k and Σ_k of the k-th class-conditional density function Pr(x_i|w_i = k) from just the subset of data S_k = {x_i : w_i = k} where the k-th state was observed. We learn the prior parameter λ from the training world states {w_i}_{i=1}^I. Here we have used the maximum likelihood approach in both cases.

The inference algorithm takes a new datum x* and returns the posterior Pr(w*|x*, θ) over the world state w* using Bayes' rule:

    Pr(w* = k|x*) = Pr(x*|w* = k) Pr(w* = k) / Σ_{m=1}^K Pr(x*|w* = m) Pr(w* = m).

Algorithm 6.1: Basic generative classifier
    Input: Training data {x_i, w_i}_{i=1}^I, new data example x*
    Output: ML parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}, posterior probability Pr(w*|x*)
    begin
        // For each training class
        for k = 1 to K do
            // Set mean
            µ_k = (Σ_{i=1}^I x_i δ[w_i − k]) / (Σ_{i=1}^I δ[w_i − k])
            // Set covariance
            Σ_k = (Σ_{i=1}^I (x_i − µ_k)(x_i − µ_k)ᵀ δ[w_i − k]) / (Σ_{i=1}^I δ[w_i − k])
            // Set prior
            λ_k = Σ_{i=1}^I δ[w_i − k] / I
        end
        // Compute likelihoods for each class for the new datapoint
        for k = 1 to K do
            l_k = Norm_{x*}[µ_k, Σ_k]
        end
        // Classify new datapoint using Bayes' rule
        for k = 1 to K do
            Pr(w* = k|x*) = l_k λ_k / (Σ_{m=1}^K l_m λ_m)
        end
    end
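A minimal sketch of Algorithm 6.1 (names are my own; labels assumed in 1..K):

```python
import numpy as np

def fit_generative_classifier(X, w, K):
    """Algorithm 6.1, learning: per-class means, covariances and priors.
    X is I x D; w holds labels in 1..K."""
    mus, Sigmas, lambdas = [], [], []
    for k in range(1, K + 1):
        Xk = X[w == k]
        mu = Xk.mean(axis=0)
        diff = Xk - mu
        mus.append(mu)
        Sigmas.append(diff.T @ diff / len(Xk))   # ML covariance (divide by count)
        lambdas.append(len(Xk) / len(X))
    return np.array(mus), np.array(Sigmas), np.array(lambdas)

def classify(x_star, mus, Sigmas, lambdas):
    """Algorithm 6.1, inference: posterior over classes by Bayes' rule."""
    K, D = mus.shape
    lik = np.empty(K)
    for k in range(K):
        diff = x_star - mus[k]
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigmas[k]))
        lik[k] = norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigmas[k], diff))
    post = lik * lambdas
    return post / post.sum()
```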
  • 11. Modelling complex densities

Algorithm 7.1: Fitting mixture of Gaussians

The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data is described as a weighted sum of K normal distributions

    Pr(x|θ) = Σ_{k=1}^K λ_k Norm_x[µ_k, Σ_k],

where µ_{1...K} and Σ_{1...K} are the means and covariances of the normal distributions and λ_{1...K} are positive valued weights that sum to one.

The MoG is fit using the EM algorithm. In the E-step, we compute the posterior distribution over a hidden variable h_i for each observed data point x_i. In the M-step, we iterate through the K components, updating the mean µ_k and covariance Σ_k for each, and also update the weights {λ_k}_{k=1}^K.

Algorithm 7.1: Maximum likelihood learning for mixtures of Gaussians
    Input: Training data {x_i}_{i=1}^I, number of clusters K
    Output: ML estimates of parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}
    begin
        Initialize θ = θ⁰  (a)
        repeat
            // Expectation step
            for i = 1 to I do
                for k = 1 to K do
                    l_ik = λ_k Norm_{x_i}[µ_k, Σ_k]    // numerator of Bayes' rule
                end
                // Compute posterior (responsibilities) by normalizing
                for k = 1 to K do
                    r_ik = l_ik / (Σ_{m=1}^K l_im)
                end
            end
            // Maximization step  (b)
            for k = 1 to K do
                λ_k^[t+1] = (Σ_{i=1}^I r_ik) / (Σ_{m=1}^K Σ_{i=1}^I r_im)
                µ_k^[t+1] = (Σ_{i=1}^I r_ik x_i) / (Σ_{i=1}^I r_ik)
                Σ_k^[t+1] = (Σ_{i=1}^I r_ik (x_i − µ_k^[t+1])(x_i − µ_k^[t+1])ᵀ) / (Σ_{i=1}^I r_ik)
            end
            // Compute data log likelihood and EM bound
            L = Σ_{i=1}^I log[Σ_{k=1}^K λ_k Norm_{x_i}[µ_k, Σ_k]]
            B = Σ_{i=1}^I Σ_{k=1}^K r_ik log[λ_k Norm_{x_i}[µ_k, Σ_k] / r_ik]
        until no further improvement in L
    end

    (a) One possibility is to set the weights λ• = 1/K, the means µ• to the values of K randomly chosen datapoints, and the covariances Σ• to the covariance of the whole dataset.
    (b) For a diagonal covariance, retain only the diagonal of the Σ_k update.
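A compact EM sketch for Algorithm 7.1. For reproducibility this version initializes the means at evenly spaced datapoints rather than the randomly chosen ones the booklet suggests (an assumption); names are my own:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_mog(X, K, n_iter=100):
    """Algorithm 7.1: EM for a mixture of Gaussians. X is I x D."""
    I, D = X.shape
    lam = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, I - 1, K).astype(int)].copy()     # spread-out datapoints
    Sigma = np.array([np.cov(X.T) + 1e-6*np.eye(D)] * K)    # data covariance
    for _ in range(n_iter):
        # E-step: responsibilities r_ik
        lik = np.column_stack([lam[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                               for k in range(K)])
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weights, means, covariances
        Nk = r.sum(axis=0)
        lam = Nk / I
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6*np.eye(D)
    return lam, mu, Sigma
```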
  • 12. Modelling complex densities

Algorithm 7.2: Fitting the t-distribution

The t-distribution is a robust (long-tailed) distribution with pdf

    Pr(x) = (Γ[(ν + D)/2] / ((νπ)^(D/2) |Σ|^(1/2) Γ[ν/2])) [1 + (x − µ)ᵀΣ⁻¹(x − µ)/ν]^(−(ν+D)/2),

where µ is the mean of the distribution, Σ is a matrix that controls the spread, ν is the degrees of freedom, and D is the dimensionality of the input data.

We use the EM algorithm to fit the parameters θ = {µ, Σ, ν}. In the E-step, we compute the gamma-distributed posterior over the hidden variable h_i for each observed data point x_i. In the M-step we update the parameters µ and Σ in closed form, but must perform an explicit line search to update ν using the criterion

    tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I] = Σ_{i=1}^I ( −(ν/2) log[ν/2] + log Γ[ν/2] − (ν/2 − 1) E[log h_i] + (ν/2) E[h_i] ).

Algorithm 7.2: Maximum likelihood learning for t-distribution
    Input: Training data {x_i}_{i=1}^I
    Output: Maximum likelihood estimates of parameters θ = {µ, Σ, ν}
    begin
        Initialize θ = θ⁰  (a)
        repeat
            // Expectation step
            for i = 1 to I do
                δ_i = (x_i − µ)ᵀΣ⁻¹(x_i − µ)
                E[h_i] = (ν + D)/(ν + δ_i)
                E[log h_i] = Ψ[ν/2 + D/2] − log[ν/2 + δ_i/2]
            end
            // Maximization step
            µ = (Σ_{i=1}^I E[h_i] x_i) / (Σ_{i=1}^I E[h_i])
            Σ = (Σ_{i=1}^I E[h_i](x_i − µ)(x_i − µ)ᵀ) / (Σ_{i=1}^I E[h_i])
            ν = argmin_ν tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I]
            // Compute data log likelihood
            for i = 1 to I do
                δ_i = (x_i − µ)ᵀΣ⁻¹(x_i − µ)
            end
            L = I log Γ[(ν + D)/2] − ID log[νπ]/2 − I log|Σ|/2 − I log Γ[ν/2]
            L = L − ((ν + D)/2) Σ_{i=1}^I log[1 + δ_i/ν]
        until no further improvement in L
    end

    (a) One possibility is to initialize the parameters µ and Σ to the mean and covariance of the data and set the initial degrees of freedom to a large value, say ν = 1000.
  • 13. Modelling complex densities

Algorithm 7.3: Fitting a factor analyzer

The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

    Pr(x|θ) = Norm_x[µ, ΦΦᵀ + Σ],

where µ is a D × 1 mean vector, Φ is a D × K matrix containing the K factors {φ_k}_{k=1}^K in its columns, and Σ is a diagonal matrix of size D × D.

The factor analyzer is fit using the EM algorithm. In the E-step, we compute the posterior distribution over the hidden variable h_i for each data example x_i and extract the expectations E[h_i] and E[h_i h_iᵀ]. In the M-step, we use these distributions in closed-form updates for the basis function matrix Φ and the diagonal noise term Σ.

Algorithm 7.3: Maximum likelihood learning for factor analyzer
    Input: Training data {x_i}_{i=1}^I, number of factors K
    Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
    begin
        Initialize θ = θ⁰  (a)
        // Set mean
        µ = Σ_{i=1}^I x_i / I
        repeat
            // Expectation step
            for i = 1 to I do
                E[h_i] = (ΦᵀΣ⁻¹Φ + I)⁻¹ΦᵀΣ⁻¹(x_i − µ)
                E[h_i h_iᵀ] = (ΦᵀΣ⁻¹Φ + I)⁻¹ + E[h_i]E[h_i]ᵀ
            end
            // Maximization step
            Φ = (Σ_{i=1}^I (x_i − µ)E[h_i]ᵀ)(Σ_{i=1}^I E[h_i h_iᵀ])⁻¹
            Σ = diag[Σ_{i=1}^I ((x_i − µ)(x_i − µ)ᵀ − ΦE[h_i](x_i − µ)ᵀ)] / I
            // Compute data log likelihood  (b)
            L = Σ_{i=1}^I log Norm_{x_i}[µ, ΦΦᵀ + Σ]
        until no further improvement in L
    end

    (a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
    (b) In high dimensions it is worth reformulating the covariance of this likelihood using the Sherman-Morrison-Woodbury relation (matrix inversion lemma).
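A vectorized sketch of the Algorithm 7.3 EM updates, with the diagonal Σ stored as a length-D vector (names are mine):

```python
import numpy as np

def fit_factor_analyzer(X, K, n_iter=300, seed=0):
    """Algorithm 7.3: EM for a factor analyzer. X is I x D, K factors."""
    rng = np.random.default_rng(seed)
    I, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    Phi = rng.normal(size=(D, K))            # random initialization
    Sigma = np.var(Xc, axis=0)               # diagonal noise, stored as a vector
    for _ in range(n_iter):
        # E-step: moments of the hidden variables
        SinvPhi = Phi / Sigma[:, None]
        M = np.linalg.inv(Phi.T @ SinvPhi + np.eye(K))   # (Phi' Sigma^-1 Phi + I)^-1
        Eh = Xc @ SinvPhi @ M                            # rows are E[h_i]
        sum_EhhT = I * M + Eh.T @ Eh                     # sum of E[h_i h_i']
        # M-step: closed-form updates
        Phi = (Xc.T @ Eh) @ np.linalg.inv(sum_EhhT)
        Sigma = np.maximum(np.mean(Xc**2 - Xc * (Eh @ Phi.T), axis=0), 1e-6)
    return mu, Phi, Sigma
```

With the correct number of factors, ΦΦᵀ + Σ should approach the sample covariance of the data.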
  • 14. Models for regression

Algorithm 8.1: ML fitting of linear regression model

The linear regression model describes the world w as a normal distribution. The mean of this distribution is a linear function φ₀ + φᵀx of the data, and the variance is constant. In practice we add a 1 to the start of every data vector x_i ← [1, x_iᵀ]ᵀ and attach the y-intercept φ₀ to the start of the gradient vector φ ← [φ₀, φᵀ]ᵀ, and write

    Pr(w_i|x_i, θ) = Norm_{w_i}[φᵀx_i, σ²].

In the learning algorithm, we work with the matrix X = [x_1, x_2, ..., x_I], which contains all of the training data examples in its columns, and the world vector w = [w_1, w_2, ..., w_I]ᵀ, which contains the training world states.

Algorithm 8.1: Maximum likelihood learning for linear regression
    Input: (D + 1)×I data matrix X, I×1 world vector w
    Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
    begin
        // Set gradient parameter
        φ = (XXᵀ)⁻¹Xw
        // Set variance parameter
        σ² = (w − Xᵀφ)ᵀ(w − Xᵀφ)/I
    end
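Algorithm 8.1 is a few lines of NumPy (names mine; `np.linalg.solve` replaces the explicit inverse for numerical stability):

```python
import numpy as np

def fit_linear_regression(X, w):
    """Algorithm 8.1: ML fit. X is (D+1) x I with a row of ones prepended."""
    phi = np.linalg.solve(X @ X.T, X @ w)      # (X X')^-1 X w
    resid = w - X.T @ phi
    sigma2 = resid @ resid / len(w)
    return phi, sigma2
```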
  • 15. Models for regression

Algorithm 8.2: Bayesian linear regression

In Bayesian linear regression we define a normal prior over the parameters φ,

    Pr(φ) = Norm_φ[0, σp²I],

which contains one hyperparameter σp² determining the prior variance. We compute a distribution over possible parameters φ and use this to evaluate the mean µ_{w*|x*} and variance σ²_{w*|x*} of the predictive distribution for new data x*.

As in the previous algorithm, we add a 1 to the start of every data vector x_i ← [1, x_iᵀ]ᵀ and then work with the matrix X = [x_1, x_2, ..., x_I], which contains all of the training data examples in its columns, and the world vector w = [w_1, w_2, ..., w_I]ᵀ, which contains the training world states.

The choice of approach depends on whether the number of data examples I is greater or less than the dimensionality D of the data. Depending on which situation we are in, we arrange to invert either the (D + 1) × (D + 1) matrix XXᵀ or the I × I matrix XᵀX.

Algorithm 8.2: Bayesian formulation of linear regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, hyperparameter σp²
    Output: Distribution Pr(w*|x*) over world given new data example x*
    begin
        // Fit variance parameter σ² with line search  (a)
        σ² = argmin_{σ²} −log[Norm_w[0, σp²XᵀX + σ²I]]
        // If dimension D is less than the number of data examples I
        if D < I then
            // Compute inverse variance of posterior distribution over φ
            A⁻¹ = (XXᵀ/σ² + I/σp²)⁻¹
        else
            // Compute inverse variance of posterior distribution over φ
            A⁻¹ = σp²I − σp²X(XᵀX + (σ²/σp²)I)⁻¹Xᵀ
        end
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = x*ᵀA⁻¹Xw/σ²
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = x*ᵀA⁻¹x* + σ²
    end

    (a) To compute this cost function when the dimension D < I, we need to compute both the inverse and determinant of the covariance matrix. It is inefficient to implement this directly as the covariance is I × I. To compute the inverse, the covariance should be reformulated using the matrix inversion lemma, and the determinant calculated using the matrix determinant lemma.
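A sketch of the D < I branch of Algorithm 8.2, with the line search over σ² done on a log scale via a bounded scalar minimizer (names and the direct I × I marginal-likelihood evaluation are simplifications of mine; the booklet's footnote suggests the lemma-based reformulation instead):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bayesian_linear_regression(X, w, x_star, sigma_p2=1.0):
    """Algorithm 8.2 (D < I branch): predictive mean and variance at x_star."""
    D1, I = X.shape
    def neg_log_marginal(log_s2):
        s2 = np.exp(log_s2)
        C = sigma_p2 * X.T @ X + s2 * np.eye(I)
        _, logdet = np.linalg.slogdet(C)
        return 0.5 * (logdet + w @ np.linalg.solve(C, w) + I * np.log(2*np.pi))
    s2 = np.exp(minimize_scalar(neg_log_marginal, bounds=(-10, 5),
                                method="bounded").x)
    A_inv = np.linalg.inv(X @ X.T / s2 + np.eye(D1) / sigma_p2)
    mu_star = x_star @ A_inv @ X @ w / s2
    var_star = x_star @ A_inv @ x_star + s2
    return mu_star, var_star
```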
  • 16. Models for regression

Algorithm 8.3: Gaussian process regression

To compute a non-linear fit to a set of data, we first transform the data x by a non-linear function f[•] to create a new variable z_i = f[x_i]. We then proceed as normal with the Bayesian approach, but using the transformed data.

In practice, we exploit the fact that the Bayesian non-linear regression fitting and prediction algorithms can be described in terms of inner products zᵀz of the transformed data. We hence directly define a single kernel function k[x_i, x_j] as a replacement for the operation f[x_i]ᵀf[x_j]. For many transformations f[•] it is more efficient to evaluate the kernel function directly than to transform the variables separately and then compute the dot product. It is further possible to choose kernel functions that correspond to projection to very high or even infinite dimensional spaces without ever having to explicitly compute this transformation.

As usual we add a 1 to the start of every data vector x_i ← [1, x_iᵀ]ᵀ and then work with the matrix X = [x_1, x_2, ..., x_I], which contains all of the training data examples in its columns, and the world vector w = [w_1, w_2, ..., w_I]ᵀ, which contains the training world states. In this algorithm, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.

Algorithm 8.3: Gaussian process regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, hyperparameter σp²
    Output: Normal distribution Pr(w*|x*) over world given new data example x*
    begin
        // Fit variance parameter σ² with line search
        σ² = argmin_{σ²} −log[Norm_w[0, σp²K[X, X] + σ²I]]
        // Compute inverse term
        A⁻¹ = (K[X, X] + (σ²/σp²)I)⁻¹
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = (σp²/σ²)K[x*, X]w − (σp²/σ²)K[x*, X]A⁻¹K[X, X]w
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = σp²K[x*, x*] − σp²K[x*, X]A⁻¹K[X, x*] + σ²
    end
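A sketch of Algorithm 8.3 using an RBF kernel as the example kernel function (the kernel choice, its length scale, and all names are assumptions of mine, not the booklet's):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rbf_kernel(A, B, ell=1.0):
    """Example kernel: K[A, B] over the columns of A and B."""
    d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-0.5 * d2 / ell**2)

def gp_regression(X, w, x_star, sigma_p2=1.0):
    """Algorithm 8.3: Gaussian process prediction at a single point x_star."""
    I = X.shape[1]
    K = rbf_kernel(X, X)
    def neg_log_marginal(log_s2):
        s2 = np.exp(log_s2)
        C = sigma_p2 * K + s2 * np.eye(I)
        _, logdet = np.linalg.slogdet(C)
        return 0.5 * (logdet + w @ np.linalg.solve(C, w) + I * np.log(2*np.pi))
    s2 = np.exp(minimize_scalar(neg_log_marginal, bounds=(-10, 5),
                                method="bounded").x)
    A_inv = np.linalg.inv(K + (s2 / sigma_p2) * np.eye(I))
    k_star = rbf_kernel(x_star[:, None], X)[0]        # K[x*, X] as a vector
    mu = (sigma_p2/s2) * (k_star @ w - k_star @ A_inv @ K @ w)
    var = sigma_p2 * 1.0 - sigma_p2 * k_star @ A_inv @ k_star + s2  # K[x*,x*]=1 for RBF
    return mu, var
```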
  • 17. Models for regression

Algorithm 8.4: Sparse linear regression

In the sparse linear regression model, we replace the normal prior over the parameters with a prior that is a product of t-distributions. This favours solutions where most of the regression parameters are effectively zero. In practice, the t-distribution corresponding to the d-th dimension of the data is represented as a marginalization of a joint distribution with a hidden variable h_d.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.

Algorithm 8.4: Sparse linear regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, degrees of freedom ν
    Output: Distribution Pr(w*|x*) over world given new data example x*
    begin
        // Initialize variables
        H = diag[1, 1, ..., 1]
        repeat
            // Maximize marginal likelihood w.r.t. variance parameter
            σ² = argmin_{σ²} −log[Norm_w[0, XᵀH⁻¹X + σ²I]]
            // Maximize marginal likelihood w.r.t. relevance parameters H
            Σ = (XXᵀ/σ² + H)⁻¹
            µ = ΣXw/σ²
            // For each dimension except the first (the constant)
            for d = 2 to D + 1 do
                // Update the diagonal entry of H
                h_dd = (1 − h_dd Σ_dd + ν)/(µ_d² + ν)
            end
        until no further improvement
        // Remove columns of X, rows of w, and rows and columns of H where the value h_dd on the diagonal of H is large
        [H, X, w] = prune[H, X, w]
        // Compute variance of posterior over φ
        A⁻¹ = H⁻¹ − H⁻¹X(XᵀH⁻¹X + σ²I)⁻¹XᵀH⁻¹
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = x*ᵀA⁻¹Xw/σ²
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = x*ᵀA⁻¹x* + σ²
    end
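A simplified sketch of the iterative core of Algorithm 8.4. To keep it short, the noise variance is supplied rather than re-fit by line search each iteration, and the prediction step is omitted; these simplifications and all names are my own:

```python
import numpy as np

def sparse_linear_regression(X, w, sigma2, nu=1e-3, n_iter=50):
    """Algorithm 8.4 core loop with sigma^2 held fixed (an assumption)."""
    D1, I = X.shape
    h = np.ones(D1)                    # diagonal of H
    for _ in range(n_iter):
        Sigma = np.linalg.inv(X @ X.T / sigma2 + np.diag(h))
        mu = Sigma @ X @ w / sigma2
        for d in range(1, D1):         # skip the constant dimension
            h[d] = (1 - h[d]*Sigma[d, d] + nu) / (mu[d]**2 + nu)
    keep = h < 1000.0                  # pruning criterion from the text
    return mu, h, keep
```

Irrelevant dimensions should end up with much larger h than relevant ones.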
  • 18. Models for regression

Algorithm 8.5: Dual Bayesian linear regression

In dual linear regression, we formulate the weight vector as a sum of the observed data examples X, so that φ = Xψ, and then solve for the dual parameters ψ. To this end we place a normally distributed prior on ψ with spherical covariance σp²I.

Algorithm 8.5: Dual formulation of linear regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, hyperparameter σp²
    Output: Distribution Pr(w*|x*) over world given new data example x*
    begin
        // Fit variance parameter σ² with line search
        σ² = argmin_{σ²} −log[Norm_w[0, σp²XᵀXXᵀX + σ²I]]
        // Compute inverse variance of posterior over ψ
        A = XᵀXXᵀX/σ² + I/σp²
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = x*ᵀXA⁻¹XᵀXw/σ²
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = x*ᵀXA⁻¹Xᵀx* + σ²
    end

Algorithm 8.6: Dual Gaussian process regression

The dual algorithm relies only on inner products of the form xᵀx and so can be kernelized to form a non-linear regression method. As previously, we use the notation K[A, B] to denote the D_A × D_B matrix containing all of the inner products of the D_A columns of A with the D_B columns of B.

Algorithm 8.6: Dual Gaussian process regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, hyperparameter σp², kernel function K[•, •]
    Output: Distribution Pr(w*|x*) over world given new data example x*
    begin
        // Fit variance parameter σ² with line search
        σ² = argmin_{σ²} −log[Norm_w[0, σp²K[X, X]K[X, X] + σ²I]]
        // Compute inverse term
        A = K[X, X]K[X, X]/σ² + I/σp²
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = K[x*, X]A⁻¹K[X, X]w/σ²
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = K[x*, X]A⁻¹K[X, x*] + σ²
    end
  • 19. Models for regression

Algorithm 8.7: Relevance vector regression

Relevance vector regression is simply sparse linear regression applied in the dual situation; we encourage the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables h_i which control the tendency of each parameter to be zero.

The algorithm is iterative and alternates between updating the hidden variables in closed form and performing a line search for the noise parameter σ². After the system has converged, we prune the model to remove dimensions where the hidden variable was large (>1000 is a reasonable criterion); these dimensions contribute very little to the final prediction.

Algorithm 8.7: Relevance vector regression
    Input: (D + 1)×I data matrix X, I×1 world vector w, kernel K[•, •], degrees of freedom ν
    Output: Distribution Pr(w*|x*) over world given new data example x*
    begin
        // Initialize variables
        H = diag[1, 1, ..., 1]
        repeat
            // Maximize marginal likelihood w.r.t. variance parameter σ²
            σ² = argmin_{σ²} −log[Norm_w[0, K[X, X]H⁻¹K[X, X] + σ²I]]
            // Maximize marginal likelihood w.r.t. relevance parameters H
            Σ = (K[X, X]K[X, X]/σ² + H)⁻¹
            µ = ΣK[X, X]w/σ²
            // For each dual parameter
            for i = 1 to I do
                // Update the diagonal entry of H
                h_ii = (1 − h_ii Σ_ii + ν)/(µ_i² + ν)
            end
        until no further improvement
        // Remove columns of X, rows of w, and rows and columns of H where h_ii is large
        [H, X, w] = prune[H, X, w]
        // Compute inverse term
        A = K[X, X]K[X, X]/σ² + H
        // Compute mean of prediction for new example x*
        µ_{w*|x*} = K[x*, X]A⁻¹K[X, X]w/σ²
        // Compute variance of prediction for new example x*
        σ²_{w*|x*} = K[x*, X]A⁻¹K[X, x*] + σ²
    end
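A sketch of Algorithm 8.7 with an RBF kernel, a fixed noise variance in place of the line search, and pruning omitted (all three simplifications, plus the names, are mine):

```python
import numpy as np

def relevance_vector_regression(X, w, sigma2, nu=1e-3, n_iter=50, ell=1.0):
    """Algorithm 8.7 core loop; RBF kernel assumed, sigma^2 held fixed."""
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    K = np.exp(-0.5 * d2 / ell**2)                   # K[X, X]
    I = X.shape[1]
    h = np.ones(I)                                   # diagonal of H
    for _ in range(n_iter):
        Sigma = np.linalg.inv(K @ K / sigma2 + np.diag(h))
        mu = Sigma @ K @ w / sigma2
        h = (1 - h * np.diag(Sigma) + nu) / (mu**2 + nu)
    def predict(x_star):
        k_star = np.exp(-0.5 * ((X - x_star[:, None]) ** 2).sum(axis=0) / ell**2)
        A = K @ K / sigma2 + np.diag(h)
        mu_star = k_star @ np.linalg.solve(A, K @ w) / sigma2
        var_star = k_star @ np.linalg.solve(A, k_star) + sigma2
        return mu_star, var_star
    return mu, h, predict
```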
• 20. 20 Models for classification Algorithm 9.1: MAP Logistic regression
The logistic regression model is defined as

P r(w|x, φ) = Bernw [1/(1 + exp[−φT x])],

where, as usual, we have prepended a 1 to each data example xi. We now perform a non-linear minimization of the negative log binomial probability with respect to the parameter vector φ:

φ̂ = argminφ − Σi=1..I log Bernwi [1/(1 + exp[−φT xi])] − log Normφ[0, σp² I],

where we have also added a prior over the parameters φ. The MAP solution is superior to the maximum likelihood approach in that it encourages the function to be smooth even when the classes are completely separable. A typical approach would be to use a second-order optimization method such as the Newton method (e.g., using Matlab's fminunc function). The optimization method will need to compute the cost function and its derivative and Hessian with respect to the parameter φ.

Algorithm 9.1: Cost and derivatives for MAP logistic regression
Input : Binary world states {wi}i=1..I, observed data {xi}i=1..I, parameters φ
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian
  L = (D + 1) log[2πσp²]/2 + φTφ/(2σp²)
  g = φ/σp²
  H = 1/σp²
  // For each data point
  for i=1 to I do
    // Compute prediction yi
    yi = 1/(1 + exp[−φT xi])
    // Add term to log likelihood
    if wi == 1 then L = L − log[yi] else L = L − log[1 − yi] end
    // Add term to gradient
    g = g + (yi − wi)xi
    // Add term to Hessian
    H = H + yi(1 − yi)xi xiT
  end
end
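The cost, gradient and Hessian of algorithm 9.1 can be sketched in vectorized form. This is an illustrative version, not the booklet's code: the constant normalizer of the prior is dropped (it does not affect the optimization), columns of `X` are prepended-1 examples, and the function name is my own.

```python
import numpy as np

def logreg_map_cost(phi, X, w, sig2_p=1.0):
    """Negative log posterior, gradient and Hessian for MAP logistic regression.
    X: (D+1) x I data matrix (columns are examples); w: 0/1 labels."""
    D1, I = X.shape
    y = 1.0 / (1.0 + np.exp(-phi @ X))                 # predictions y_i
    L = -np.sum(w * np.log(y) + (1 - w) * np.log(1 - y)) \
        + phi @ phi / (2 * sig2_p)                     # prior term, constants dropped
    g = X @ (y - w) + phi / sig2_p                     # gradient
    H = (X * (y * (1 - y))) @ X.T + np.eye(D1) / sig2_p  # Hessian
    return L, g, H
```

A finite-difference check of the gradient against the cost is a quick way to validate such an implementation before handing it to a Newton-style optimizer.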
• 21. Models for classification 21 Algorithm 9.2: Bayesian logistic regression
In Bayesian logistic regression, we aim to compute the predictive distribution P r(w∗|x∗) over the binary world state w∗ for a new data example x∗. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ∗ = P r(w∗ = 1|x∗). The method works by first finding the MAP solution (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution. As usual, we assume that we have prepended a one to every data vector, xi ← [1, xiT]T, to model the offset parameter elegantly.

Algorithm 9.2: Bayesian logistic regression
Input : Binary world states {wi}i=1..I, observed data {xi}i=1..I, new data x∗
Output: Predictive distribution P r(w∗|x∗)
begin
  // Optimization using cost function of algorithm 9.1
  φ = argminφ − Σi=1..I log Bernwi[1/(1 + exp[−φT xi])] − log Normφ[0, σp² I]
  // Compute Hessian at peak
  H = 1/σp²
  for i=1 to I do
    yi = 1/(1 + exp[−φT xi])       // Compute prediction yi
    H = H + yi(1 − yi)xi xiT       // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  µ = φ
  Σ = H⁻¹
  // Compute mean and variance of activation
  µa = µT x∗
  σa² = x∗T Σ x∗
  // Approximate integral to get Bernoulli parameter
  λ∗ = 1/(1 + exp[−µa/√(1 + πσa²/8)])
  // Compute predictive distribution
  P r(w∗|x∗) = Bernw∗[λ∗]
end
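The final prediction step (from Laplace mean and covariance to the Bernoulli parameter λ∗) is compact enough to sketch directly; the function name is my own:

```python
import numpy as np

def bayes_logreg_predict(x_star, mu, Sigma):
    """Approximate Bernoulli parameter lambda*, given the Laplace mean mu
    and covariance Sigma of the weights (algorithm 9.2 prediction step)."""
    mu_a = mu @ x_star                       # mean of activation
    var_a = x_star @ Sigma @ x_star          # variance of activation
    return 1.0 / (1.0 + np.exp(-mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0)))
```

Two sanity checks: with zero covariance the prediction reduces to the plain MAP sigmoid, and increasing the posterior uncertainty moderates the prediction toward 0.5.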
• 22. 22 Models for classification Algorithm 9.3: MAP dual logistic regression
The dual logistic regression model is the same as the logistic regression model, but now we represent the parameters φ as a weighted sum φ = Xψ of the original data points, where X is a matrix containing all of the training data, giving the prediction:

P r(w|ψ, x) = Bernw [1/(1 + exp[−ψT XT x])].

We place a normal prior on the dual parameters ψ and optimize them using the criterion:

ψ̂ = argminψ − Σi=1..I log Bernwi [1/(1 + exp[−ψT XT xi])] − log Normψ[0, σp² I].

A typical approach would be to use a second-order optimization method such as the Newton method (e.g., using Matlab's fminunc function). The optimization method will need to compute the cost function and its derivative and Hessian with respect to the parameter ψ; the calculations for these are given in the algorithm below.

Algorithm 9.3: Cost and derivatives for MAP dual logistic regression
Input : Binary world states {wi}i=1..I, observed data {xi}i=1..I, parameters ψ
Output: cost L, gradient g, Hessian H
begin
  // Initialize cost, gradient, Hessian
  L = I log[2πσp²]/2 + ψTψ/(2σp²)
  g = ψ/σp²
  H = 1/σp²
  // Form compound data matrix
  X = [x1, x2, . . . xI]
  // For each data point
  for i=1 to I do
    // Compute prediction yi
    yi = 1/(1 + exp[−ψT XT xi])
    // Update log likelihood, gradient and Hessian
    if wi == 1 then L = L − log[yi] else L = L − log[1 − yi] end
    g = g + (yi − wi)XT xi
    H = H + yi(1 − yi)XT xi xiT X
  end
end
• 23. Models for classification 23 Algorithm 9.4: Dual Bayesian logistic regression
In dual Bayesian logistic regression, we aim to compute the predictive distribution P r(w∗|x∗) over the binary world state w∗ for a new data example x∗. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ∗ = P r(w∗ = 1|x∗). The method works by first finding the MAP solution to the dual problem (using the cost function in the previous algorithm). It then builds a Laplace approximation based on this result and the Hessian at the MAP solution. Using the mean and variance of the Laplace approximation we can compute a probability distribution over the activation. We then use a further approximation to compute the integral over this distribution. As usual, we assume that we have prepended a one to every data vector, xi ← [1, xiT]T, to model the offset parameter elegantly.

Algorithm 9.4: Dual Bayesian logistic regression
Input : Binary world states {wi}i=1..I, observed data {xi}i=1..I, new data x∗
Output: Bernoulli parameter λ∗ from P r(w∗|x∗) for new data x∗
begin
  // Optimization using cost function of algorithm 9.3
  ψ = argminψ − Σi=1..I log Bernwi[1/(1 + exp[−ψT XT xi])] − log Normψ[0, σp² I]
  // Compute Hessian at peak
  H = 1/σp²
  for i=1 to I do
    yi = 1/(1 + exp[−ψT XT xi])     // Compute prediction yi
    H = H + yi(1 − yi)XT xi xiT X   // Add term to Hessian
  end
  // Set mean and variance of Laplace approximation
  µ = ψ
  Σ = H⁻¹
  // Compute mean and variance of activation
  µa = µT XT x∗
  σa² = x∗T X Σ XT x∗
  // Compute approximate prediction
  λ∗ = 1/(1 + exp[−µa/√(1 + πσa²/8)])
end

Algorithm 9.4b: Gaussian process classification
Notice that algorithm 9.4a and algorithm 9.3, which it uses, are defined entirely in terms of inner products of the form xiT xj, which usually occur in matrix multiplications like XT x∗. This means they are amenable to kernelization.
When we replace all of the inner products in algorithm 9.4a with a kernel function K[•, •], the resulting algorithm is called Gaussian process classification or kernel logistic regression.
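The kernel substitution in the prediction step can be sketched as follows. This is an illustrative fragment (names are my own): `psi` and `Sigma` are assumed to come from the dual MAP fit and its Laplace approximation, and `kernel` plays the role of K[•, •].

```python
import numpy as np

def gp_class_predict(psi, Sigma, X, x_star, kernel):
    """lambda* for the kernelized classifier: every inner product X^T x
    in algorithm 9.4 is replaced by a kernel evaluation K[., .]."""
    k_star = kernel(X, x_star[:, None]).ravel()   # plays the role of X^T x*
    mu_a = psi @ k_star                           # mean of activation
    var_a = k_star @ Sigma @ k_star               # variance of activation
    return 1.0 / (1.0 + np.exp(-mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0)))
```

With the linear kernel `lambda A, B: A.T @ B` this reproduces the explicit dual computation exactly, which is a useful consistency check before switching to a non-linear kernel.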
  • 24. 24 Models for classification Algorithm 9.5: Relevance vector classification Relevance vector classification is a version of the kernel logistic regression (Gaussian process classification) that encourages the dual parameters ψ to be sparse using a prior that is a product of t-distributions. Since there is one dual parameter for each of the I training examples, we introduce I hidden variables hi which control the tendency to be zero for each dimension. The algorithm is iterative and alternates between updating the hidden variables in closed form and finding the resulting MAP solutions. After the system has converged, we prune the model to remove dimensions where the hidden variable was large (> 1000 is a reasonable criterion); these dimensions contribute very little to the final prediction. Algorithm 9.5: Relevance vector classification Input : (D+1)×I data X, I ×1 binary world vector w, degrees of freedom, ν, kernel K[•, •] Output: Bernoulli parameter λ∗ from P r(w∗ |x∗ ) for new data x∗ begin // Initialize I hidden variables to reasonable values H = diag[1, 1, . . . 
1] repeat // Find MAP solution using kernelized version of algorithm 9.3 ψ= argminψ − I log Bernwi [1/(1 + exp[−ψ T K[X, xi ]])] − log Normψ [0, H−1 ] i=1 // Compute Hessian S at peak a S=H for i=1 to I do yi = 1/(1 + exp[−ψ T K[X, xi ]]) // Compute prediction y S = S + yi (1 − yi )K[X, xi ]K[xi , X] // Add term to Hessian end // Set mean and variance of Laplace approximation µ=ψ Σ = −S−1 // For each data example for I=1 to I do // Update the diagonal entry of H hii = (1 − hii Σii + ν)/(µ2 + ν) i end until No further improvement // Remove rows of µ, cols of X, rows and cols of Σ where hdd is large [µ, Σ, X] = prune[µ, Σ, X] // Compute mean and variance of activation µa = µT K[X, x∗ ] 2 σa = K[x∗ , X]ΣK[X, x∗ ] // Compute approximate prediction 2 λ∗ = 1/(1 + exp[−µa / 1 + πσa /8]) end a Notice that I have used S to represent the Hessian here, so that it’s not confused with the diagonal matrix H containing the hidden variables. Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
  • 25. Models for classification 25 Algorithm 9.6: Incremental fitting for logistic regression The incremental fitting approach applies to the non-linear model 1 P r(w|φ, x) = Bernw 1 + exp[−φ0 − K k=1 φk f[xi , ξ k ]] . The method initializes the weights {φk }K to zero and then optimizes them one by one. At k=1 the first stage we optimize φ0 , φ1 and ξ 1 . Then we optimize φ0 , φ2 and ξ 2 and so on. Algorithm 9.6: Incremental logistic regression Input : Binary world state {wi }I , observed data {xi }I i=1 i=1 Output: ML parameters φ0 , {φk , ξk }K k=1 begin // Initialize parameters φ0 = 0 // Initialize activation for each data point (sum of first k-1 functions) for i=1 to I do ai = 0 end for k=1 to K do // Reset offset parameter φ0 for i=1 to I do ai = ai − φ0 end [φ0 , φk , ξk ] = argminφ0 ,φk ,ξk − I i=1 log [Bernwi [1/(1 + exp[−ai − φ0 − φk f[xi , ξk ]])]] for i=1 to I do ai = ai + φ0 + φk f[xi , ξk ] end end end Obviously, the derivatives for the optimization algorithm depend on the choice of non-linear function. For example, if we use the function f[xi , ξ k ] = arctan[ξ T xi ] where we have added a k 1 to the start of each data vector xi , then the first derivatives of the cost function L are: I ∂L ∂φ0 = ∂L ∂φk = ∂L ∂ξ = (yi − wi ) i=1 I (yi − wi )atan[ξ T xi ] k i=1 I (yi − wi )φk i=1 1 1 + (ξ T xi )2 k xi where yi = 1/(1 + exp[−ai − φ0 − φk f[xi , ξ k ]] is the current prediction for the ith data point. Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
• 26. 26 Models for classification Algorithm 9.7: Logitboost
Logitboost is a special case of non-linear logistic regression, with Heaviside step functions:

P r(w|φ, x) = Bernw [1/(1 + exp[−φ0 − Σk=1..K φk heaviside[f[x, ξck]]])].

One interpretation is that we are combining a set of 'weak classifiers', each of which decides on the class based on whether the data lies to the left or the right of the step in its step function. The step functions do not have smooth derivatives, so at the kth stage the algorithm exhaustively considers a set of possible step functions {heaviside[f[x, ξm]]}m=1..M, choosing the index ck ∈ {1, 2, . . . M} that is best, and simultaneously optimizes the weights φ0 and φk.

Algorithm 9.7: Logitboost
Input : Binary world states {wi}i=1..I, observed data {xi}i=1..I, functions {fm[x, ξm]}m=1..M
Output: ML parameters φ0, {φk}k=1..K, indices {ck} ∈ {1 . . . M}
begin
  // Initialize parameters and activations
  φ0 = 0
  for i=1 to I do ai = 0 end
  for k=1 to K do
    // Find best weak classifier by looking at magnitude of gradient
    ck = argmaxm[(Σi=1..I (yi − wi)f[xi, ξm])²]   where yi = 1/(1 + exp[−ai])
    // Remove effect of offset parameter
    for i=1 to I do ai = ai − φ0 end
    φ0 = 0
    // Perform optimization
    [φ0, φk] = argminφ0,φk − Σi=1..I log Bernwi[1/(1 + exp[−ai − φ0 − φk f[xi, ξck]])]
    // Compute new activations
    for i=1 to I do ai = ai + φ0 + φk f[xi, ξck] end
  end
end

The derivatives for the optimization are given by

∂L/∂φ0 = Σi=1..I (yi − wi)
∂L/∂φk = Σi=1..I (yi − wi)f[xi, ξck],

where yi = 1/(1 + exp[−ai − φ0 − φk f[xi, ξck]]) is the current prediction for the ith data point.
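A toy version of the boosting loop can be sketched with axis-aligned stumps as the weak classifiers. This is an illustrative simplification, not the booklet's algorithm verbatim: the two weights are fitted by plain gradient descent rather than a Newton method, and each round fits an additive offset correction instead of re-fitting the global offset; all names are my own.

```python
import numpy as np

def fit_logitboost(x, w, thresholds, rounds=5, steps=200, lr=0.5):
    """Tiny logitboost on 1-D data: weak classifiers are stumps
    heaviside[x - t] for candidate thresholds t."""
    a = np.zeros_like(x)                                  # activations a_i
    model = []
    for _ in range(rounds):
        f = (x[None, :] > thresholds[:, None]).astype(float)  # stump outputs
        y = 1 / (1 + np.exp(-a))                          # current predictions
        c = np.argmax((f @ (y - w)) ** 2)                 # largest-gradient stump
        fc = f[c]
        phi0 = phik = 0.0
        for _ in range(steps):                            # fit the two weights
            y = 1 / (1 + np.exp(-(a + phi0 + phik * fc)))
            phi0 -= lr * np.mean(y - w)                   # dL/dphi0 ~ mean(y - w)
            phik -= lr * np.mean((y - w) * fc)            # dL/dphik ~ mean((y - w) f)
        a = a + phi0 + phik * fc
        model.append((thresholds[c], phi0, phik))
    return a, model
```

On linearly separable 1-D data a few rounds suffice to classify every training point correctly (the sign of the activation matches the label).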
  • 27. Models for classification 27 Algorithm 9.8: Multi-class logistic regression The multi-class logistic regression model is defined as P r(w|φ, x) = Catw softmax[φT x, φT x, . . . φT x] . 1 2 N where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the negative log probability with respect to the parameter vector φ = [φ1 ; φ2 ; . . . ; φN ]. We need to compute this value, and the derivative and Hessian with respect to the parameters {φ}m . Algorithm 9.8: Cost function, derivative and Hessian for multi-class logistic regression Input : World state {wi }I , observed data {xi }I , parameters {φ}N n=1 i=1 i=1 Output: cost L, gradient g, Hessian H begin // Initialize cost, gradient, Hessian L=0 for n=1 to N do gn = 0 // Part of gradient relating to φn for m=1 to N do Hmn = 0 // Portion of Hessian relating φn and φm end end // For each data point for i=1 to I do // Compute prediction y yi = softmax[φT xi , φT xi , . . . φT xi ] 1 2 k // Update log likelihood th L = L + log[yi,wi ] // Take wi element of yi // Update gradient and Hessian for n=1 to N do gn = gn + (yin − δ[wi − n])xi for m=1 to M do Hmn = Hmn + yim (δ[m − n] − yin )xi xT i end end end // Assemble final gradient vector g = [g1 ; g2 ; . . . gk ] // Assemble final Hessian for n=1 to N do Hn = [Hn1 , Hn2 , . . . HnN ] end H = [H1 ; H2 ; . . . HN ] end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
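The cost and gradient of the multi-class model vectorize naturally. This sketch (my own naming and layout) stores the parameters as rows of a matrix `Phi`, uses 0-based labels, and omits the Hessian assembly for brevity:

```python
import numpy as np

def softmax_cost(Phi, X, w):
    """Negative log likelihood and gradient for multi-class logistic
    regression.  Phi: (N, D+1); columns of X are prepended-1 examples;
    w holds labels in 0..N-1."""
    A = Phi @ X                               # activations, N x I
    A = A - A.max(axis=0)                     # stabilize the softmax
    Y = np.exp(A) / np.exp(A).sum(axis=0)     # predictions y_i
    I = X.shape[1]
    L = -np.log(Y[w, np.arange(I)]).sum()     # pick out the w_i-th entries
    delta = np.zeros_like(Y)
    delta[w, np.arange(I)] = 1.0              # one-hot targets delta[w_i - n]
    G = (Y - delta) @ X.T                     # gradient w.r.t. Phi
    return L, G
```

As with the binary case, a finite-difference check against the cost confirms the gradient before plugging into a second-order optimizer.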
  • 28. 28 Models for classification Algorithm 9.9: Multi-class logistic classification tree Here, we present a deterministic multi-class classification tree. At the j th branching point, it selects the index cj ∈ {1, 2, . . . , M } indicating which of a pre-determined set of classifiers {g[x, ωm ]}M should be chosen. m=1 Algorithm 9.9: Multiclass classification tree Input : World state {wi }I , data {xi }I }M , classifiers {g[x, ωm ]}M m=1 i=1 i=1 m=1 Output: Categorical params at leaves {λp }J+1 , Classifier indices {cj }J j=1 p=1 begin enqueue[x1...I , w1...I ] // Store data and class labels // For each node in tree for j = 1 to J do [x1...I , w1...I ] = dequeue[ ] // Retrieve data and class labels for m = 1 to M do // Count frequency for kth class in left and right branches for k = 1 to K do (l) nk = I δ[g[xi , ωm ] − 0]δ[wi − k] i=1 (r) nk = I δ[g[xi , ωm ] − 1]δ[wi − k] i=1 end // Compute log likelihood (l) (l) // Contribution from left branch lm = K log[nk / K nq ] k=1 q=1 (r) (r) lm = lm + K log[nk / K nq ] // Contribution from right branch k=1 q=1 end // Store index of best classifier cj = argmaxm [lm ] // Partition into two sets Sl = {}; Sr = {} for i=1 to I do if g[xi , ωcj ] == 0 then SL = Sl ∪ i else SR = Sr ∪ i end end // Add to queue of nodes to process next enqueue[xSl , wSl ] enqueue[xSr , wSr ] end // Recover categorical parameters at J + 1 leaves for p = 1 to J + 1 do [x1...I , w1...I ] = dequeue[ ] for k=1 to K do nk = I δ[wi − k] // Frequency of class k at the pth leaf i=1 end λp = n/ K nk // ML solution for categorical parameter k=1 end end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
• 29. Graphical models 29 Algorithm 10.1: Gibbs' sampling from a discrete undirected model
This algorithm generates samples from an undirected model with distribution

P r(x1...D) = (1/Z) Πc=1..C φc[Sc],

where the cth function φc[Sc] operates on a subset Sc ⊂ {x1, x2, . . . , xD} of the D variables and returns a positive number. For this algorithm, we assume that each variable {xd}d=1..D is discrete and takes values xd ∈ {1, 2, . . . , K}.
In Gibbs' sampling, we choose each variable in turn and update it by sampling from its conditional distribution given the current values of the other variables. Since the variables are discrete, this conditional distribution is a categorical distribution (a histogram), so we can sample from it by partitioning the range 0 to 1 according to the probabilities, drawing a uniform sample between 0 and 1, and seeing which partition it falls into.

Algorithm 10.1: Gibbs' sampling from undirected model
Input : Potential functions {φc[Sc]}c=1..C
Output: Samples {xt}t=1..T
begin
  // Initialize first sample in chain
  x0 = x(0)
  // For each time sample
  for t=1 to T do
    xt = xt−1
    // For each dimension
    for d=1 to D do
      // For each possible value of the dth variable
      for k=1 to K do
        // Set the variable to k
        xtd = k
        // Compute the unnormalized conditional probability
        λk = 1
        for c s.t. xd ∈ Sc do
          λk = λk · φc[Sc]
        end
      end
      // Normalize the probabilities
      λ = λ / Σk=1..K λk
      // Draw from categorical distribution
      xtd = Sample[Catxtd[λ]]
    end
  end
end

It is normal to discard the first few thousand entries so that the initial conditions are forgotten. Then entries are chosen that are spaced apart to avoid correlation between the samples.
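The sampler above can be sketched for the special case of pairwise cliques. This is an illustrative fragment (my own data layout): `phis[(a, b)]` holds the K×K potential table for clique (a, b), and `neighbors[d]` lists the cliques touching variable d.

```python
import numpy as np

def gibbs_sample(phis, neighbors, D, K, T, rng):
    """Gibbs sampler for a discrete pairwise undirected model."""
    x = rng.integers(0, K, size=D)                  # arbitrary initial state
    samples = []
    for _ in range(T):
        for d in range(D):
            lam = np.ones(K)
            for k in range(K):
                x[d] = k                            # set the variable to k
                for (a, b) in neighbors[d]:         # product of touching potentials
                    lam[k] *= phis[(a, b)][x[a], x[b]]
            lam /= lam.sum()                        # normalize
            x[d] = rng.choice(K, p=lam)             # draw from the categorical
        samples.append(x.copy())
    return np.array(samples)
```

On a two-variable model with a strong agreement potential, the long-run fraction of agreeing samples should approach the model's true agreement probability (10/12 ≈ 0.83 below).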
• 30. 30 Graphical models Algorithm 10.2: Contrastive divergence for learning undirected models
The contrastive divergence algorithm is used to learn the parameters θ of an undirected model of the form

P r(x1...D, θ) = (1/Z[θ]) f(x, θ) = (1/Z[θ]) Πc=1..C φc[Sc, θ],

where the cth function φc[Sc] operates on a subset Sc ⊂ {x1, x2, . . . , xD} of the D variables and returns a positive number. It is generally not possible to maximize the log likelihood either in closed form or via a non-linear optimization algorithm, because we cannot compute the normalizing constant Z[θ], which also depends on the parameters. The contrastive divergence algorithm gets around this problem by computing an approximate gradient by means of generating J samples {x∗j}j=1..J and then using this approximate gradient to perform gradient ascent. The approximate gradient is computed as

∂L/∂θ ≈ −(I/J) Σj=1..J ∂log[f(x∗j, θ)]/∂θ + Σi=1..I ∂log[f(xi, θ)]/∂θ.

In the algorithm below, the function gradient[x, θ] returns the derivative of the unnormalized log likelihood (i.e., the two terms on the right-hand side). We have also made the simplifying assumption that there is one sample x∗i for each training example xi (i.e., I = J). In practice, computing valid samples is a burden, so in this algorithm we generate the ith sample x∗i by taking a single Gibbs' sampling step from the ith training example.

Algorithm 10.2: Contrastive divergence learning of undirected model
Input : Data {xi}i=1..I, learning rate α
Output: ML parameters θ
begin
  // Initialize parameters
  θ = θ(0)
  repeat
    for i=1 to I do
      // Take a single Gibbs' sampling step from the ith data point
      x∗i = Sample[xi, θ]
    end
    // Update parameters
    // Function gradient[•, •] returns derivative of log of unnormalized probability
    θ = θ + α Σi=1..I (gradient[xi, θ] − gradient[x∗i, θ])
  until No further average change in θ
end
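The contrastive-divergence update can be demonstrated on a deliberately tiny model. This is a toy sketch of my own construction: a single coupling parameter θ in f(x) = exp(θ x1 x2) with x ∈ {−1, +1}², for which ∂log f/∂θ = x1 x2, and the sample x∗ comes from one Gibbs sweep off each datum (CD-1).

```python
import numpy as np

def cd1_train(data, iters=200, alpha=0.05, rng=None):
    """CD-1 learning of theta in the toy model f(x) = exp(theta * x1 * x2)."""
    rng = rng or np.random.default_rng(0)
    theta = 0.0
    for _ in range(iters):
        grad = 0.0
        for x in data:
            xs = x.copy()
            for d in range(2):                          # one Gibbs sweep
                other = xs[1 - d]
                p_plus = np.exp(theta * other)
                p_plus /= p_plus + np.exp(-theta * other)
                xs[d] = 1 if rng.random() < p_plus else -1
            # gradient[x] - gradient[x*] with grad log f = x1*x2
            grad += x[0] * x[1] - xs[0] * xs[1]
        theta += alpha * grad / len(data)
    return theta
```

Trained on data whose two components always agree, θ should drift clearly positive: the data term of the gradient keeps exceeding the sample term until the model itself generates agreeing pairs.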
  • 31. Models for chains and trees 31 Algorithm 11.1: Dynamic programming for chain model This algorithm computes the maximum a posteriori solution for a chain model. The directed chain model has a likelihood and prior that factorize as N P r(x|w) P r(xn |wn ) = n=1 N P r(w) P r(wn |wn−1 ), = n=2 respectively. To find the MAP solution, we minimize the negative log posterior: N w1...N ˆ = argmin − w1...N N log [P r(xn |wn )] − n=1 N = argmin w1...N N Un (wn ) + n=1 log [P r(wn |wn−1 )] n=2 Pn (wn , wn−1 ) . n=2 This cost function can be optimized using dynamic programming. We pass from variables x1 to xN , computing the minimum cost to reach each point, and caching the route. We find the overall minimum at xN and retrieve the cached route. Here, denote the unary cost Un (wn = k) for the nth variable taking value k by Un,k , and the pairwise cost Pn (wn = k, wn−1 = l) for the nth variable taking value k and the n − 1th variable taking value l by Pn,k,l . Algorithm 11.1: Dynamic programming in chain N,K,K Input : Unary costs {Un,k }N,K n=1,k=1 , Pairwise costs {Pn,k,l }n=2,k=1,l=1 Output: Minimum cost path {wn }N n=1 begin // Initialize cumulative sums Sn,k for k=1 to K do S1,k = U1,k end // Work forward through chain for n=2 to N do // Find minimum cost to get to this node Sn,k = Un,k + minl [Sn−1,l + Pn,k,l ] // Store route by which we got here Rn,k = argminl [Sn−1,l + Pn,k,l ] end // Find node yN with overall minimum cost wN = argmink [SN,k ] // Trace back to retrieve route for n=N to 2 do wn−1 = Rn,wn end end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
  • 32. 32 Models for chains and trees Algorithm 11.2: Dynamic programming for tree model This algorithm can be used to compute the MAP solution for a directed or undirected model which has the form of a tree. As such, it generalizes algorithm 11.2 which is specialized for chains. As for the simpler case, the algorithm proceeds by working through the nodes, computing the minimum possible cost to reach this position and caching the route by which we reached this point. At the last node we compute the overall minimum cost and then trace back the route using the cached information. Here, denote the unary cost Un (wn = k) for the nth variable taking value k by Un,k . We denote the higher order cost for assigning value K to the nth variable based on its children ch[n] as Hn,k [ch[n]]. This might consist of pairwise, three-wise, or higher costs depending on the number of children in the graph. Algorithm 11.2: Dynamic programming in tree N,K Input : Unary costs {Un,k }N,K n=1,k=1 , higher order cost function {Hn,k [ch[n]]}n=1,k=1 N Output: Minimum cost path {wn }n=1 begin repeat // Retrieve nodes in an order so children always come before parents n = GetNextNode[ ] // For each possible value of this node for k=1 to K do // Compute the minimum cost for reaching here Sn,k = Un,k + min Sch[n] + Hn,k [ch[n]] a // Cache the route for reaching here (store |ch[n]| values) Rn,k = argmin Hn,k [Sch[n] + ch[n]] a end // Push node index onto stack push[n] // Until no more parents until pa[wn ] = {} // Find node wN with overall minimum cost wn = mink [Sn,k ] // Trace back to retrieve route for c=1 to N do n = pop[ ] if ch[n] = {} then wch[n] = Rn,wn end end end a This minimization is done over all the values of all of the children variables. With a pairwise term, this would be a single minimization over the single previous variable that fed into this one. With a three-wise term is would be a joint minimization over both children variables etc. Copyright c 2012 by Simon Prince. 
• 33. Models for chains and trees 33 Algorithm 11.3: Forward-backward algorithm
This algorithm computes the marginal posterior distributions P r(wn|x1...N) for a chain model. To find the marginal posteriors we perform a forward recursion and a backward recursion and multiply these two quantities together. Here, we use the term un,k to represent the likelihood P r(xn|wn = k) of the data xn at the nth node taking label k and the term pn,k,l to represent the prior term P r(wn = k|wn−1 = l) when the nth variable takes value k and the n−1th variable takes value l. Note that un,k and pn,k,l are probabilities, and are not the same as the unary and pairwise costs in the dynamic programming algorithms.

Algorithm 11.3: Forward-backward algorithm
Input : Likelihoods {un,k}n=1..N,k=1..K, prior terms {pn,k,l}n=2..N,k=1..K,l=1..K
Output: Marginal probability distributions {qn[wn]}n=1..N
begin
  // Initialize forward vector to likelihood of first variable
  for k=1 to K do
    f1,k = u1,k
  end
  // For each state of each subsequent variable
  for n=2 to N do
    for k=1 to K do
      // Forward recursion
      fn,k = un,k Σl=1..K pn,k,l fn−1,l
    end
  end
  // Initialize vector for backward pass
  for k=1 to K do
    bN,k = 1/K
  end
  // For each state of each previous variable
  for n=N to 2 do
    for k=1 to K do
      // Backward recursion
      bn−1,k = Σl=1..K un,l pn,l,k bn,l
    end
  end
  // Compute marginal posteriors
  for n=1 to N do
    for k=1 to K do
      // Take product of forward and backward messages and normalize
      qn[wn = k] = fn,k bn,k / Σl=1..K fn,l bn,l
    end
  end
end
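The two recursions vectorize cleanly. This sketch uses my own array layout: `u[n, k]` is the likelihood and `p[n, k, l]` the transition probability Pr(w_{n+1}=k | w_n=l) (0-based indices).

```python
import numpy as np

def forward_backward(u, p):
    """Marginal posteriors q_n[w_n] on a chain via forward-backward."""
    N, K = u.shape
    f = np.zeros((N, K))
    b = np.zeros((N, K))
    f[0] = u[0]                                # initialize forward vector
    for n in range(1, N):
        f[n] = u[n] * (p[n - 1] @ f[n - 1])    # forward recursion
    b[-1] = 1.0 / K                            # initialize backward vector
    for n in range(N - 1, 0, -1):
        b[n - 1] = p[n - 1].T @ (u[n] * b[n])  # backward recursion
    q = f * b                                  # product of messages
    return q / q.sum(axis=1, keepdims=True)    # normalize each marginal
```

For small chains the marginals can be checked against explicit enumeration of the joint distribution.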
  • 34. 34 Models for chains and trees Algorithm 11.4: Sum product belief propagation The sum product algorithm proceeds in two phases: a forward pass and a backward pass. The forward pass distributes evidence through the graph and the backward pass collates this evidence. Both the distribution and collation of evidence are accomplished by passing messages from node to node in the factor graph. Every edge in the graph is connected to exactly one variable node, and each message is defined over the domain of this variable. In the description of the algorithm below, we denote the edges by {en }N , which joins n=1 node en1 to en2 . The edges are processed in such an order that all incoming edges to a function are processed before the outgoing message mn is passed. We first discuss the distribute phase. Algorithm 11.4: Sum product: distribute Input : Observed data {z∗ }n∈Sobs } , functions {φk [Ck ]}K , edges {en }N n n=1 k=1 Output: Forward messages mn on each of the n edges en begin repeat // Retrieve edges in any valid order en = GetNextEdge[ ] // Test for type of edge - returns 1 if en2 is a function, else 0 t = isEdgeToFunction[en ] if t then // If this data was observed if en1 ∈ Sobs then mn = δ[z∗n1 ] e else // Find set of edges that are incoming to start of this edge S = {k : en1 == ek2 } // Take product of messages mn = k∈S mk // Add edge to stack push[en ] end else // Find set of edges incoming to start of this edge S = {k : en1 == ek2 } // Find all variables connected to this function V = eS1 ∪ en2 // Take product of messages mn = y∈S φn [yV ] k∈S mk // Add edge to stack push[n] end until pa[en ] = {} end This algorithm continues overleaf... Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
• 35. Models for chains and trees 35 Algorithm 11.4b: Collate and compute marginal distributions
After the distribute stage is complete (one message has been passed along each edge in the graph) we commence the second pass through the variables. This happens in the opposite order to the first stage (accomplished by popping edges off the stack). Now, we collate the evidence and compute the normalized distributions at each node.

Algorithm 11.4b: Sum product: collate and compute marginal distributions
Input : Observed data {z∗n}n∈Sobs, functions {φk[Ck]}k=1..K, edges {en}n=1..N
Output: Marginal probability distributions {qn[yn]}n=1..N
begin
  // Collate evidence
  repeat
    // Retrieve edges in opposite order
    n = pop[ ]
    // Test for type of edge - returns 1 if en2 is a function, else 0
    t = isEdgeToFunction[en]
    if t then
      // Find set of edges incoming to function node
      S = {k : en2 == ek1}
      // Find all variables connected to this function
      V = eS2 ∪ en1
      // Take product of messages
      bn = Σy∈S φn[yS] Πk∈S bk
    else
      // Find set of edges that are incoming to data node
      S = {k : en2 == ek1}
      // Take product of messages
      bn = Πk∈S bk
    end
  until stack empty
  // Compute distributions at nodes
  for k=1 to K do
    // Find sets of edges that are incoming to this node
    S1 = {n : en2 == k}
    S2 = {n : en1 == k}
    qk = Πn∈S1 mn Πn∈S2 bn
  end
end
  • 36. 36 Models for grids Algorithm 12.1: Binary graph cuts This algorithm assumes that we have N variables each of which takes a binary value. Their connections are indicated by a series of flags {Emn }N,N which are set to one if the variables n,m=1 are connected (and have an associated pairwise term) or zero otherwise. This algorithm sets up the graph but doesn’t find the min-cut solution. Consult a standard algorithms text for details of how to do this. Algorithm 12.1: Binary graph cuts Input : Unary costs {Un (k)}N,K , pairwise costs {Pn,m (k, l)}N,N,K,K ,flags {emn, }N,N n=1,m=1 n,k=1 n,m,k,l=1 Output: Label assignments wn begin // Initialize graph to empty G = {} for n=1 to N do // Create edges from source and to sink and set capacity to zero G = G ∪ {s, n}; csn = 0 G = G ∪ {n, t}; cnt = 0 // If edge between m and n is desired if em,n = 1 then G = G ∪ {m, n}; cnm = 0 G = G ∪ {n, m}; cmn = 0 end end // Add costs to edges for n=1 to N do csn = csn + Un (0) cnt = cnt + Un (1) for m=1 to n − 1 do if em,n = 1 then cnm = cnm + Pmn (1, 0) − Pmn (1, 1) − Pmn (0, 0) cmn = cmn + Pmn (1, 0) csm = csm + Pmn (0, 0) cnt = cnt + Pmn (1, 1) end end end C = Reparameterize[C] // Ensures all capacities are positive (see overleaf) G = ComputeMinCut[G, C] // Augmenting paths or similar // Read off world state values based on new (cut) graph for n=1 to N do if {s, n} ∈ G then wn = 1 else wn = 0 end end end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
  • 37. Models for grids 37 Algorithm 12.2: Reparameterization for graph cuts The previous algorithm relies on a max-flow / min cut algorithm such as augmenting paths or push-relabel. For these algorithms to converge, it is critical that all of the capacities are non-negative. The process of making them non-negative is called re-parameterization. It is only possible in certain special cases, and here the problem is known as submodular. Cost functions in vision tend to encourage smoothing and are submodular. Algorithm 12.2: Reparameterization for binary graph cut Input : Edge flags {emn }N,N , capacities {cmn } : em,n = 1 m,n=1 Output: Modified graph with non-negative capacities begin // For each node pair for n=1 to N do for m=1 to n − 1 do // If an edge between the nodes exist if em,n = 1 then // Test if submodular and return error code if not if cnm < 0 && cmn < −cnm then return[-1] end if cmn < 0 && cnm < −cmn then return[-1] end // Handle links between source and sink if cnm < 0 then β = cnm end if cmn < 0 then β = −cmn end cnm = cnm − β cmn = cmn + β csm = csm + β cmt = cmt + β end end // Handle links between source and sink α = min[csn , cnt ] csn = csn − α cnt = cnt − α end end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
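The condition that the reparameterization test above enforces can be stated directly on the 2×2 pairwise cost table: the construction yields non-negative capacities exactly when the costs are submodular. A one-line check (the helper name is my own):

```python
def is_submodular(P):
    """Standard submodularity condition for a binary pairwise cost table P:
    P(0,0) + P(1,1) <= P(0,1) + P(1,0).  Smoothing costs satisfy this."""
    return P[0][0] + P[1][1] <= P[0][1] + P[1][0]
```

For example, a Potts-style smoothing cost passes the test, while an anti-smoothing cost (which rewards disagreement) fails it and cannot be handled by the binary graph-cut construction.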
  • 38. 38 Models for grids Algorithm 12.3: Multi-label graph cuts This algorithm assumes that we have N variables each of which takes one of K values. Their connections are indicated by a set of flags {emn }N,N which are set to one if the variables are n,m=1 connected (and have an associated pairwise term) or zero otherwise. We construct a graph that has N · (K + 1) nodes where the first K + 1 nodes pertain to the first variable and so on. Algorithm 12.3: Multilabel graph cuts Input : Unary costs {Un (k)}N,K , pairwise costs {Pn,m (k, l)}N,N,K,K , flags {emn, }N,N n=1,m=1 n,k=1 n,m,k,l=1 Output: Label assignments wn begin G = {} // Initialize graph to empty for n=1 to N do // Create edges from source and to sink and set costs G = G ∪ {s, (n − 1)(K + 1) + 1}; cs,(n−1)(K+1)+1 = ∞ G = G ∪ {n(K + 1), t}; c,n(K+1)t = ∞ // Create edges within columns and set costs for k=1 to K do G = G ∪ {(n − 1)(K + 1) + k, (n − 1)(K + 1) + k + 1} c(n−1)(K+1)+k,(n−1)(K+1)+k+1 = U(n−1)(K+1)+k,k G = G ∪ {(n − 1)(K + 1) + k + 1, (n − 1)(K + 1) + k} c(n−1)(K+1)+k+1,(n−1)(K+1)+k = ∞ end // Create edges between columns and set costs for m=1 to n − 1 do if em,n = 1 then for k=1 to K do for L=2 to K + 1 do G = G ∪ {(n − 1)(K + 1) + k(m − 1)(K + 1) + l} c(n−1)(K+1)+k(m−1)(K+1)+l = Pn,m (k, l − 1) + Pn,m (k − 1, l) − Pn,m (k, l) − Pn,m (k − 1, l − 1) end end end end end C = Reparameterize[C] // Ensures all capacities are positive (see book) G = ComputeMinCut[G, C] // Augmenting paths or similar tcpRead off values for n=1 to N do wn = 1 for k=1 to K do if {(n − 1)(K + 1) + k, (n − 1)(K + 1) + k} ∈ G] then wn = wn + 1 end end end end Copyright c 2012 by Simon Prince. This latest version of this document can be downloaded from http://www.computervisionmodels.com.
  • 39. Models for grids 39

Algorithm 12.4: Alpha-expansion algorithm

The alpha-expansion algorithm works by breaking the solution down into a series of binary problems, each of which can be solved exactly. At each iteration, we choose one of the K label values α, and for each pixel we consider either retaining the current label or switching it to α. The name alpha-expansion derives from the fact that the space occupied by label α in the solution expands at each iteration. The process is iterated until no choice of α causes any change. Each expansion move is guaranteed to lower the overall objective function, although the final result is not guaranteed to be the global minimum.

Algorithm 12.4: Alpha expansion algorithm (main loop)
Input : Unary costs {Un(k)}, pairwise costs {Pn,m(k,l)}, flags {emn}
Output: Label assignments {wn} for n = 1...N
begin
  // Initialize labels in some way - perhaps to minimize unary costs
  w = w0
  // Compute objective
  L = Σ_{n=1}^{N} Un(wn) + Σ_{n=1}^{N} Σ_{m=1}^{N} emn Pn,m(wn, wm)
  repeat
    // Store initial objective
    L0 = L
    // For each label in turn
    for k = 1 to K do
      // Try to expand this label (see overleaf)
      w = AlphaExpand[w, k]
    end
    // Compute new objective
    L = Σ_{n=1}^{N} Un(wn) + Σ_{n=1}^{N} Σ_{m=1}^{N} emn Pn,m(wn, wm)
  until L = L0
end

In the alpha-expansion graph construction, there is one vertex associated with each pixel. Each of these vertices is connected to the source (representing keeping the original label) and the sink (representing the label α). To separate source from sink, we must cut one of these two edges at each pixel. The choice of edge will determine whether we keep the original label or set it to α. Accordingly, we associate the unary costs for each pixel being set to α or keeping its original label with the two links from each pixel. If the pixel already has label α, then we set the cost of being set to α to ∞.
The remaining structure of the graph is dynamic: it changes at each iteration depending on the choice of α and the current labels. There are four possible relationships between adjacent pixels:
  • They can both already be set to α.
  • One can be set to α and the other to another value β.
  • Both can be set to the same other value β.
  • They can be set to two other values β and γ.
  • 40. 40 Models for grids

Algorithm 12.4b: Alpha expansion (expand)
Input : Costs {Un(k)}, {Pn,m(k,l)}, expansion label k, states {wn}
Output: New label assignments {wn}
begin
  G = {}  // Initialize graph to empty
  z = N   // Counter for new nodes added to graph
  for n = 1 to N do
    G = G ∪ {s, n};  csn = Un(k)  // Connect pixel node to source and set cost
    if wn = k then
      G = G ∪ {n, t};  cnt = ∞  // Connect pixel node to sink and set cost
    else
      G = G ∪ {n, t};  cnt = Un(wn)  // Connect pixel node to sink and set cost
    end
    for m = 1 to n do
      if emn = 1 then
        if (wn = k || wm = k) then
          if wn ≠ k then
            G = G ∪ {n, m};  cnm = Pn,m(wm, wn)  // Case 2a
          end
          if wm ≠ k then
            G = G ∪ {m, n};  cmn = Pn,m(wn, wm)  // Case 2b
          end
        else if wn = wm then
          G = G ∪ {n, m};  cnm = Pn,m(k, wn)  // Case 3
          G = G ∪ {m, n};  cmn = Pn,m(wn, k)
        else
          z = z + 1  // Increment new node counter
          G = G ∪ {n, z};  cnz = Pn,m(k, wn);  czn = ∞  // Case 4
          G = G ∪ {m, z};  cmz = Pn,m(wm, k);  czm = ∞
          G = G ∪ {z, t};  czt = Pn,m(wm, wn)
        end
      end
    end
  end
  C = Reparameterize[C]  // Ensures all capacities are positive
  G = ComputeMinCut[G, C]  // Augmenting paths or similar
  // Read off values
  for n = 1 to N do
    if {n, t} ∈ G then wn = k end
  end
end
  • 41. Preprocessing 41

Algorithm 13.1: Principal components analysis

The goal of PCA is to approximate a set of multivariate data {xi} for i = 1...I with a second set of variables of reduced size {hi}, so that

  xi ≈ µ + Φhi,

where Φ is a rectangular matrix whose columns are unit length and orthogonal to one another, so that ΦᵀΦ = I. This formulation assumes that the number of original data dimensions D is higher than the number of training examples I, and so works by taking the singular value decomposition of the I × I matrix XᵀX to compute the dual principal components Ψ before recovering the original principal components Φ.

Algorithm 13.1: Principal components analysis (dual)
Input : Training data {xi} for i = 1...I, number of components K
Output: Mean µ, PCA basis functions Φ, low dimensional data {hi}
begin
  // Estimate mean
  µ = Σ_{i=1}^{I} xi / I
  // Form mean-zero data matrix
  X = [x1 - µ, x2 - µ, ..., xI - µ]
  // Do spectral decomposition and compute dual components
  [Ψ, L, Ψ] = svd[XᵀX]
  // Compute principal components
  Φ = XΨL^{-1/2}
  // Retain only the first K columns
  Φ = [φ1, φ2, ..., φK]
  // Convert data to low dimensional representation
  for i = 1 to I do
    hi = Φᵀ(xi - µ)
  end
  // Reconstruct data
  for i = 1 to I do
    x̃i = µ + Φhi
  end
end
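The dual route of Algorithm 13.1 can be sketched in a few lines of numpy. This is a minimal sketch under my own naming conventions, assuming the data fits in memory and K is smaller than the number of examples I (so that the corresponding dual eigenvalues are non-zero).

```python
import numpy as np

def dual_pca(data, K):
    # data: I x D array with D >> I. Returns the mean, the K principal
    # components (columns of Phi), and the K x I low-dimensional coefficients.
    mu = data.mean(axis=0)
    X = (data - mu).T                        # D x I mean-zero data matrix
    U, L, _ = np.linalg.svd(X.T @ X)         # dual components Psi = U
    Phi = (X @ U[:, :K]) / np.sqrt(L[:K])    # Phi = X Psi L^(-1/2), first K columns
    H = Phi.T @ X                            # h_i = Phi^T (x_i - mu)
    return mu, Phi, H
```

The columns of Phi come out orthonormal, and with K = I - 1 components the reconstruction µ + Φhi recovers the mean-centered training data exactly, since that is the rank of the centered data matrix.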
  • 42. 42 Preprocessing

Algorithm 13.2: k-means algorithm

The goal of the k-means algorithm is to partition a set of data {xi} for i = 1...I into K clusters. It can be thought of as approximating each data point with the associated cluster mean, so that xi ≈ µ_hi, where hi ∈ {1, 2, ..., K} is a discrete variable that indicates which cluster the ith point belongs to. The algorithm works by alternately (i) assigning data points to the nearest cluster center and (ii) updating each cluster center to be the mean of the data points assigned to it.

Algorithm 13.2: K-means algorithm
Input : Data {xi}, number of clusters K, data dimension D
Output: Cluster means {µk}, cluster assignment indices {hi}
begin
  // Initialize cluster means (one of many heuristics)
  µ = Σ_{i=1}^{I} xi / I  // Compute overall mean
  Σ = Σ_{i=1}^{I} (xi - µ)(xi - µ)ᵀ / I  // Compute overall covariance
  for k = 1 to K do
    µk = µ + Σ^{1/2} randn[D, 1]  // Randomly draw from normal model
  end
  // Main loop
  repeat
    // Compute distance from data points to cluster means
    for i = 1 to I do
      for k = 1 to K do
        dik = (xi - µk)ᵀ(xi - µk)
      end
      // Update cluster assignments based on closest cluster
      hi = argmin_k [dik]
    end
    // Update cluster means from data that was assigned to this cluster
    for k = 1 to K do
      µk = (Σ_{i=1}^{I} δ[hi - k] xi) / (Σ_{i=1}^{I} δ[hi - k])
    end
  until No further change in {µk}
end
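The alternation between the assignment and update steps can be sketched as follows. This is my own minimal version: for simplicity it initializes the means from randomly chosen data points rather than by sampling from a normal model as in Algorithm 13.2, and it leaves an empty cluster's mean unchanged.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    # X: I x D data array. Returns cluster means (K x D) and assignments (I,).
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    h = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every mean
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # I x K
        h_new = d.argmin(axis=1)
        if np.array_equal(h_new, h):
            break                      # assignments stable: converged
        h = h_new
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(h == k):
                mu[k] = X[h == k].mean(axis=0)
    return mu, h
```

On well-separated data this converges in a handful of iterations; in general k-means only finds a local optimum, so multiple restarts are common practice.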
  • 43. The pinhole camera 43

Algorithm 14.1: ML learning of camera extrinsic parameters

Given a known object with I distinct three-dimensional points {wi}, their corresponding projections in the image {xi}, and known camera parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ. The solution to this problem is to minimize

  Ω̂, τ̂ = argmin_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ]),

where pinhole[wi, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 from the book). The bulk of this algorithm consists of finding a good initial starting point for this minimization. This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.

Algorithm 14.1: ML learning of extrinsic parameters
Input : Intrinsic matrix Λ, pairs of points {xi, wi}
Output: Extrinsic parameters: rotation Ω and translation τ
begin
  for i = 1 to I do
    // Convert to normalized camera coordinates
    xi = Λ^{-1}[xi; yi; 1]
    // Compute linear constraints
    a1i = [ui, vi, wi, 1, 0, 0, 0, 0, -ui xi, -vi xi, -wi xi, -xi]
    a2i = [0, 0, 0, 0, ui, vi, wi, 1, -ui yi, -vi yi, -wi yi, -yi]
  end
  // Stack linear constraints
  A = [a11; a21; a12; a22; ... a1I; a2I]
  // Solve with SVD
  [U, L, V] = svd[A]
  b = v12  // extract last column of V
  // Extract estimates up to unknown scale
  Ω̃ = [b1, b2, b3; b5, b6, b7; b9, b10, b11]
  τ̃ = [b4; b8; b12]
  // Find closest rotation using Procrustes method
  [U, L, V] = svd[Ω̃]
  Ω = UVᵀ
  // Rescale translation
  τ = τ̃ Σ_{i=1}^{3} Σ_{j=1}^{3} (Ωij / Ω̃ij) / 9
  // Use these parameters as initial conditions in non-linear optimization
  [Ω, τ] = argmin_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ])
end
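The linear (DLT) initialization of Algorithm 14.1 can be sketched in numpy. This is my own minimal version: it assumes Λ = I so the image points are already in normalized coordinates, fixes the sign of the SVD null vector by requiring positive depth (τz > 0), estimates the unknown scale by least squares rather than the mean-of-ratios in the pseudocode, and omits the final non-linear refinement.

```python
import numpy as np

def extrinsic_dlt(W, X):
    # W: I x 3 world points, X: I x 2 normalized image points (Lambda = I).
    # Returns the DLT initialization of the rotation Omega and translation tau.
    rows = []
    for (u, v, w), (x, y) in zip(W, X):
        rows.append([u, v, w, 1, 0, 0, 0, 0, -u*x, -v*x, -w*x, -x])
        rows.append([0, 0, 0, 0, u, v, w, 1, -u*y, -v*y, -w*y, -y])
    b = np.linalg.svd(np.array(rows, dtype=float))[2][-1]   # null vector
    if b[11] < 0:
        b = -b                       # fix sign so that tau_z (depth) is positive
    Ot = b[[0, 1, 2, 4, 5, 6, 8, 9, 10]].reshape(3, 3)      # Omega up to scale
    tt = b[[3, 7, 11]]                                      # tau up to scale
    U, _, Vt = np.linalg.svd(Ot)                            # Procrustes step
    Omega = U @ Vt                                          # closest rotation
    s = np.sum(Omega * Ot) / np.sum(Ot * Ot)                # least-squares scale
    return Omega, s * tt
```

With six or more non-coplanar points and noise-free data, this recovers the exact pose; with noisy data it only provides the starting point for the non-linear minimization.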
  • 44. 44 The pinhole camera

Algorithm 14.2: ML learning of intrinsic parameters (camera calibration)

Given a known object with I distinct 3D points {wi} and their corresponding projections in the image {xi}, establish the camera parameters Λ. In order to do this we also need to estimate the extrinsic parameters. We use the criterion

  Λ̂ = argmin_Λ min_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ]),

where pinhole[wi, Λ, Ω, τ] represents the action of the pinhole camera (equation 14.8 from the book). This algorithm consists of an alternating approach in which the extrinsic parameters are found using the previous algorithm and then the intrinsic parameters are found in closed form. Finally, these estimates should form the starting point for a non-linear optimization process over all of the unknown parameters.

Algorithm 14.2: ML learning of intrinsic parameters
Input : World points {wi}, image points {xi}, initial Λ
Output: Intrinsic parameters Λ
begin
  // Main loop for alternating optimization
  for t = 1 to T do
    // Compute extrinsic parameters
    [Ω, τ] = calcExtrinsic[Λ, {wi, xi}]
    // Compute intrinsic parameters
    for i = 1 to I do
      // Compute matrix Ai  (ω_k• is the kth row of Ω)
      ai = (ω_1• wi + τx) / (ω_3• wi + τz)
      bi = (ω_2• wi + τy) / (ω_3• wi + τz)
      Ai = [ai, bi, 1, 0, 0; 0, 0, 0, bi, 1]
    end
    // Concatenate matrices and data points
    x = [x1; x2; ... xI]
    A = [A1; A2; ... AI]
    // Compute parameters
    θ = (AᵀA)^{-1} Aᵀ x
    Λ = [θ1, θ2, θ3; 0, θ4, θ5; 0, 0, 1]
  end
  // Refine parameters with non-linear optimization
  Λ̂ = argmin_Λ min_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ])
end
  • 45. The pinhole camera 45

Algorithm 14.3: Inferring 3D world points (reconstruction)

Given J calibrated cameras in known positions (i.e., cameras with known Λ, Ω, τ) viewing the same three-dimensional point w, and knowing the corresponding projections in the images {xj} for j = 1...J, establish the position of the point in the world.

As for the previous algorithms, the final solution depends on a non-linear minimization of the reprojection error between w and the observed data xj,

  ŵ = argmin_w Σ_{j=1}^{J} (xj - pinhole[w, Λj, Ωj, τj])ᵀ(xj - pinhole[w, Λj, Ωj, τj]).

The algorithm below finds good approximate initial conditions for this minimization using a closed-form least-squares solution.

Algorithm 14.3: Inferring 3D world position
Input : Image points {xj}, camera parameters {Λj, Ωj, τj}
Output: 3D world point w
begin
  for j = 1 to J do
    // Convert to normalized camera coordinates
    xj = Λj^{-1}[xj, yj, 1]ᵀ
    // Compute linear constraints
    a1j = [ω31j xj - ω11j, ω32j xj - ω12j, ω33j xj - ω13j]
    a2j = [ω31j yj - ω21j, ω32j yj - ω22j, ω33j yj - ω23j]
    bj = [τxj - τzj xj; τyj - τzj yj]
  end
  // Stack linear constraints
  A = [a11; a21; a12; a22; ... a1J; a2J]
  b = [b1; b2; ... bJ]
  // LS solution for parameters
  w = (AᵀA)^{-1} Aᵀ b
  // Refine parameters with non-linear optimization
  ŵ = argmin_w Σ_{j=1}^{J} (xj - pinhole[w, Λj, Ωj, τj])ᵀ(xj - pinhole[w, Λj, Ωj, τj])
end
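The closed-form least-squares stage of Algorithm 14.3 amounts to stacking two linear constraints per view and solving. A minimal sketch (my own names; image points are assumed to be in normalized camera coordinates, and the non-linear refinement is omitted):

```python
import numpy as np

def triangulate(xs, Omegas, taus):
    # xs: list of normalized image points (x, y); Omegas, taus: extrinsics
    # of each camera. Each view contributes rows (x*omega3 - omega1) . w = ...
    A, b = [], []
    for (x, y), Om, tau in zip(xs, Omegas, taus):
        A.append(x * Om[2] - Om[0]);  b.append(tau[0] - tau[2] * x)
        A.append(y * Om[2] - Om[1]);  b.append(tau[1] - tau[2] * y)
    w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return w
```

Two views already give four equations in the three unknowns; further views simply add more rows and the least-squares solution averages over them.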
  • 46. 46 Models for transformations

Algorithm 15.1: ML learning of Euclidean transformation

The Euclidean transformation model maps one set of 2D points {wi} for i = 1...I to another set {xi} with a rotation Ω and a translation τ. To recover these parameters we use the criterion

  Ω̂, τ̂ = argmin_{Ω,τ} - Σ_{i=1}^{I} log Norm_{xi}[Ωwi + τ, σ²I],

where Ω is constrained to be a rotation matrix so that ΩᵀΩ = I and det[Ω] = 1.

Algorithm 15.1: Maximum likelihood learning of Euclidean transformation
Input : Training data pairs {xi, wi}
Output: Rotation Ω, translation τ, variance σ²
begin
  // Compute mean of two data sets
  µw = Σ_{i=1}^{I} wi / I
  µx = Σ_{i=1}^{I} xi / I
  // Concatenate data into matrix form
  W = [w1 - µw, w2 - µw, ..., wI - µw]
  X = [x1 - µx, x2 - µx, ..., xI - µx]
  // Solve for rotation
  [U, L, V] = svd[WXᵀ]
  Ω = VUᵀ
  // Solve for translation
  τ = Σ_{i=1}^{I} (xi - Ωwi) / I
end
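The SVD solution of Algorithm 15.1 is a direct translation into numpy. A minimal sketch with my own names; note that, as in the pseudocode, there is no determinant correction, so very noisy data could in principle yield a reflection rather than a rotation.

```python
import numpy as np

def fit_euclidean(W, X):
    # W, X: 2 x I arrays of corresponding points with x_i ~ Omega w_i + tau.
    mw = W.mean(axis=1, keepdims=True)
    mx = X.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((W - mw) @ (X - mx).T)   # svd[W X^T]
    Omega = Vt.T @ U.T                                 # Omega = V U^T
    tau = (mx - Omega @ mw).ravel()                    # mean of x_i - Omega w_i
    return Omega, tau
```

On noise-free correspondences the rotation and translation are recovered exactly; with noise this is the maximum likelihood estimate under the spherical Gaussian model.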
  • 47. Models for transformations 47

Algorithm 15.2: ML learning of similarity transformation

The similarity transformation model maps one set of 2D points {wi} for i = 1...I to another set {xi} with a rotation Ω, a translation τ and a scaling factor ρ. To recover these parameters we use the criterion

  Ω̂, τ̂, ρ̂ = argmin_{Ω,τ,ρ} - Σ_{i=1}^{I} log Norm_{xi}[ρΩwi + τ, σ²I],

where Ω is constrained to be a rotation matrix so that ΩᵀΩ = I and det[Ω] = 1.

Algorithm 15.2: Maximum likelihood learning of similarity transformation
Input : Training data pairs {xi, wi}
Output: Rotation Ω, translation τ, scale ρ, variance σ²
begin
  // Compute mean of two data sets
  µw = Σ_{i=1}^{I} wi / I
  µx = Σ_{i=1}^{I} xi / I
  // Concatenate data into matrix form
  W = [w1 - µw, w2 - µw, ..., wI - µw]
  X = [x1 - µx, x2 - µx, ..., xI - µx]
  // Solve for rotation
  [U, L, V] = svd[WXᵀ]
  Ω = VUᵀ
  // Solve for scaling
  ρ = (Σ_{i=1}^{I} (xi - µx)ᵀΩ(wi - µw)) / (Σ_{i=1}^{I} (wi - µw)ᵀ(wi - µw))
  // Solve for translation
  τ = Σ_{i=1}^{I} (xi - ρΩwi) / I
end
  • 48. 48 Models for transformations

Algorithm 15.3: ML learning of affine transformation

The affine transformation model maps one set of 2D points {wi} for i = 1...I to another set {xi} with a linear transformation Φ and an offset τ. To recover these parameters we use the criterion

  Φ̂, τ̂ = argmin_{Φ,τ} - Σ_{i=1}^{I} log Norm_{xi}[Φwi + τ, σ²I].

Algorithm 15.3: Maximum likelihood learning of affine transformation
Input : Training data pairs {xi, wi}
Output: Linear transformation Φ, offset τ, variance σ²
begin
  // Compute intermediate 2×6 matrices Ai
  for i = 1 to I do
    Ai = [wiᵀ, 1, 0ᵀ; 0ᵀ, wiᵀ, 1]
  end
  // Concatenate matrices Ai into 2I×6 matrix A
  A = [A1; A2; ... AI]
  // Concatenate output points into 2I×1 vector c
  c = [x1; x2; ... xI]
  // Solve for linear transformation
  φ = (AᵀA)^{-1} Aᵀ c
  // Extract parameters
  Φ = [φ1, φ2; φ4, φ5]
  τ = [φ3; φ6]
  // Solve for variance
  σ² = Σ_{i=1}^{I} (xi - Φwi - τ)ᵀ(xi - Φwi - τ) / (2I)
end
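The stacked 2I × 6 system of Algorithm 15.3 can equivalently be reorganized as one I × 3 least-squares problem shared by both output coordinates, which keeps the numpy sketch very short. Names are my own; this is a sketch of the closed-form fit only.

```python
import numpy as np

def fit_affine(W, X):
    # W, X: 2 x I arrays with x_i ~ Phi w_i + tau.
    I = W.shape[1]
    A = np.hstack([W.T, np.ones((I, 1))])             # rows [u_i, v_i, 1]
    sol, *_ = np.linalg.lstsq(A, X.T, rcond=None)     # 3 x 2: [Phi^T; tau^T]
    Phi = sol[:2].T
    tau = sol[2]
    return Phi, tau
```

Three non-collinear correspondences determine the six parameters exactly; more points give the least-squares (maximum likelihood) fit.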
  • 49. Models for transformations 49

Algorithm 15.4: ML learning of projective transformation (homography)

The projective transformation model maps one set of 2D points {wi} for i = 1...I to another set {xi} with a non-linear transformation with 3 × 3 parameter matrix Φ. To recover this matrix we use the criterion

  Φ̂ = argmin_Φ - Σ_{i=1}^{I} log Norm_{xi}[proj[wi, Φ], σ²I],

where the function proj[wi, Φ] applies the homography to point wi and is defined as

  proj[wi, Φ] = [(φ11 u + φ12 v + φ13)/(φ31 u + φ32 v + φ33), (φ21 u + φ22 v + φ23)/(φ31 u + φ32 v + φ33)]ᵀ.

Unlike the previous three transformations, it is not possible to minimize this criterion in closed form. The best that we can do is to get an approximate solution and use this to start a non-linear minimization process.

Algorithm 15.4: Maximum likelihood learning of projective transformation
Input : Training data pairs {xi, wi}
Output: Parameter matrix Φ, variance σ²
begin
  // Convert data to homogeneous representation
  for i = 1 to I do
    w̃i = [wi; 1]
  end
  // Compute intermediate 2×9 matrices Ai
  for i = 1 to I do
    Ai = [0, w̃i; -w̃i, 0; yi w̃i, -xi w̃i]ᵀ
  end
  // Concatenate matrices Ai into 2I×9 matrix A
  A = [A1; A2; ... AI]
  // Solve for approximate parameters
  [U, L, V] = svd[A]
  Φ0 = [v19, v29, v39; v49, v59, v69; v79, v89, v99]
  // Refine parameters with non-linear optimization
  Φ̂ = argmin_Φ - Σ_{i=1}^{I} log Norm_{xi}[proj[wi, Φ], σ²I]
end
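The direct linear transform (DLT) stage of Algorithm 15.4 can be sketched as follows. This is my own minimal version: it returns the approximate solution Φ0 (up to scale) and omits the non-linear refinement.

```python
import numpy as np

def fit_homography(W, X):
    # W, X: 2 x I arrays of corresponding points, I >= 4.
    # Each match contributes two rows of the 2I x 9 constraint matrix.
    rows = []
    for (u, v), (x, y) in zip(W.T, X.T):
        wt = np.array([u, v, 1.0])
        rows.append(np.concatenate([np.zeros(3), -wt, y * wt]))
        rows.append(np.concatenate([wt, np.zeros(3), -x * wt]))
    # The homography is the right singular vector for the smallest singular value
    return np.linalg.svd(np.array(rows))[2][-1].reshape(3, 3)

def proj(w, Phi):
    # Apply the homography to a 2D point and renormalize.
    a = Phi @ np.append(w, 1.0)
    return a[:2] / a[2]
```

Since Φ is only defined up to scale, correctness is best checked through reprojection rather than by comparing matrix entries directly.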
  • 50. 50 Models for transformations

Algorithm 15.5: ML Inference for transformation models

Consider a transformation model that maps one set of 2D points {wi} to another set {xi} so that

  Pr(xi|wi, Φ) = Norm_{xi}[trans[wi, Φ], σ²I].

In inference we are given a new data point x = [x, y]ᵀ and wish to compute the most likely point w = [u, v]ᵀ that was responsible for it. To make progress, we consider the transformation model trans[wi, Φ] in homogeneous form

  λ [x; y; 1] = [φ11, φ12, φ13; φ21, φ22, φ23; φ31, φ32, φ33] [u; v; 1],

or x̃ = Φw̃. The Euclidean, similarity, affine and projective transformations can all be expressed as a 3 × 3 matrix of this kind.

Algorithm 15.5: Maximum likelihood inference for transformation models
Input : Transformation parameters Φ, new point x
Output: Point w
begin
  // Convert data to homogeneous representation
  x̃ = [x; 1]
  // Apply inverse transform
  a = Φ^{-1} x̃
  // Convert back to Cartesian coordinates
  w = [a1/a3, a2/a3]ᵀ
end
  • 51. Models for transformations 51

Algorithm 15.6: Learning extrinsic parameters (planar scene)

Consider a calibrated camera with known parameters Λ viewing a planar scene. We are given a set of 2D positions on the plane {wi} for i = 1...I (measured in real world units like cm) and their corresponding 2D pixel positions {xi}. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point in the frame of reference of the plane w = [u, v, w]ᵀ (where w = 0 on the plane) into the frame of reference of the camera. This goal is accomplished by minimizing the criterion

  Ω̂, τ̂ = argmin_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix. The bulk of this algorithm consists of computing a good initialization point for this minimization procedure.

Algorithm 15.6: ML learning of extrinsic parameters (planar scene)
Input : Intrinsic matrix Λ, pairs of points {xi, wi}
Output: Extrinsic parameters: rotation Ω and translation τ
begin
  // Compute homography between pairs of points
  Φ = LearnHomography[{xi}, {wi}]
  // Eliminate effect of intrinsic parameters
  Φ = Λ^{-1} Φ
  // Compute SVD of first two columns of Φ
  [U, L, V] = svd[[φ1, φ2]]
  // Estimate first two columns of rotation matrix
  [ω1, ω2] = [u1, u2] Vᵀ
  // Estimate third column by taking cross product
  ω3 = ω1 × ω2
  Ω = [ω1, ω2, ω3]
  // Check that determinant is not minus 1
  if |Ω| < 0 then Ω = [ω1, ω2, -ω3] end
  // Compute scaling factor for translation vector
  λ = Σ_{i=1}^{3} Σ_{j=1}^{2} (ωij / φij) / 6
  // Compute translation
  τ = λ φ3
  // Refine parameters with non-linear optimization
  Ω̂, τ̂ = argmin_{Ω,τ} Σ_{i=1}^{I} (xi - pinhole[wi, Λ, Ω, τ])ᵀ(xi - pinhole[wi, Λ, Ω, τ])
end
  • 52. 52 Models for transformations

Algorithm 15.7: Learning intrinsic parameters (planar scene)

This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown pose {Ωj, τj}. For each image we know I points {wi}, where wi = [ui, vi, 0]ᵀ, and we know their imaged positions {xij} for i = 1...I, j = 1...J in each of the J scenes. The goal is to compute the intrinsic matrix Λ. To this end, we use the criterion

  Λ̂ = argmin_Λ Σ_{j=1}^{J} min_{Ωj,τj} Σ_{i=1}^{I} (xij - pinhole[wi, Λ, Ωj, τj])ᵀ(xij - pinhole[wi, Λ, Ωj, τj]),

where again the minimization must be carried out while ensuring that Ω is a valid rotation matrix. The strategy is to alternately estimate the extrinsic parameters using the previous algorithm and compute the intrinsic parameters in closed form. After several iterations we use the resulting solution as initial conditions for a non-linear optimization procedure.

Algorithm 15.7: ML learning of intrinsic parameters (planar scene)
Input : World points {wi}, image points {xij}, initial Λ
Output: Intrinsic parameters Λ
begin
  // Main loop for alternating optimization
  for k = 1 to K do
    // Compute extrinsic parameters for each image
    for j = 1 to J do
      [Ωj, τj] = calcExtrinsic[Λ, {wi, xij}]
    end
    // Compute intrinsic parameters
    for i = 1 to I do
      for j = 1 to J do
        // Compute matrix Aij  (ω_k•j is the kth row of Ωj; τzj is the z component of τj)
        aij = (ωᵀ_1•j wi + τxj) / (ωᵀ_3•j wi + τzj)
        bij = (ωᵀ_2•j wi + τyj) / (ωᵀ_3•j wi + τzj)
        Aij = [aij, bij, 1, 0, 0; 0, 0, 0, bij, 1]
      end
    end
    // Concatenate matrices and data points
    x = [x11; x12; ... xIJ]
    A = [A11; A12; ... AIJ]
    // Compute parameters
    θ = (AᵀA)^{-1} Aᵀ x
    Λ = [θ1, θ2, θ3; 0, θ4, θ5; 0, 0, 1]
  end
  // Refine parameters with non-linear optimization
  Λ̂ = argmin_Λ Σ_j min_{Ωj,τj} Σ_i (xij - pinhole[wi, Λ, Ωj, τj])ᵀ(xij - pinhole[wi, Λ, Ωj, τj])
end
  • 53. Models for transformations 53

Algorithm 15.8: Robust learning of projective transformation with RANSAC

The goal of this algorithm is to fit a homography that maps one set of 2D points {wi} for i = 1...I to another set {xi} in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches and the outliers.

The algorithm uses the RANSAC procedure: it repeatedly computes the homography based on a minimal subset of matches. Since there are 8 unknowns in the 3 × 3 matrix that defines the homography, and each match provides two linear constraints (due to the x- and y-coordinates), we need a minimum of four matches to compute the homography. The RANSAC procedure chooses these four matches randomly, computes the homography, and then looks for the amount of agreement in the rest of the dataset. After many iterations of this procedure, we recompute the homography based on the randomly chosen matches with the best agreement and the points that agreed with them (the inliers).

Algorithm 15.8: Robust ML learning of homography
Input : Point pairs {xi, wi}, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices B
begin
  // Initialize best inlier set to empty
  B = {}
  for n = 1 to N do
    // Draw 4 different random integers between 1 and I
    R = RandomSubset[{1...I}, 4]
    // Compute homography (algorithm 15.4)
    Φn = LearnHomography[{xi}_{i∈R}, {wi}_{i∈R}]
    // Initialize set of inliers to empty
    Sn = {}
    for i = 1 to I do
      // Compute squared distance
      d = (xi - proj[wi, Φn])ᵀ(xi - proj[wi, Φn])
      // If small enough then add to inliers
      if d < τ² then Sn = Sn ∪ {i} end
    end
    // If best inlier set so far then store
    if |Sn| > |B| then B = Sn end
  end
  // Compute homography from all inliers
  Φ = LearnHomography[{xi}_{i∈B}, {wi}_{i∈B}]
end
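Algorithm 15.8 can be sketched in numpy as below. This is my own minimal version: a compact DLT fit is included so that the sketch is self-contained, and the threshold and step count defaults are arbitrary choices, not values from the book.

```python
import numpy as np

def fit_homography(W, X):
    # Minimal DLT fit (algorithm 15.4); returns Phi up to scale.
    rows = []
    for (u, v), (x, y) in zip(W.T, X.T):
        wt = np.array([u, v, 1.0])
        rows.append(np.concatenate([np.zeros(3), -wt, y * wt]))
        rows.append(np.concatenate([wt, np.zeros(3), -x * wt]))
    return np.linalg.svd(np.array(rows))[2][-1].reshape(3, 3)

def ransac_homography(W, X, n_steps=200, tau=1e-3, seed=0):
    # W, X: 2 x I point arrays. Returns the homography refit from the best
    # inlier set, together with the inlier indices.
    rng = np.random.default_rng(seed)
    I = W.shape[1]
    best = []
    for _ in range(n_steps):
        R = rng.choice(I, size=4, replace=False)       # minimal subset
        Phi = fit_homography(W[:, R], X[:, R])
        S = []
        for i in range(I):                             # collect agreeing matches
            a = Phi @ np.append(W[:, i], 1.0)
            err = np.sum((X[:, i] - a[:2] / a[2]) ** 2)
            if err < tau ** 2:
                S.append(i)
        if len(S) > len(best):
            best = S
    return fit_homography(W[:, best], X[:, best]), best
```

The number of steps N trades off robustness against computation: the probability of never drawing an all-inlier minimal subset decays geometrically with N.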
  • 54. 54 Models for transformations

Algorithm 15.9: Sequential RANSAC for fitting homographies

Sequential RANSAC fits K homographies to disjoint subsets of point pairs {wi, xi} for i = 1...I. This procedure is greedy: the algorithm fits the first homography, then removes its inliers from the set of point pairs and tries to fit a second homography to the remaining points. In principle, this algorithm can find a set of matching planes between two images. However, in practice it often makes mistakes. It does not exploit information about the spatial coherence of matches and it cannot recover from mistakes in the greedy matching procedure.

Algorithm 15.9: Robust sequential learning of homographies
Input : Points {xi, wi}, RANSAC steps N, inlier threshold τ, number of homographies K
Output: K homographies Φk and associated inlier indices Ik
begin
  // Initialize set of indices of remaining point pairs
  S = {1...I}
  for k = 1 to K do
    // Compute homography using RANSAC (algorithm 15.8)
    [Φk, Ik] = LearnHomographyRobust[{xi}_{i∈S}, {wi}_{i∈S}, N, τ]
    // Remove inliers from remaining points
    S = S \ Ik
    // Check that there are enough remaining points
    if |S| < 4 then break end
  end
end
  • 55. Models for transformations 55

Algorithm 15.10: PEaRL for fitting homographies

The propose, expand and re-learn (PEaRL) algorithm attempts to make up for the deficiencies of sequential RANSAC for fitting homographies. It first proposes a large number of possible homographies relating point pairs {wi, xi}. These then compete for the point pairs to be assigned to them, and they are re-learnt based on these assignments. The algorithm has a spatial component that encourages nearby points to belong to the same model, and it is iterative rather than greedy and so can recover from errors.

Algorithm 15.10: PEaRL learning of homographies
Input : Point pairs {xi, wi}, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {Ni}, pairwise cost P
Output: Set of homographies Φm and associated inlier indices Im
begin
  // Propose Step: generate M hypotheses
  m = 1  // hypothesis number
  repeat
    // Draw 4 different random integers between 1 and I
    R = RandomSubset[{1...I}, 4]
    // Compute homography (algorithm 15.4)
    Φm = LearnHomography[{xi}_{i∈R}, {wi}_{i∈R}]
    Im = {}  // Initialize inlier set to empty
    for i = 1 to I do
      dim = (xi - proj[wi, Φm])ᵀ(xi - proj[wi, Φm])
      if dim < τ² then
        // If distance small, add to inliers
        Im = Im ∪ {i}
      end
    end
    if |Im| ≥ l then
      // If enough inliers, get next hypothesis
      m = m + 1
    end
  until m > M
  for j = 1 to J do
    // Expand Step: returns I × 1 label vector l
    l = AlphaExpand[D, P, {Ni}]
    // Re-Learn Step: re-estimate homographies with support
    for m = 1 to M do
      Im = find[l == m]  // Extract points with label m
      // If enough support then re-learn, update distances
      if |Im| ≥ 4 then
        Φm = LearnHomography[{xi}_{i∈Im}, {wi}_{i∈Im}]
        for i = 1 to I do
          dim = (xi - proj[wi, Φm])ᵀ(xi - proj[wi, Φm])
        end
      end
    end
  end
end
  • 56. 56 Multiple cameras

Algorithm 16.1: Camera geometry from point matches

This algorithm finds approximate estimates of the rotation and translation (up to scale) between two cameras, given a set of I point matches {xi1, xi2} between two images. More precisely, the algorithm assumes that the first camera is at the world origin and recovers the extrinsic parameters of the second camera. There is a fourfold ambiguity in the possible solution due to the symmetry of the camera model: it allows for points that are behind the camera to be imaged, although this is clearly not possible in the real world. This algorithm distinguishes between the four solutions by reconstructing all of the points with each and choosing the solution where the largest number are in front of both cameras.

Algorithm 16.1: Extracting relative camera position from point matches
Input : Point pairs {xi1, xi2}, intrinsic matrices Λ1, Λ2
Output: Rotation Ω, translation τ between cameras
begin
  // Compute fundamental matrix (algorithm 16.2)
  F = ComputeFundamental[{xi1, xi2}]
  // Compute essential matrix
  E = Λ2ᵀ F Λ1
  // Extract four possible rotations and translations from E
  W = [0, -1, 0; 1, 0, 0; 0, 0, -1]
  [U, L, V] = svd[E]
  τ1 = ULWUᵀ;  Ω1 = UW^{-1}Vᵀ
  τ2 = ULW^{-1}Uᵀ;  Ω2 = UWVᵀ
  τ3 = -τ1;  Ω3 = Ω1
  τ4 = -τ2;  Ω4 = Ω2
  // For each possibility
  for k = 1 to 4 do
    tk = 0  // number of points in front of camera for kth solution
    // For each point
    for i = 1 to I do
      // Reconstruct point (algorithm 14.3); first camera at origin (Ω = I, τ = 0)
      w = Reconstruct[xi1, xi2, Λ1, Λ2, I, 0, Ωk, τk]
      // Compute point in frame of reference of second camera
      w′ = Ωk w + τk
      // Test if point reconstructed in front of both cameras
      if w3 > 0 && w′3 > 0 then tk = tk + 1 end
    end
  end
  // Choose solution with most support
  k̂ = argmax_k [tk]
  Ω = Ωk̂;  τ = τk̂
end
  • 57. Multiple cameras 57

Algorithm 16.2: Eight point algorithm for fundamental matrix

This algorithm takes a set of I ≥ 8 point correspondences {xi1, xi2} between two images and computes the fundamental matrix using the 8 point algorithm. To improve the numerical stability of the algorithm, the point positions are transformed to have zero mean and spherical covariance before the calculation proceeds. The resulting fundamental matrix is modified to compensate for this transformation. This algorithm is usually used to compute an initial estimate for a subsequent non-linear optimization of the symmetric epipolar distance.

Algorithm 16.2: Eight point algorithm for fundamental matrix
Input : Point pairs {xi1, xi2}
Output: Fundamental matrix F
begin
  // Compute statistics of data
  µ1 = Σ_{i=1}^{I} xi1 / I
  Σ1 = Σ_{i=1}^{I} (xi1 - µ1)(xi1 - µ1)ᵀ / I
  µ2 = Σ_{i=1}^{I} xi2 / I
  Σ2 = Σ_{i=1}^{I} (xi2 - µ2)(xi2 - µ2)ᵀ / I
  for i = 1 to I do
    // Compute transformed coordinates
    xi1 = Σ1^{-1/2}(xi1 - µ1)
    xi2 = Σ2^{-1/2}(xi2 - µ2)
    // Compute constraint
    Ai = [xi2 xi1, xi2 yi1, xi2, yi2 xi1, yi2 yi1, yi2, xi1, yi1, 1]
  end
  // Append constraints and solve
  A = [A1; A2; ... AI]
  [U, L, V] = svd[A]
  F = [v19, v29, v39; v49, v59, v69; v79, v89, v99]
  // Compensate for transformation
  T1 = [Σ1^{-1/2}, -Σ1^{-1/2}µ1; 0, 0, 1]
  T2 = [Σ2^{-1/2}, -Σ2^{-1/2}µ2; 0, 0, 1]
  F = T2ᵀ F T1
  // Ensure that matrix has rank 2
  [U, L, V] = svd[F]
  l33 = 0
  F = ULVᵀ
end
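A numpy sketch of the eight point algorithm follows. This is my own minimal version: it uses isotropic normalization (zero mean, unit RMS scale) rather than the full covariance whitening of Algorithm 16.2, which is a common simplification of the same conditioning idea.

```python
import numpy as np

def eight_point(X1, X2):
    # X1, X2: 2 x I arrays of corresponding points (I >= 8).
    # Returns F satisfying x2~^T F x1~ = 0 for homogeneous points x~.
    def normalize(X):
        mu = X.mean(axis=1, keepdims=True)
        s = np.sqrt(((X - mu) ** 2).sum(axis=0).mean())   # RMS distance to mean
        T = np.array([[1/s, 0, -mu[0, 0]/s],
                      [0, 1/s, -mu[1, 0]/s],
                      [0, 0, 1.0]])
        return T @ np.vstack([X, np.ones(X.shape[1])]), T
    P1, T1 = normalize(X1)
    P2, T2 = normalize(X2)
    # One epipolar constraint per correspondence
    A = np.array([[x2*x1, x2*y1, x2, y2*x1, y2*y1, y2, x1, y1, 1.0]
                  for (x1, y1, _), (x2, y2, _) in zip(P1.T, P2.T)])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, L, Vt = np.linalg.svd(F)                           # enforce rank 2
    F = U @ np.diag([L[0], L[1], 0.0]) @ Vt
    return T2.T @ F @ T1                                  # undo normalization
```

With noise-free correspondences from a non-degenerate (non-coplanar) scene, the recovered F satisfies the epipolar constraint to machine precision.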
  • 58. 58 Multiple cameras

Algorithm 16.3: Robust computation of fundamental matrix with RANSAC

The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {xi1, xi2} in the case where some of the point matches are known to be wrong (outliers). The robustness is achieved by applying the RANSAC algorithm. Since the fundamental matrix has eight unknown quantities (it is only defined up to scale), we randomly select eight point pairs at each stage of the algorithm (each pair contributes one constraint). The algorithm also returns the true matches.

Algorithm 16.3: Robust ML fitting of fundamental matrix
Input : Point pairs {xi1, xi2}, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, set of inlier indices B
begin
  // Initialize best inlier set to empty
  B = {}
  for n = 1 to N do
    // Draw 8 different random integers between 1 and I
    R = RandomSubset[{1...I}, 8]
    // Compute fundamental matrix (algorithm 16.2)
    Fn = ComputeFundamental[{xi1}_{i∈R}, {xi2}_{i∈R}]
    // Initialize set of inliers to empty
    Sn = {}
    for i = 1 to I do
      // Compute epipolar line in first image
      x̃i2 = [xi2; 1]
      l1 = x̃i2ᵀ Fn
      // Compute squared distance to epipolar line
      d1 = (l11 xi1 + l12 yi1 + l13)² / (l11² + l12²)
      // Compute epipolar line in second image
      x̃i1 = [xi1; 1]
      l2 = Fn x̃i1
      // Compute squared distance to epipolar line
      d2 = (l21 xi2 + l22 yi2 + l23)² / (l21² + l22²)
      // If small enough then add to inliers
      if (d1 < τ²) && (d2 < τ²) then Sn = Sn ∪ {i} end
    end
    // If best inlier set so far then store
    if |Sn| > |B| then B = Sn end
  end
  // Compute fundamental matrix from all inliers
  F = ComputeFundamental[{xi1}_{i∈B}, {xi2}_{i∈B}]
end
Algorithm 16.4: Planar rectification

This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity along the x-axis. The homography for the first image is chosen so that the matches lie on the same horizontal lines as in the transformed second image and the distance between the matches is smallest in a least squares sense (i.e., the disparity is smallest).

Algorithm 16.4: Planar rectification
Input : Point pairs {x_i1, x_i2}_{i=1..I}
Output: Homographies Φ_1, Φ_2 to transform first and second images
begin
  // Compute fundamental matrix (Algorithm 16.2)
  F = ComputeFundamental[{x_i1, x_i2}_{i=1..I}]
  // Compute epipole in image 2
  [U, L, V] = svd[F]
  e = [u_13, u_23, u_33]^T
  // Compute three transformation matrices
  T1 = [1, 0, −δ_x; 0, 1, −δ_y; 0, 0, 1]        // translate principal point (δ_x, δ_y) to origin
  θ = atan2[e_y − δ_y, e_x − δ_x]
  T2 = [cos[θ], sin[θ], 0; −sin[θ], cos[θ], 0; 0, 0, 1]
  d = cos[θ](e_x − δ_x) + sin[θ](e_y − δ_y)     // x-coordinate of transformed epipole
  T3 = [1, 0, 0; 0, 1, 0; −1/d, 0, 1]
  // Compute homography for second image
  Φ_2 = T3 T2 T1
  // Compute factorization of fundamental matrix
  L̂ = diag[l_11, l_22, (l_11 + l_22)/2]
  W = [0, −1, 0; 1, 0, 0; 0, 0, 1]
  M = U L̂ W V^T
  // Prepare matrices for solution for Φ_1
  for i=1 to I do
    x'_i1 = hom[x_i1, Φ_2 M]     // transform points
    x'_i2 = hom[x_i2, Φ_2]
    // Create elements of A and b
    A_i = [x'_i1, y'_i1, 1]
    b_i = x'_i2
  end
  // Concatenate elements of A and b
  A = [A_1; A_2; ... A_I]
  b = [b_1; b_2; ... b_I]
  // Solve for α
  α = (A^T A)^{−1} A^T b
  // Calculate homography for first image
  Φ_1 = (I + [1, 0, 0]^T α^T) Φ_2 M
end
Models for shape

Algorithm 17.1: Generalized Procrustes analysis

The goal of generalized Procrustes analysis is to align a set of shape vectors {w_i}_{i=1..I} with respect to a given transformation family (Euclidean, similarity, affine, etc.). Each shape vector consists of a set of N 2D points, w_i = [w_i1^T, w_i2^T, ..., w_iN^T]^T. In the algorithm below, we use the example of registering with respect to a similarity transformation, which consists of a rotation Ω, a scaling ρ, and a translation τ.

Algorithm 17.1: Generalized Procrustes analysis
Input : Shape vectors {w_i}_{i=1..I}, number of iterations K
Output: Template w̃, transformations {Ω_i, ρ_i, τ_i}_{i=1..I}
begin
  Initialize w̃ = w_1
  // Main iteration loop
  for k=1 to K do
    // For each shape
    for i=1 to I do
      // Compute transformation to template (Algorithm 15.2)
      [Ω_i, ρ_i, τ_i] = EstimateSimilarity[{w̃_n}_{n=1..N}, {w_in}_{n=1..N}]
    end
    // Update template (average of inverse transforms)
    w̃_n = sum_{i=1}^{I} Ω_i^T (w_in − τ_i) / (ρ_i I)
    // Normalize template
    w̃ = w̃ / |w̃|
  end
end
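A possible NumPy sketch of the alignment loop follows, substituting a closed-form similarity fit (in the Kabsch/Umeyama style) for the book's EstimateSimilarity routine; that substitution, and all names below, are my own assumptions. The loop transforms each shape onto the current template and averages, which matches the spirit of the inverse-transform average in Algorithm 17.1.

```python
# Generalized Procrustes alignment sketch for 2D shapes.
import numpy as np

def fit_similarity(template, w):
    """Closed-form rho, Omega, tau with rho * Omega @ w_n + tau ~ template_n
    for N x 2 point arrays."""
    mu_t, mu_w = template.mean(0), w.mean(0)
    A = (template - mu_t).T @ (w - mu_w)       # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(A)
    D = np.diag([1, np.sign(np.linalg.det(U @ Vt))])  # force a proper rotation
    Omega = U @ D @ Vt
    rho = np.trace(np.diag(S) @ D) / ((w - mu_w) ** 2).sum()
    tau = mu_t - rho * Omega @ mu_w
    return Omega, rho, tau

def procrustes(shapes, n_iters=10):
    """shapes: list of N x 2 arrays. Returns the aligned template."""
    template = shapes[0].astype(float).copy()
    for _ in range(n_iters):
        acc = np.zeros_like(template)
        for w in shapes:
            Omega, rho, tau = fit_similarity(template, w)
            acc += rho * (w @ Omega.T) + tau   # shape aligned to template
        template = acc / len(shapes)
        template /= np.linalg.norm(template)   # fix the template's scale
    return template
```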
Algorithm 17.2: Probabilistic principal components analysis

The probabilistic principal components analysis (PPCA) algorithm describes a set of I D×1 data examples {x_i}_{i=1..I} with the model

  Pr(x_i) = Norm_{x_i}[μ, ΦΦ^T + σ²I],

where μ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K dimensional subspace, and the parameter σ² explains the variation of the data around this subspace. Notice that this model is very similar to factor analysis (see Algorithm 6.3). The only difference is that here we have spherical additive noise σ²I rather than a diagonal noise component Σ. This small change has important ramifications for the learning algorithm; we no longer need an iterative learning procedure based on the EM algorithm and can instead learn the parameters in closed form.

Algorithm 17.2: ML learning of PPCA model
Input : Training data {x_i}_{i=1..I}, number of principal components K
Output: Parameters μ, Φ, σ²
begin
  // Estimate mean parameter
  μ = sum_{i=1}^{I} x_i / I
  // Form matrix of zero-mean data
  X = [x_1 − μ, x_2 − μ, ..., x_I − μ]
  // Decompose X into matrices U, L, V
  [V L V^T] = svd[X^T X]
  U = X V L^{−1/2}
  // Estimate noise parameter
  σ² = sum_{j=K+1}^{D} l_jj / (D − K)
  // Estimate principal components
  U_K = [u_1, u_2, ..., u_K]
  L_K = diag[l_11, l_22, ..., l_KK]
  Φ = U_K (L_K − σ²I)^{1/2}
end
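Because the PPCA fit is closed form, it is short in NumPy. The sketch below obtains the eigen-structure from an SVD of the centred data matrix rather than of X^T X (an equivalent route that is better conditioned); names are my own and this is illustrative rather than reference code.

```python
# Closed-form ML fit of the PPCA model (Algorithm 17.2).
import numpy as np

def ppca(X, K):
    """X: D x I data matrix (columns are examples). Returns mu, Phi, sigma_sq."""
    D, I = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Singular values of the data give eigenvalues of the sample covariance
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = s ** 2 / I
    # Residual variance: average of the discarded eigenvalues
    sigma_sq = lam[K:].sum() / (D - K)
    # Principal components, shrunk by the noise variance
    Phi = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma_sq))
    return mu, Phi, sigma_sq
```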
Models for style and identity

Algorithm 18.1: ML learning of subspace identity model

This model describes the jth of J data examples from the ith of I identities as

  x_ij = μ + Φ h_i + ε_ij,

where x_ij is the D×1 observed data vector, μ is the D×1 mean vector, Φ is the D×K factor matrix, h_i is the K×1 hidden variable representing the identity, and ε_ij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.

Algorithm 18.1: Maximum likelihood learning for identity subspace model
Input : Training data {x_ij}_{i=1..I, j=1..J}, number of factors K
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Σ}
begin
  Initialize θ = θ_0  (a)
  // Set mean
  μ = sum_{i=1}^{I} sum_{j=1}^{J} x_ij / (IJ)
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (J Φ^T Σ^{−1} Φ + I)^{−1} Φ^T Σ^{−1} sum_{j=1}^{J} (x_ij − μ)
      E[h_i h_i^T] = (J Φ^T Σ^{−1} Φ + I)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    Φ = ( sum_{i=1}^{I} sum_{j=1}^{J} (x_ij − μ) E[h_i]^T ) ( sum_{i=1}^{I} J E[h_i h_i^T] )^{−1}
    Σ = (1/IJ) sum_{i=1}^{I} sum_{j=1}^{J} diag[ (x_ij − μ)(x_ij − μ)^T − Φ E[h_i](x_ij − μ)^T ]
    // Compute data log likelihood
    for i=1 to I do
      x'_i = [x_i1^T, x_i2^T, ..., x_iJ^T]^T     // compound data vector, JD×1
    end
    μ' = [μ^T, μ^T, ..., μ^T]^T                  // compound mean vector, JD×1
    Φ' = [Φ^T, Φ^T, ..., Φ^T]^T                  // compound factor matrix, JD×K
    Σ' = diag[Σ, Σ, ..., Σ]                      // compound covariance, JD×JD
    L = sum_{i=1}^{I} log Norm_{x'_i}[μ', Φ' Φ'^T + Σ']  (b)
  until no further improvement in L
end

(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating the covariance of this matrix using the matrix inversion lemma.
Algorithm 18.2: ML learning of PLDA model

PLDA describes the jth of J data examples from the ith of I identities as

  x_ij = μ + Φ h_i + Ψ s_ij + ε_ij,

where all terms are the same as in the subspace identity model, but now we add Ψ, the D×L within-individual factor matrix, and s_ij, the L×1 style variable.

Algorithm 18.2: Maximum likelihood learning for PLDA model
Input : Training data {x_ij}_{i=1..I, j=1..J}, numbers of factors K, L
Output: Maximum likelihood estimates of parameters θ = {μ, Φ, Ψ, Σ}
begin
  Initialize θ = θ_0  (a)
  // Set mean
  μ = sum_{i=1}^{I} sum_{j=1}^{J} x_ij / (IJ)
  repeat
    μ' = [μ^T, μ^T, ..., μ^T]^T       // compound mean vector, JD×1
    Φ' = [Φ^T, Φ^T, ..., Φ^T]^T       // compound factor matrix 1, JD×K
    Ψ' = diag[Ψ, Ψ, ..., Ψ]           // compound factor matrix 2, JD×JL
    Φ'' = [Φ', Ψ']                    // concatenated matrices, JD×(K+JL)
    Σ' = diag[Σ, Σ, ..., Σ]           // compound covariance, JD×JD
    // Expectation step
    for i=1 to I do
      x'_i = [x_i1^T, x_i2^T, ..., x_iJ^T]^T    // compound data vector, JD×1
      μ_hi = (Φ''^T Σ'^{−1} Φ'' + I)^{−1} Φ''^T Σ'^{−1} (x'_i − μ')
      Σ_hi = (Φ''^T Σ'^{−1} Φ'' + I)^{−1}
      for j=1 to J do
        S_ij = [1...K, K+(j−1)L+1 ... K+jL]
        E[h_ij] = μ_hi(S_ij)                                   // extract subvector of mean
        E[h_ij h_ij^T] = Σ_hi(S_ij, S_ij) + E[h_ij] E[h_ij]^T  // extract submatrix of covariance
      end
    end
    // Maximization step
    Φ̂ = ( sum_{i=1}^{I} sum_{j=1}^{J} (x_ij − μ) E[h_ij]^T ) ( sum_{i=1}^{I} sum_{j=1}^{J} E[h_ij h_ij^T] )^{−1}
    Σ = (1/IJ) sum_{i=1}^{I} sum_{j=1}^{J} diag[ (x_ij − μ)(x_ij − μ)^T − Φ̂ E[h_ij](x_ij − μ)^T ]
    Φ = Φ̂(:, 1:K)        // extract original factor matrix
    Ψ = Φ̂(:, K+1:K+L)    // extract other factor matrix
    // Compute data log likelihood
    L = sum_{i=1}^{I} log Norm_{x'_i}[μ', Φ'' Φ''^T + Σ']
  until no further improvement in L
end

(a) Initialize Ψ to random values; other variables as in the identity subspace model.
Algorithm 18.3: ML learning of asymmetric bilinear model

This model describes the jth of J data examples from the ith of I identities in the sth of S styles as

  x_ijs = μ_s + Φ_s h_i + ε_ijs,

where the terms have the same interpretation as in the subspace identity model, except that now there is one set of parameters θ_s = {μ_s, Φ_s, Σ_s} per style s.

Algorithm 18.3: Maximum likelihood learning for asymmetric bilinear model
Input : Training data {x_ijs}_{i=1..I, j=1..J, s=1..S}, number of factors K
Output: ML estimates of parameters θ = {μ_{1..S}, Φ_{1..S}, Σ_{1..S}}
begin
  Initialize θ = θ_0
  // Set means
  for s=1 to S do
    μ_s = sum_{i=1}^{I} sum_{j=1}^{J} x_ijs / (IJ)
  end
  repeat
    // Expectation step
    for i=1 to I do
      E[h_i] = (I + J sum_{s=1}^{S} Φ_s^T Σ_s^{−1} Φ_s)^{−1} sum_{s=1}^{S} Φ_s^T Σ_s^{−1} sum_{j=1}^{J} (x_ijs − μ_s)
      E[h_i h_i^T] = (I + J sum_{s=1}^{S} Φ_s^T Σ_s^{−1} Φ_s)^{−1} + E[h_i] E[h_i]^T
    end
    // Maximization step
    for s=1 to S do
      Φ_s = ( sum_{i=1}^{I} sum_{j=1}^{J} (x_ijs − μ_s) E[h_i]^T ) ( sum_{i=1}^{I} J E[h_i h_i^T] )^{−1}
      Σ_s = (1/IJ) sum_{i=1}^{I} sum_{j=1}^{J} diag[ (x_ijs − μ_s)(x_ijs − μ_s)^T − Φ_s E[h_i](x_ijs − μ_s)^T ]
    end
    // Compute data log likelihood
    for s=1 to S do
      Φ'_s = [Φ_s^T, Φ_s^T, ..., Φ_s^T]^T    // JD×K
      Σ'_s = diag[Σ_s, Σ_s, ..., Σ_s]        // JD×JD
      for i=1 to I do
        x'_is = [x_i1s^T, x_i2s^T, ..., x_iJs^T]^T
      end
    end
    for i=1 to I do
      x''_i = [x'_i1^T, x'_i2^T, ..., x'_iS^T]^T     // compound data vector, JSD×1
    end
    μ'' = [μ'_1^T, μ'_2^T, ..., μ'_S^T]^T            // compound mean vector, JSD×1
    Φ'' = [Φ'_1^T, Φ'_2^T, ..., Φ'_S^T]^T            // compound factor matrix, JSD×K
    Σ'' = diag[Σ'_1, Σ'_2, ..., Σ'_S]                // compound covariance, JSD×JSD
    L = sum_{i=1}^{I} log Norm_{x''_i}[μ'', Φ'' Φ''^T + Σ'']
  until no further improvement in L
end
Algorithm 18.4: Style translation with asymmetric bilinear model

To translate a data example from one style to another, we first estimate the hidden variable associated with the example and then use the generative equation to simulate the new style. We cannot know the hidden variable for certain, but we can compute its posterior distribution, which has a Gaussian form, and then choose the MAP solution, which is the mean of this Gaussian.

Algorithm 18.4: Style translation with asymmetric bilinear model
Input : Example x in style s1, model parameters θ
Output: Prediction for data x* in style s2
begin
  // Estimate hidden variable
  E[h] = (I + Φ_{s1}^T Σ_{s1}^{−1} Φ_{s1})^{−1} Φ_{s1}^T Σ_{s1}^{−1} (x − μ_{s1})
  // Predict in different style
  x* = μ_{s2} + Φ_{s2} E[h]
end
Temporal models

Algorithm 19.1: Kalman filter

To define the Kalman filter, we must specify the temporal and measurement models. First, the temporal model relates the states w at times t−1 and t and is given by

  Pr(w_t | w_{t−1}) = Norm_{w_t}[μ_p + Ψ w_{t−1}, Σ_p],

where μ_p is a D_w×1 vector, which represents the mean change in the state, and Ψ is a D_w×D_w matrix, which relates the mean of the state at time t to the state at time t−1. This is known as the transition matrix. The transition noise Σ_p determines how closely related the states are at times t and t−1. Second, the measurement model relates the data x_t at time t to the state w_t:

  Pr(x_t | w_t) = Norm_{x_t}[μ_m + Φ w_t, Σ_m],

where μ_m is a D_x×1 mean vector and Φ is a D_x×D_w matrix relating the D_x×1 measurement vector to the D_w×1 state. The measurement noise Σ_m defines additional uncertainty on the measurements that cannot be explained by the state.

The Kalman filter is a set of rules for computing the marginal posterior probability Pr(w_t | x_{1...t}) based on a normally distributed estimate of the marginal posterior probability Pr(w_{t−1} | x_{1...t−1}) at the previous time and a new measurement x_t. In this algorithm we denote the mean of the posterior marginal probability by μ_{t−1} and the variance by Σ_{t−1}.

Algorithm 19.1: The Kalman filter
Input : Measurements {x_t}_{t=1..T}, temporal params μ_p, Ψ, Σ_p, measurement params μ_m, Φ, Σ_m
Output: Means {μ_t}_{t=1..T} and covariances {Σ_t}_{t=1..T} of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0   // typically set to large multiple of identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = μ_p + Ψ μ_{t−1}
    // Covariance prediction
    Σ_+ = Σ_p + Ψ Σ_{t−1} Ψ^T
    // Compute Kalman gain
    K = Σ_+ Φ^T (Σ_m + Φ Σ_+ Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K (x_t − μ_m − Φ μ_+)
    // Covariance update
    Σ_t = (I − K Φ) Σ_+
  end
end
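The update rules of Algorithm 19.1 transcribe almost line for line into NumPy. Parameter names below mirror the text (mup/Psi/Sigp for the temporal model, mum/Phi/Sigm for the measurement model); this is a sketch, not reference code.

```python
# NumPy transcription of the Kalman filter recursions (Algorithm 19.1).
import numpy as np

def kalman_filter(xs, mup, Psi, Sigp, mum, Phi, Sigm, mu0, Sig0):
    """xs: iterable of measurement vectors. Returns filtered means and covs."""
    mu, Sig = mu0, Sig0
    means, covs = [], []
    for x in xs:
        # State and covariance prediction
        mu_pred = mup + Psi @ mu
        Sig_pred = Sigp + Psi @ Sig @ Psi.T
        # Kalman gain
        K = Sig_pred @ Phi.T @ np.linalg.inv(Sigm + Phi @ Sig_pred @ Phi.T)
        # State and covariance update
        mu = mu_pred + K @ (x - mum - Phi @ mu_pred)
        Sig = (np.eye(len(mu)) - K @ Phi) @ Sig_pred
        means.append(mu)
        covs.append(Sig)
    return means, covs
```

For example, with a 1-D static state observed in noise, the filtered mean converges toward the true value while the posterior variance shrinks.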
Algorithm 19.2: Fixed interval Kalman smoother

The fixed interval smoother consists of a backward set of recursions that estimate the marginal posterior distributions Pr(w_t | x_{1...T}) of the state at each time step, taking into account all of the measurements x_{1...T}. In these recursions, the marginal posterior distribution Pr(w_t | x_{1...T}) of the state at time t is updated, and, based on this result, the marginal posterior Pr(w_{t−1} | x_{1...T}) at time t−1 is updated, and so on. In the algorithm, we denote the mean and variance of the marginal posterior Pr(w_t | x_{1...T}) at time t by μ_{t|T} and Σ_{t|T}, respectively. The notation μ_{+|t} and Σ_{+|t} denotes the mean and variance of the predictive distribution Pr(w_t | x_{1...t−1}) of the state at time t based on the measurements up to time t−1 (i.e., what we denoted by μ_+ and Σ_+ during the forward Kalman filter recursions).

Algorithm 19.2: Fixed interval Kalman smoother
Input : Means and variances {μ_{t|t}, Σ_{t|t}, μ_{+|t}, Σ_{+|t}}_{t=1..T}, temporal param Ψ
Output: Means {μ_{t|T}}_{t=1..T} and covariances {Σ_{t|T}}_{t=1..T} of marginal posterior distributions
begin
  // For each time step, working backward
  for t=T−1 to 1 do
    // Compute gain matrix
    C_t = Σ_{t|t} Ψ^T Σ_{+|t+1}^{−1}
    // Compute mean
    μ_{t|T} = μ_{t|t} + C_t (μ_{t+1|T} − μ_{+|t+1})
    // Compute variance
    Σ_{t|T} = Σ_{t|t} + C_t (Σ_{t+1|T} − Σ_{+|t+1}) C_t^T
  end
end
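Assuming the forward pass has stored both the filtered moments and the one-step predictions, the backward recursions might be transcribed as follows. The indexing convention (pred_means[t] holds the prediction μ_{+|t} for step t) and all names are my own choices; this is a sketch, not reference code.

```python
# Backward recursions of the fixed-interval (RTS) smoother (Algorithm 19.2).
import numpy as np

def rts_smoother(means, covs, pred_means, pred_covs, Psi):
    """means[t], covs[t]: filtered moments mu_{t|t}, Sig_{t|t}.
    pred_means[t], pred_covs[t]: predictions mu_{+|t}, Sig_{+|t}."""
    T = len(means)
    sm_means, sm_covs = [None] * T, [None] * T
    sm_means[-1], sm_covs[-1] = means[-1], covs[-1]
    for t in range(T - 2, -1, -1):
        # Gain matrix from filtered covariance and the next prediction
        C = covs[t] @ Psi.T @ np.linalg.inv(pred_covs[t + 1])
        sm_means[t] = means[t] + C @ (sm_means[t + 1] - pred_means[t + 1])
        sm_covs[t] = covs[t] + C @ (sm_covs[t + 1] - pred_covs[t + 1]) @ C.T
    return sm_means, sm_covs
```

The smoothed covariances never exceed the filtered ones, since each state estimate now also benefits from future measurements.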
Algorithm 19.3: Extended Kalman filter

The extended Kalman filter (EKF) is designed to cope with more general temporal models, where the relationship between the states at time t is an arbitrary nonlinear function f[•,•] of the state at the previous time step and a stochastic contribution ε_p:

  w_t = f[w_{t−1}, ε_p],

where the covariance of the noise term ε_p is Σ_p as before. Similarly, it can cope with a nonlinear relationship g[•,•] between the state and the measurements:

  x_t = g[w_t, ε_m],

where the covariance of ε_m is Σ_m. The extended Kalman filter works by taking linear approximations to the nonlinear functions at the peak μ_t of the current estimate using the Taylor expansion. We define the Jacobian matrices

  Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0}     Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0}
  Φ = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0}                  Υ_m = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0},

where |_{μ_+,0} denotes that the derivative is computed at the position w = μ_+ and ε = 0.

Algorithm 19.3: The extended Kalman filter
Input : Measurements {x_t}_{t=1..T}, temporal function f[•,•], measurement function g[•,•]
Output: Means {μ_t}_{t=1..T} and covariances {Σ_t}_{t=1..T} of marginal posterior distributions
begin
  // Initialize mean and covariance
  μ_0 = 0
  Σ_0 = Σ_0   // typically set to large multiple of identity
  // For each time step
  for t=1 to T do
    // State prediction
    μ_+ = f[μ_{t−1}, 0]
    // Covariance prediction
    Σ_+ = Ψ Σ_{t−1} Ψ^T + Υ_p Σ_p Υ_p^T
    // Compute Kalman gain
    K = Σ_+ Φ^T (Υ_m Σ_m Υ_m^T + Φ Σ_+ Φ^T)^{−1}
    // State update
    μ_t = μ_+ + K (x_t − g[μ_+, 0])
    // Covariance update
    Σ_t = (I − K Φ) Σ_+
  end
end
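One generic way to sketch the EKF is to approximate the four Jacobians by finite differences, so that arbitrary f[•,•] and g[•,•] can be plugged in. The finite-difference step, function signatures, and names below are my own choices; an analytic Jacobian would normally be preferred where available.

```python
# EKF sketch (Algorithm 19.3) with finite-difference Jacobians.
import numpy as np

def jacobian(fn, z0, eps=1e-6):
    """Finite-difference Jacobian of a vector-to-vector function at z0."""
    f0 = fn(z0)
    J = np.zeros((len(f0), len(z0)))
    for i in range(len(z0)):
        z = z0.astype(float).copy()
        z[i] += eps
        J[:, i] = (fn(z) - f0) / eps
    return J

def ekf(xs, f, g, Sigp, Sigm, mu0, Sig0, dp, dm):
    """dp, dm: dimensions of the temporal and measurement noise terms."""
    mu, Sig = mu0.astype(float), Sig0.astype(float)
    zp, zm = np.zeros(dp), np.zeros(dm)
    means = []
    for x in xs:
        # Jacobians of the temporal model at (mu_{t-1}, 0)
        Psi = jacobian(lambda w: f(w, zp), mu)
        Up = jacobian(lambda e: f(mu, e), zp)
        # State and covariance prediction
        mu_pred = f(mu, zp)
        Sig_pred = Psi @ Sig @ Psi.T + Up @ Sigp @ Up.T
        # Jacobians of the measurement model at (mu_+, 0)
        Phi = jacobian(lambda w: g(w, zm), mu_pred)
        Um = jacobian(lambda e: g(mu_pred, e), zm)
        # Gain and updates
        K = Sig_pred @ Phi.T @ np.linalg.inv(Um @ Sigm @ Um.T + Phi @ Sig_pred @ Phi.T)
        mu = mu_pred + K @ (x - g(mu_pred, zm))
        Sig = (np.eye(len(mu)) - K @ Phi) @ Sig_pred
        means.append(mu)
    return means
```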
Algorithm 19.4: Iterated extended Kalman filter

The iterated extended Kalman filter passes Q times through the dataset, repeating the computations of the extended Kalman filter. At each iteration it linearizes around the previous estimate of the state, with the idea that the linear approximation will get better and better. We define the initial Jacobian matrices as before:

  Ψ = ∂f[w_{t−1}, ε_p]/∂w_{t−1} |_{μ_{t−1},0}     Υ_p = ∂f[w_{t−1}, ε_p]/∂ε_p |_{μ_{t−1},0}
  Φ^0 = ∂g[w_t, ε_m]/∂w_t |_{μ_+,0}                Υ_m^0 = ∂g[w_t, ε_m]/∂ε_m |_{μ_+,0}.

However, on the qth iteration we use the Jacobians

  Φ^q = ∂g[w_t, ε_m]/∂w_t |_{μ_t^{q−1},0}          Υ_m^q = ∂g[w_t, ε_m]/∂ε_m |_{μ_t^{q−1},0},

where μ_t^{q−1} is the estimate of the state at the tth time step on the (q−1)th iteration.

Algorithm 19.4: The iterated extended Kalman filter
Input : Measurements {x_t}_{t=1..T}, temporal function f[•,•], measurement function g[•,•]
Output: Means {μ_t}_{t=1..T} and covariances {Σ_t}_{t=1..T} of marginal posterior distributions
begin
  // For each iteration
  for q=0 to Q do
    // Initialize mean and covariance
    μ_0 = 0
    Σ_0 = Σ_0   // typically set to large multiple of identity
    // For each time step
    for t=1 to T do
      // State prediction
      μ_+ = f[μ_{t−1}, 0]
      // Covariance prediction
      Σ_+ = Ψ Σ_{t−1} Ψ^T + Υ_p Σ_p Υ_p^T
      // Compute Kalman gain
      K = Σ_+ Φ^{qT} (Υ_m^q Σ_m Υ_m^{qT} + Φ^q Σ_+ Φ^{qT})^{−1}
      // State update
      μ_t^q = μ_+ + K (x_t − g[μ_+, 0])
      // Covariance update
      Σ_t^q = (I − K Φ^q) Σ_+
    end
  end
end

This algorithm can be improved by running the fixed interval smoother in between each iteration and re-linearizing around the smoothed estimates.
Algorithm 19.5: Unscented Kalman filter

The unscented Kalman filter is an alternative to the extended Kalman filter that works by approximating the Gaussian state distribution as a set of particles with the same mean and covariance, passing these particles through the nonlinear temporal and measurement equations, and then recomputing the mean and covariance from the new positions of these particles. In the example below, we assume that the state has dimension D_w and use 2D_w + 1 particles to approximate the world state. We write e_j for the jth unit vector.

Algorithm 19.5: The unscented Kalman filter
Input : Measurements {x_t}_{t=1..T}, temporal and measurement functions f[•,•], g[•,•], weight a_0
Output: Means {μ_t}_{t=1..T} and covariances {Σ_t}_{t=1..T} of marginal posterior distributions
begin
  // For each time step
  for t=1 to T do
    // Approximate state with particles
    ŵ^[0] = μ_{t−1}
    for j=1 to D_w do
      ŵ^[j] = μ_{t−1} + (D_w/(1−a_0))^{1/2} Σ_{t−1}^{1/2} e_j
      ŵ^[D_w+j] = μ_{t−1} − (D_w/(1−a_0))^{1/2} Σ_{t−1}^{1/2} e_j
      a_j = (1 − a_0)/(2 D_w)
    end
    // Pass through temporal equation and compute predicted mean and covariance
    μ_+ = sum_{j=0}^{2D_w} a_j f[ŵ^[j]]
    Σ_+ = sum_{j=0}^{2D_w} a_j (f[ŵ^[j]] − μ_+)(f[ŵ^[j]] − μ_+)^T + Σ_p
    // Approximate predicted state with particles
    ŵ^[0] = μ_+
    for j=1 to D_w do
      ŵ^[j] = μ_+ + (D_w/(1−a_0))^{1/2} Σ_+^{1/2} e_j
      ŵ^[D_w+j] = μ_+ − (D_w/(1−a_0))^{1/2} Σ_+^{1/2} e_j
    end
    // Pass through measurement equation
    for j=0 to 2D_w do
      x̂^[j] = g[ŵ^[j]]
    end
    // Compute predicted measurement and covariance
    μ_x = sum_{j=0}^{2D_w} a_j x̂^[j]
    Σ_x = sum_{j=0}^{2D_w} a_j (x̂^[j] − μ_x)(x̂^[j] − μ_x)^T + Σ_m
    // Compute new world state and covariance
    K = ( sum_{j=0}^{2D_w} a_j (ŵ^[j] − μ_+)(x̂^[j] − μ_x)^T ) Σ_x^{−1}
    μ_t = μ_+ + K (x_t − μ_x)
    Σ_t = Σ_+ − K Σ_x K^T
  end
end
Algorithm 19.6: Condensation algorithm

The condensation algorithm completely does away with the Gaussian representation and represents the distributions entirely as sets of weighted particles, where each particle can be interpreted as a hypothesis about the world state and the weight as the probability of this hypothesis being true.

Algorithm 19.6: The condensation algorithm
Input : Measurements {x_t}_{t=1..T}, temporal model Pr(w_t | w_{t−1}), measurement model Pr(x_t | w_t)
Output: Weights {a_t^[j]}_{t=1..T}, hypotheses {ŵ_t^[j]}_{t=1..T}
begin
  // Initialize weights to equal values
  a_0 = [1/J, 1/J, ..., 1/J]
  // Initialize hypotheses to plausible values for state
  for j=1 to J do
    ŵ_0^[j] = Initialize[]
  end
  // For each time step
  for t=1 to T do
    // For each particle
    for j=1 to J do
      // Sample an index from 1...J according to probabilities a_{t−1}^[1] ... a_{t−1}^[J]
      n = sampleFromCategorical[a_{t−1}]
      // Draw sample from temporal update model
      ŵ_t^[j] = sample[Pr(w_t | w_{t−1} = ŵ_{t−1}^[n])]
      // Set weight for particle according to measurement model
      a_t^[j] = Pr(x_t | ŵ_t^[j])
    end
    // Normalize weights
    a_t = a_t / sum_{j=1}^{J} a_t^[j]
  end
end
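A minimal 1-D condensation-style particle filter follows the same resample / propagate / reweight loop as Algorithm 19.6. The random-walk temporal model and Gaussian measurement likelihood are my own choices for illustration; any temporal and measurement model could be substituted.

```python
# 1-D condensation (particle) filter sketch in the spirit of Algorithm 19.6.
import numpy as np

def condensation(xs, n_particles=500, q_temporal=0.3, q_meas=0.5, rng=None):
    rng = np.random.default_rng(rng)
    w = rng.normal(0.0, 5.0, n_particles)        # initial hypotheses
    a = np.full(n_particles, 1.0 / n_particles)  # initial (equal) weights
    estimates = []
    for x in xs:
        # Resample particle indices according to the current weights
        idx = rng.choice(n_particles, size=n_particles, p=a)
        # Temporal update: random-walk dynamics (assumed model)
        w = w[idx] + rng.normal(0.0, q_temporal, n_particles)
        # Reweight by a Gaussian measurement likelihood (assumed model)
        a = np.exp(-0.5 * ((x - w) / q_meas) ** 2)
        a = a / a.sum()
        estimates.append(np.sum(a * w))          # posterior-mean estimate
    return estimates
```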
Models for visual words

Algorithm 20.1: Bag of features model

The bag of features model treats each object class as a distribution over discrete features f regardless of their position in the image. Assume that there are I images with J_i features in the ith image, and denote the jth feature in the ith image by f_ij. Then we have

  Pr(X_i | w = n) = prod_{j=1}^{J_i} Cat_{f_ij}[λ_n].

Algorithm 20.1: Learn bag of features model
Input : Features {f_ij}_{i=1..I, j=1..J_i}, labels {w_i}_{i=1..I}, Dirichlet parameter α
Output: Model parameters {λ_n}_{n=1..N}
begin
  // For each object class
  for n=1 to N do
    // For each feature value
    for k=1 to K do
      // Compute number of times feature k is observed for object class n
      N_nk = sum_{i=1}^{I} sum_{j=1}^{J_i} δ[w_i − n] δ[f_ij − k]
    end
    // Compute MAP parameter estimates
    λ_nk = (N_nk + α − 1) / (sum_{k'=1}^{K} N_nk' + Kα − K)
  end
end
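The counting and MAP step of Algorithm 20.1 can be transcribed directly (zero-based indices; the function name is my own):

```python
# MAP estimation of the per-class feature distributions (Algorithm 20.1).
import numpy as np

def learn_bag_of_features(features, labels, N, K, alpha=2.0):
    """features[i]: list of feature indices (0..K-1) for image i;
    labels[i]: its class (0..N-1). Returns an N x K matrix of lambdas."""
    counts = np.zeros((N, K))
    for f, w in zip(features, labels):
        for k in f:
            counts[w, k] += 1
    # MAP estimate of each categorical distribution under a Dirichlet(alpha) prior
    lam = (counts + alpha - 1) / (counts.sum(axis=1, keepdims=True) + K * (alpha - 1))
    return lam
```

Each row of the returned matrix is a normalized categorical distribution over the K feature values for one class.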
Algorithm 20.2: Latent Dirichlet allocation

The latent Dirichlet allocation model describes a discrete set of features f_ij ∈ {1...K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared, but the mixture weights π_i differ from image to image.

Algorithm 20.2: Learn latent Dirichlet allocation model
Input : Features {f_ij}_{i=1..I, j=1..J_i}, Dirichlet parameters α, β
Output: Model parameters {λ_m}_{m=1..M}, {π_i}_{i=1..I}
begin
  // Initialize categorical parameters
  θ = θ_0  (a)
  // Initialize count parameters
  N^(f) = 0
  N^(p) = 0
  for i=1 to I do
    for j=1 to J_i do
      // Initialize hidden variables
      p_ij = randInt[M]
      // Update count parameters
      N^(f)_{p_ij, f_ij} = N^(f)_{p_ij, f_ij} + 1
      N^(p)_{i, p_ij} = N^(p)_{i, p_ij} + 1
    end
  end
  // Main MCMC loop
  for t=1 to T do
    p^(t) = MCMCSample[p, f, N^(f), N^(p), {λ_m}_{m=1..M}, {π_i}_{i=1..I}, M, K]
  end
  // Choose samples to use for parameter estimates
  S_t = [BurnInTime : SkipTime : LastSample]
  for i=1 to I do
    for m=1 to M do
      π_im = sum_{t∈S_t} sum_{j=1}^{J_i} δ[p^(t)_ij − m] + α
    end
    π_i = π_i / sum_{m=1}^{M} π_im
  end
  for m=1 to M do
    for k=1 to K do
      λ_mk = sum_{t∈S_t} sum_{i=1}^{I} sum_{j=1}^{J_i} δ[p^(t)_ij − m] δ[f_ij − k] + β
    end
    λ_m = λ_m / sum_{k=1}^{K} λ_mk
  end
end

(a) One way to do this would be to set the categorical parameters {λ_m}_{m=1..M} and {π_i}_{i=1..I} to random values by generating positive random vectors and normalizing them to sum to one.
Algorithm 20.2b: Gibbs sampling for LDA

The preceding algorithm relies on Gibbs sampling from the posterior distribution over the part labels. This can be achieved efficiently using the following method.

Algorithm 20.2b: MCMC sampling for LDA
Input : p, f, N^(f), N^(p), {λ_m}_{m=1..M}, {π_i}_{i=1..I}, M, K
Output: Part sample p
begin
  repeat
    // Choose next feature
    (a, b) = ChooseFeature[J_1, J_2, ..., J_I]
    // Remove feature from statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} − 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} − 1
    for m=1 to M do
      q_m = (N^(f)_{m, f_ab} + β)(N^(p)_{a, m} + α)
      q_m = q_m / ( sum_{k=1}^{K}(N^(f)_{m,k} + β) · sum_{m'=1}^{M}(N^(p)_{a,m'} + α) )
    end
    // Normalize
    q = q / sum_{m=1}^{M} q_m
    // Draw new part
    p_ab = DrawCategorical[q]
    // Replace feature in statistics
    N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} + 1
    N^(p)_{a, p_ab} = N^(p)_{a, p_ab} + 1
  until all parts p_ij updated
end
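The remove / sample / replace pattern of Algorithm 20.2b can be sketched as a collapsed Gibbs sampler that works entirely on the count arrays. Sweeping over tokens in document order rather than choosing them at random, and all names below, are my own choices for compactness.

```python
# Collapsed Gibbs sweeps over part assignments, following Algorithm 20.2b.
import numpy as np

def lda_gibbs(docs, M, K, alpha=1.0, beta=1.0, n_sweeps=50, rng=None):
    """docs[i]: list of feature indices (0..K-1). Returns the part-feature
    counts Nf (M x K), image-part counts Np (I x M), and assignments."""
    rng = np.random.default_rng(rng)
    Nf = np.zeros((M, K))            # part-feature counts N^(f)
    Np = np.zeros((len(docs), M))    # image-part counts N^(p)
    parts = []
    for i, doc in enumerate(docs):   # random initial assignments
        p = rng.integers(0, M, size=len(doc))
        parts.append(p)
        for f, m in zip(doc, p):
            Nf[m, f] += 1
            Np[i, m] += 1
    for _ in range(n_sweeps):
        for i, doc in enumerate(docs):
            for j, f in enumerate(doc):
                m = parts[i][j]
                # Remove this token's assignment from the statistics
                Nf[m, f] -= 1
                Np[i, m] -= 1
                # Conditional over parts, then normalize and sample
                q = (Nf[:, f] + beta) / (Nf.sum(axis=1) + K * beta) * (Np[i] + alpha)
                q = q / q.sum()
                m = rng.choice(M, p=q)
                parts[i][j] = m
                # Replace the token in the statistics
                Nf[m, f] += 1
                Np[i, m] += 1
    return Nf, Np, parts
```

After sampling, averaging and normalizing the counts (plus the Dirichlet pseudocounts) over the retained sweeps gives the estimates of λ and π, as in Algorithm 20.2.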