Outline
● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions
Introduction (1/2)
● KNN
  – Simple and effective
  – Nonlinear decision surface
  – Non-parametric
  – Quality improves with more data
  – Only one parameter, K -> easy to tune
Introduction (2/2)
● Drawbacks of KNN
  – Computationally expensive: searches through the whole training data at test time
  – How to define the “distance” properly?
● Learn the distance metric from data, and force it to be low rank.
Learn the Distance from Data (1/5)
● What is a good distance metric?
  – The one that minimizes (optimizes) the cost!
● Then, what is the cost?
  – The expected testing error
  – Best estimated with the leave-one-out (LOO) cross-validation error on the training data
Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137–1143. (Morgan Kaufmann, San Mateo)
Learn the Distance from Data (2/5)
● Modeling the LOO error:
  – Let pij be the probability that point xj is selected as point xi's neighbour.
  – The probability that points are correctly classified when xi is used as the reference is:
● To maximize pi for all xi means to minimize LOO error.
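In the standard NCA formulation of Goldberger et al., the probability pi referred to above is the sum of pij over the points that share xi's class. The class labels c_i and the set name C_i are notation assumed here, not taken from the slide.

```latex
p_i = \sum_{j \in C_i} p_{ij}, \qquad C_i = \{\, j : c_j = c_i \,\}
```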
Learn the Distance from Data (3/5)
● Then, how do we define pij?
  – According to the softmax of the distance dij
  – Relatively smoother than dij
[Figure: "Softmax Function", exp(-X) plotted against X]
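A minimal sketch of the softmax definition, assuming the usual NCA convention that a point never selects itself (pii = 0) and that closer points get higher probability (softmax of -dij). The function name and the toy distance matrix are hypothetical.

```python
import numpy as np

def neighbour_probabilities(d):
    """Row-wise softmax of -d_ij with p_ii = 0.

    d is an (n, n) array of pairwise distances d_ij.
    """
    logits = -np.asarray(d, dtype=float)          # negation makes a new array
    np.fill_diagonal(logits, -np.inf)             # a point never selects itself
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# Toy distances between three points (made-up values).
d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
p = neighbour_probabilities(d)
print(p)
```

Each row of `p` is a probability distribution over the other points, so nearby points dominate without a hard cut-off at a fixed number of neighbours.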
Learn the Distance from Data (4/5)
● How do we define dij?
● Limit the distance measure to the Mahalanobis (quadratic) distance.
● That is to say, we project the original feature vectors x into another vector space with a transformation matrix, A
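The two bullets above describe the same quantity: the squared Euclidean distance after projecting with A equals the quadratic form with M = A^T A. The sketch below checks this numerically. The function name, data and matrix values are made up for illustration.

```python
import numpy as np

def pairwise_sq_distances(X, A):
    """d_ij = ||A x_i - A x_j||^2 for the rows x_i of X."""
    Z = X @ A.T                                # project every point: z_i = A x_i
    diff = Z[:, None, :] - Z[None, :, :]       # all pairwise differences
    return np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # 5 points, 3 features (made-up data)
A = rng.normal(size=(2, 3))                    # rank-2 transformation (made-up values)

d = pairwise_sq_distances(X, A)

# The same distance in quadratic (Mahalanobis) form with M = A^T A.
M = A.T @ A
delta = X[0] - X[1]
assert np.isclose(d[0, 1], delta @ M @ delta)
```

Choosing A with fewer rows than columns, as here, is what forces the learned metric to be low rank.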
Learn the Distance from Data (5/5)
● Substitute the dij in pij:
● Now, we have the objective function:
● Maximize f(A) w.r.t. A → minimize overall LOO error
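For reference, the substituted probability and the objective take the following form in the standard NCA formulation of Goldberger et al. This assumes dij is the squared Euclidean distance in the transformed space, and that C_i (notation assumed here) is the set of points in the same class as xi.

```latex
p_{ij} = \frac{\exp\left(-\lVert A x_i - A x_j \rVert^2\right)}
              {\sum_{k \neq i} \exp\left(-\lVert A x_i - A x_k \rVert^2\right)},
\qquad p_{ii} = 0

f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i
```

f(A) is the expected number of training points classified correctly under this stochastic neighbour selection, which is why maximizing it minimizes the modeled LOO error.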
The Size of k
● For the probability distribution pij:
● The perplexity can be used as an estimate of the number of neighbours to consider, k
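A small sketch of the idea, assuming the common definition of perplexity as 2 raised to the Shannon entropy (in bits) of one row of pij. The function name and the example rows are hypothetical.

```python
import numpy as np

def perplexity(p_row):
    """Perplexity 2**H of a discrete distribution, with H in bits."""
    p = np.asarray(p_row, dtype=float)
    p = p[p > 0]                               # 0 * log(0) is treated as 0
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

# Probability mass spread evenly over 4 neighbours (made-up row of p_ij).
uniform_over_4 = np.array([0.25, 0.25, 0.25, 0.25, 0.0])
# Almost all mass on a single neighbour (made-up row of p_ij).
peaked = np.array([0.97, 0.01, 0.01, 0.01, 0.0])

print(perplexity(uniform_over_4))
print(perplexity(peaked))
```

A row that is uniform over k neighbours has perplexity k, so the perplexity can be read as the effective number of neighbours that carry weight for that point.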
Procedure of NCA (1/2)
● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, Dtrain (with or without dimension reduction).
● Project the test data, Dtest, into the transformed space.
● Perform traditional KNN (with K and ADtrain) on the transformed test data, ADtest.
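The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the optimizer is plain gradient ascent with step halving (the slides do not say which optimizer was used), K is fixed by hand rather than estimated from the perplexity, and all function names, hyperparameters and toy data are hypothetical.

```python
import numpy as np

def nca_objective_and_grad(A, X, y):
    """Return f(A) = sum_i p_i and its gradient with respect to A."""
    Z = X @ A.T
    d = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # d_ij
    logits = -d
    np.fill_diagonal(logits, -np.inf)                            # p_ii = 0
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                            # p_ij
    same = y[:, None] == y[None, :]                              # j in C_i
    p_i = np.sum(P * same, axis=1)
    # Gradient: 2A * sum_ij (p_i p_ij - [j in C_i] p_ij) x_ij x_ij^T
    W = p_i[:, None] * P - P * same
    diff = X[:, None, :] - X[None, :, :]                         # x_ij = x_i - x_j
    S = np.einsum("ij,ijk,ijl->kl", W, diff, diff)
    return p_i.sum(), 2.0 * A @ S

def fit_nca(X, y, n_components, n_iter=100, step=0.1, seed=0):
    """Gradient ascent on f(A); a step is kept only if f increases."""
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.normal(size=(n_components, X.shape[1]))        # assumed init
    f, g = nca_objective_and_grad(A, X, y)
    for _ in range(n_iter):
        A_new = A + step * g
        f_new, g_new = nca_objective_and_grad(A_new, X, y)
        if f_new > f:
            A, f, g = A_new, f_new, g_new
        else:
            step *= 0.5                                          # shrink and retry
    return A

def knn_predict(train_Z, train_y, test_Z, k):
    """Majority vote among the k nearest training points (integer labels >= 0)."""
    d = np.sum((test_Z[:, None, :] - train_Z[None, :, :]) ** 2, axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in train_y[nearest]])

# Toy data: one informative feature plus one noisy feature (made-up values).
rng = np.random.default_rng(0)
n = 60
y = np.repeat([0, 1], n // 2)
signal = np.where(y == 0, -1.0, 1.0) + 0.3 * rng.normal(size=n)
noise = 2.0 * rng.normal(size=n)
X = np.column_stack([signal, noise])
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

A = fit_nca(X_train, y_train, n_components=1)                    # step 1: learn A
pred = knn_predict(X_train @ A.T, y_train, X_test @ A.T, k=3)    # steps 2 and 3
print("test error:", np.mean(pred != y_test))
```

Passing `n_components` smaller than the number of features gives the dimension-reducing variant; setting it equal to the number of features learns a full-rank metric.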
Procedure of NCA (2/2)
● Functions used for optimization
Experiments – Datasets (1/2)
● 4 from UCI ML Repository, 2 self-made
Experiments – Datasets (2/2)
n2d is a mixture of two bivariate normal distributions with different means and covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of uniform random noise.
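A rough sketch of how datasets of this kind could be generated. The slides give no parameters, so everything here is an assumption chosen for illustration: the means, covariances, radii, noise ranges, sample sizes, and the choice to treat each mixture component or ring as one class.

```python
import numpy as np

def make_n2d(n_per_class, rng):
    """Two classes, each drawn from its own bivariate normal (assumed parameters)."""
    mean0, cov0 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
    mean1, cov1 = [2.0, 2.0], [[2.0, 0.8], [0.8, 1.0]]
    X = np.vstack([rng.multivariate_normal(mean0, cov0, n_per_class),
                   rng.multivariate_normal(mean1, cov1, n_per_class)])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def make_ring(n_per_class, rng, radii=(1.0, 2.0), n_noise=8):
    """Concentric 2-d rings (one class per ring) plus uniform noise dimensions."""
    blocks, labels = [], []
    for label, r in enumerate(radii):
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_class)
        radius = r + 0.1 * rng.normal(size=n_per_class)          # slightly fuzzy ring
        ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
        noise = rng.uniform(-1.0, 1.0, size=(n_per_class, n_noise))
        blocks.append(np.hstack([ring, noise]))
        labels.append(np.full(n_per_class, label))
    return np.vstack(blocks), np.concatenate(labels)

rng = np.random.default_rng(0)
X_n2d, y_n2d = make_n2d(100, rng)
X_ring, y_ring = make_ring(100, rng)
print(X_n2d.shape, X_ring.shape)
```

In the ring data only the first two of the ten columns carry class information, which makes it a natural test of whether a learned low-rank metric ignores the noise dimensions.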
Experiments – Results (1/4)
Error rates of KNN and NCA with the same K. It is shown that, generally, NCA does improve the performance of KNN.
Experiments – Results (2/4)
Experiments – Results (3/4)
● Comparison with other classifiers
Discussions (3/8)
● Results of Goldberger et al. (40 realizations of 30%/70% splits)
Discussions (4/8)
● Results of Goldberger et al. (rank 2 transformation)
Discussions (5/8)
● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.
● NCA also outperforms traditional dimension reduction methods on several datasets.
Discussions (6/8)
● Compared with other classification methods (i.e. LDA and QDA), NCA usually does not give the best accuracy.
● Some odd performance in dimension reduction suggests that further investigation of the optimization algorithm is necessary.
Discussions (7/8)
● Optimize a matrix
● Can we Optimize these Functions? (Michael L. Overton)
  – Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski)
  – Locally, yes.
    ● But not by standard methods for nonconvex, smooth optimization
    ● Steepest descent, BFGS or nonlinear conjugate gradient will typically jam because of nonsmoothness
Discussions (8/8)
● Other methods that learn a distance metric from data
  – Discriminant Common Vectors (DCV)
    ● Similar to NCA, DCV focuses on optimizing the distance metric with respect to a certain objective function
  – Laplacianfaces (LAP)
    ● Places more emphasis on dimension reduction
J. Liu and S. Chen, Discriminant Common Vectors Versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem. Image and Vision Computing
Question?
Thank you!
Derive the Objective Function (1/5)
● From the assumptions, we have: