Neighborhood Component Analysis 20071108


An introduction to

Published in: Technology, Education

  1. Neighbourhood Component Analysis
     T.S. Yo
  2. References
  3. Outline
     ● Introduction
     ● Learn the distance metric from data
     ● The size of K
     ● Procedure of NCA
     ● Experiments
     ● Discussions
  4. Introduction (1/2)
     ● KNN
       – Simple and effective
       – Nonlinear decision surface
       – Non-parametric
       – Quality improves with more data
       – Only one parameter, K, so it is easy to tune
  5. Introduction (2/2)
     ● Drawbacks of KNN
       – Computationally expensive: it searches through the whole training data at test time
       – How should the “distance” be defined properly?
     ● Learn the distance metric from data, and force it to be low rank.
  6. Learn the Distance from Data (1/5)
     ● What is a good distance metric?
       – The one that minimizes (optimizes) the cost!
     ● Then, what is the cost?
       – The expected testing error
       – Best estimated with the leave-one-out (LOO) cross-validation error on the training data
     Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137–1143. (Morgan Kaufmann, San Mateo)
  7. Learn the Distance from Data (2/5)
     ● Modeling the LOO error:
       – Let p_ij be the probability that point x_j is selected as point x_i's neighbour.
       – Let p_i be the probability that x_i is correctly classified when it is used as the reference point.
     ● Maximizing p_i for all x_i means minimizing the LOO error.
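In the usual NCA formulation of Goldberger et al., p_i is the total neighbour probability that x_i assigns to points of its own class. The symbols c_i (class label of x_i) and C_i (index set of points sharing that label) are assumptions borrowed from that formulation, not notation confirmed by the slide:

```latex
p_i = \sum_{j \in C_i} p_{ij},
\qquad C_i = \{\, j : c_j = c_i \,\},
\qquad p_{ii} = 0
```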
  8. Learn the Distance from Data (3/5)
     ● Then, how do we define p_ij?
       – According to the softmax of the distance d_ij
       – Relatively smoother than d_ij
     [Figure: "Softmax Function", a plot of exp(-X) against X]
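A minimal sketch of the softmax step, assuming the distances are already available as a square matrix. The function name `neighbour_probabilities` is hypothetical, and forcing p_ii = 0 (a point never selects itself) is an assumption taken from the standard leave-one-out setting rather than from the slide:

```python
import numpy as np

def neighbour_probabilities(d):
    """Turn an (n, n) distance matrix into row-wise softmax probabilities p_ij."""
    logits = -np.asarray(d, dtype=float)          # small distance -> large weight
    np.fill_diagonal(logits, -np.inf)             # assumption: p_ii = 0
    logits -= logits.max(axis=1, keepdims=True)   # shift rows for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Example: three points on a line, squared distances as d_ij.
P = neighbour_probabilities([[0, 1, 4],
                             [1, 0, 1],
                             [4, 1, 0]])
```

Each row of `P` sums to one, and nearer points receive larger probabilities than farther ones.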
  9. Learn the Distance from Data (4/5)
     ● How do we define d_ij?
     ● Limit the distance measure to the Mahalanobis (quadratic) distance.
     ● That is to say, we project the original feature vectors x into another vector space with a transformation matrix, A.
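Written out, a quadratic distance with a positive semidefinite matrix Q is the same as a squared Euclidean distance after projecting with A. The symbol Q is an assumption introduced here for illustration:

```latex
d_{ij} = (x_i - x_j)^{\top} Q \,(x_i - x_j)
       = \lVert A x_i - A x_j \rVert^{2},
\qquad Q = A^{\top} A
```

If A has fewer rows than columns, Q has low rank and the projection also reduces the dimension.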
  10. Learn the Distance from Data (5/5)
     ● Substitute d_ij into p_ij to obtain the objective function f(A).
     ● Maximize f(A) w.r.t. A → minimize the overall LOO error
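Under the standard NCA formulation of Goldberger et al., the substitution gives the expressions below. Here C_i denotes the set of points that share the class of x_i; that symbol, and the exclusion of k = i from the normalising sum, are assumptions taken from that formulation:

```latex
p_{ij} = \frac{\exp\!\left(-\lVert A x_i - A x_j \rVert^{2}\right)}
              {\sum_{k \neq i} \exp\!\left(-\lVert A x_i - A x_k \rVert^{2}\right)},
\qquad
f(A) = \sum_{i} \sum_{j \in C_i} p_{ij} = \sum_{i} p_i
```

f(A) is the expected number of correctly classified training points, so it lies between 0 and the number of points.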
  11. The Size of k
     ● For the probability distribution p_ij, the perplexity can be used as an estimate of the number of neighbours to consider, k.
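A hedged sketch of the perplexity computation, assuming the common definition exp(H), where H is the Shannon entropy of each row of p_ij in nats. The function name `perplexity` is hypothetical, and how the per-point values are combined into a single k is not stated in the slides:

```python
import numpy as np

def perplexity(P):
    """Per-row perplexity exp(H_i), with H_i = -sum_j p_ij * log(p_ij)."""
    P = np.asarray(P, dtype=float)
    safe = np.where(P > 0, P, 1.0)            # log(1) = 0, so zero entries add nothing
    entropy = -(P * np.log(safe)).sum(axis=1)
    return np.exp(entropy)

# A row spread evenly over 4 neighbours behaves like "4 effective neighbours";
# a row concentrated on one neighbour behaves like 1.
values = perplexity([[0.0, 0.25, 0.25, 0.25, 0.25],
                     [0.0, 1.0, 0.0, 0.0, 0.0]])
```

One possible choice, stated here only as an assumption, is to round the mean of these per-point values to obtain a single k for KNN.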
  12. Procedure of NCA (1/2)
     ● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, D_train (with or without dimension reduction).
     ● Project the test data, D_test, into the transformed space.
     ● Perform traditional KNN (with K and AD_train) on the transformed test data, AD_test.
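The last two steps of this procedure can be sketched as follows, assuming A and K have already been learned. The function name `knn_predict_transformed` and the toy data are hypothetical, and the majority vote with squared Euclidean distance is an assumption about what "traditional KNN" means here:

```python
import numpy as np
from collections import Counter

def knn_predict_transformed(A, X_train, y_train, X_test, k):
    """Project both sets with A, then classify each test point by k-NN majority vote."""
    Z_train = np.asarray(X_train, dtype=float) @ A.T   # A D_train
    Z_test = np.asarray(X_test, dtype=float) @ A.T     # A D_test
    labels = list(y_train)
    predictions = []
    for z in Z_test:
        dist = ((Z_train - z) ** 2).sum(axis=1)        # squared distance in projected space
        nearest = np.argsort(dist)[:k]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data: the first feature carries the class, the second is noise.
A = np.array([[1.0, 0.0]])                             # rank-1 map that drops the noise
X_train = [[0, 5], [0, -5], [1, 5], [1, -5]]
y_train = [0, 0, 1, 1]
pred = knn_predict_transformed(A, X_train, y_train, [[0.1, 5], [0.9, -5]], k=1)
```

Because the projection removes the noisy second feature, the neighbours are chosen on the informative feature only.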
  13. Procedure of NCA (2/2)
     ● Functions used for optimization
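As one possible pair of such functions, the sketch below evaluates the NCA objective and its gradient in the form given by Goldberger et al. It is an illustration under that assumption, not the code used for these slides; the name `nca_objective_and_gradient` is hypothetical:

```python
import numpy as np

def nca_objective_and_gradient(A, X, y):
    """Return f(A) = sum_i p_i and its gradient with respect to A (shape of A)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diff = X[:, None, :] - X[None, :, :]           # x_ij = x_i - x_j, shape (n, n, D)
    proj = diff @ A.T                              # A x_ij, shape (n, n, d)
    logits = -(proj ** 2).sum(axis=2)              # -||A x_i - A x_j||^2
    np.fill_diagonal(logits, -np.inf)              # p_ii = 0
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)              # softmax rows: p_ij
    same = y[:, None] == y[None, :]                # True where c_i == c_j
    p_i = (P * same).sum(axis=1)
    # Gradient: 2A * sum_ij (p_i * p_ij - [c_i == c_j] * p_ij) * x_ij x_ij^T
    W = p_i[:, None] * P - P * same
    S = np.einsum('ij,ija,ijb->ab', W, diff, diff)
    return p_i.sum(), 2.0 * A @ S
```

A pair like this could be handed to a gradient-based optimizer after flattening A and negating both values (to turn maximization into minimization); checking the gradient against finite differences first is a cheap safeguard.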
  14. Experiments – Datasets (1/2)
     ● 4 from the UCI ML Repository, 2 self-made
  15. Experiments – Datasets (2/2)
     n2d is a mixture of two bivariate normal distributions with different means and covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of uniform random noise.
  16. Experiments – Results (1/4)
     Error rates of KNN and NCA with the same K. NCA generally improves the performance of KNN.
  17. Experiments – Results (2/4)
  18. Experiments – Results (3/4)
     ● Comparison with other classifiers
  19. Experiments – Results (4/4)
     ● Rank 2 dimension reduction
  20. Discussions (1/8)
     ● Rank 2 transformation for wine
  21. Discussions (2/8)
     ● Rank 1 transformation for n2d
  22. Discussions (3/8)
     ● Results of Goldberger et al. (40 realizations of 30%/70% splits)
  23. Discussions (4/8)
     ● Results of Goldberger et al. (rank 2 transformation)
  24. Discussions (5/8)
     ● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.
     ● NCA also outperforms traditional dimension reduction methods on several datasets.
  25. Discussions (6/8)
     ● Compared with other classification methods (i.e. LDA and QDA), NCA usually does not give the best accuracy.
     ● Some odd performance on dimension reduction suggests that further investigation of the optimization algorithm is necessary.
  26. Discussions (7/8)
     ● Optimize a matrix
     ● Can we Optimize these Functions? (Michael L. Overton)
       – Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski)
       – Locally, yes.
         ● But not by standard methods for nonconvex, smooth optimization
         ● Steepest descent, BFGS or nonlinear conjugate gradient will typically jam because of nonsmoothness
  27. Discussions (8/8)
     ● Other methods that learn a distance metric from data
       – Discriminant Common Vectors (DCV)
         ● Similar to NCA, DCV focuses on optimizing the distance metric with respect to certain objective functions
       – Laplacianfaces (LAP)
         ● Puts more emphasis on dimension reduction
     J. Liu and S. Chen, Discriminant Common Vectors Versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem. Image and Vision Computing
  28. Question?
  29. Thank you!
  30. Derive the Objective Function (1/5)
     ● From the assumptions, we have:
  31. Derive the Objective Function (2/5)
  32. Derive the Objective Function (3/5)
  33. Derive the Objective Function (4/5)
  34. Derive the Objective Function (5/5)