Neighborhood Component Analysis 20071108

Neighbourhood Component Analysis

T.S. Yo

Outline

● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions

Introduction (1/2)

● KNN
– Simple and effective
– Nonlinear decision surface
– Non-parametric
– Quality improves with more data
– Only one parameter, K, so it is easy to tune

Introduction (2/2)
● Drawbacks of KNN
– Computationally expensive: must search through the
whole training set at test time
– How to define the “distance” properly?

● Learn the distance metric from data, and
force it to be low rank.

Learn the Distance from Data (1/5)
● What is a good distance metric?
– The one that minimizes the cost!

● Then, what is the cost?
– The expected testing error
– Best estimated with leave-one-out (LOO) cross-validation error in the training data
Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137–1143. Morgan Kaufmann, San Mateo.
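
As a point of reference for the cost above, the following is a minimal sketch of the plain LOO error of KNN under squared Euclidean distance. The function name `loo_knn_error` and the toy points are hypothetical and do not come from the slides.

```python
from collections import Counter

def loo_knn_error(X, y, k=1):
    """Fraction of points misclassified by KNN when each point is held out in turn."""
    n = len(X)
    errors = 0
    for i in range(n):
        # Squared Euclidean distance from x_i to every other training point
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        votes = Counter(y[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] != y[i]:
            errors += 1
    return errors / n

# Hypothetical toy data: the last point sits inside class "a" but is labeled "b".
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.05, 0.1)]
y = ["a", "a", "b", "b", "b"]
error = loo_knn_error(X, y, k=1)
```

This count changes in jumps as the metric changes, which is why a smooth stand-in for it is needed before gradients can be used.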

● Modeling the LOO error:
– Let pij be the probability that point xj is selected as
point xi's neighbour.
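– The probability that point xi is correctly classified is the
total probability of selecting a neighbour from its own class:
$p_i = \sum_{j \in C_i} p_{ij}$, where $C_i = \{ j : c_j = c_i \}$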

● To maximize pi for all xi means to minimize
LOO error.

● Then, how do we define pij ?
– According to the softmax of the distance dij:
$p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}$, $p_{ii} = 0$

– Relatively smoother than dij
[Figure: Softmax Function, exp(-X) plotted against X]
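
A minimal sketch of the softmax step, assuming the usual NCA form in which p_ij is exp(-d_ij) normalised over all k ≠ i and p_ii is set to 0. The function names and the toy distance matrix are hypothetical.

```python
import math

def neighbour_probabilities(d):
    """Row-wise softmax over negative distances, with p[i][i] forced to 0.

    d is a square list of lists; d[i][j] is the distance between points i and j.
    """
    n = len(d)
    p = []
    for i in range(n):
        weights = [0.0 if j == i else math.exp(-d[i][j]) for j in range(n)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def correct_probabilities(p, labels):
    """p_i: total probability that point i selects a neighbour of its own class."""
    n = len(labels)
    return [sum(p[i][j] for j in range(n) if labels[j] == labels[i]) for i in range(n)]

# Hypothetical toy data: points 0 and 1 are close and share a class, point 2 is far away.
d = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 4.0],
     [4.0, 4.0, 0.0]]
labels = ["a", "a", "b"]
p = neighbour_probabilities(d)
p_i = correct_probabilities(p, labels)
```

Point 2 has no same-class neighbour, so its p_i is 0, while points 0 and 1 put most of their probability on each other.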

● How do we define dij ?
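● Restrict the distance measure to the Mahalanobis
(quadratic) distance:
$d_{ij} = (x_i - x_j)^T Q (x_i - x_j)$

● That is to say, we project the original feature
vectors x into another vector space with a
transformation matrix A, where $Q = A^T A$:
$d_{ij} = (A x_i - A x_j)^T (A x_i - A x_j)$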
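
The link between the quadratic distance and the projection can be checked numerically. This is a small sketch assuming NumPy is available and that the quadratic form uses Q = AᵀA; the matrix shape and the random vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))      # hypothetical 2x4 transformation, so rank <= 2
x_i = rng.normal(size=4)
x_j = rng.normal(size=4)

Q = A.T @ A                      # induced Mahalanobis matrix (positive semidefinite)
diff = x_i - x_j
d_quadratic = diff @ Q @ diff    # (x_i - x_j)^T Q (x_i - x_j)
d_projected = np.sum((A @ x_i - A @ x_j) ** 2)   # squared Euclidean distance after projection
```

Both quantities agree up to rounding, so learning A is the same as learning a (possibly low-rank) Mahalanobis metric.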

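● Substitute the dij in pij :
$p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}$

● Now, we have the objective function :
$f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i$, where $C_i$ is the set of points in the same class as xi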

● Maximize f(A) w.r.t. A → minimize overall
LOO error
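
A sketch of how f(A) could be evaluated, assuming NumPy and the standard NCA objective (the sum over all points of the probability of picking a same-class neighbour). The function name `nca_objective` and the toy data are hypothetical.

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i sum_{j in C_i} p_ij, with p_ij a softmax over -||A x_i - A x_j||^2."""
    Z = X @ A.T                                    # project every row of X
    diff = Z[:, None, :] - Z[None, :, :]
    d = np.sum(diff ** 2, axis=2)                  # pairwise squared distances
    np.fill_diagonal(d, np.inf)                    # exp(-inf) = 0, so p_ii = 0
    w = np.exp(d.min(axis=1, keepdims=True) - d)   # shift each row for numerical stability
    p = w / w.sum(axis=1, keepdims=True)
    same_class = y[:, None] == y[None, :]
    return float(np.sum(p * same_class))

# Hypothetical toy data: the classes differ only in the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
f_identity = nca_objective(np.eye(2), X, y)                # keeps the informative feature
f_second_only = nca_objective(np.array([[0.0, 1.0]]), X, y)  # projects it away
```

The value lies between 0 and the number of points; here the identity scores higher than the projection that discards the separating feature.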

The Size of K
● For the probability distribution pij, the entropy over the
neighbours of xi is:
$H_i = -\sum_j p_{ij} \log_2 p_{ij}$

● The perplexity, $2^{H_i}$, can be used as an estimate of
the number of neighbours to be considered, K
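
A minimal sketch of the perplexity estimate, assuming the standard definition (2 raised to the Shannon entropy in bits) applied to each row of pij. Averaging over points and rounding is an assumption made here for illustration; the slides do not say how the per-point values are combined.

```python
import math

def perplexity(row):
    """2 ** H(row), with H the Shannon entropy in bits; zero entries contribute nothing."""
    entropy = -sum(q * math.log2(q) for q in row if q > 0.0)
    return 2.0 ** entropy

def estimate_k(p):
    """Hypothetical aggregation: average the per-point perplexities and round."""
    mean_perplexity = sum(perplexity(row) for row in p) / len(p)
    return max(1, round(mean_perplexity))

# Hypothetical neighbour probabilities for four points (each row sums to 1).
p = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [1 / 3, 1 / 3, 1 / 3, 0.0]]
k = estimate_k(p)
```

A row spread evenly over m neighbours has perplexity m, so the measure can be read as an effective neighbour count.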

Procedure of NCA (1/2)
● Use the objective function and its gradient to
learn the transformation matrix A and K from
the training data, Dtrain (with or without dimension
reduction).
● Project the test data, Dtest, into the transformed
space.
● Perform traditional KNN (with K and ADtrain) on
the transformed test data, ADtest.
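
The three steps can be sketched end to end as follows, assuming NumPy. The optimizer is plain fixed-step gradient ascent, which is a stand-in and not necessarily what was used for the experiments; K is passed in by hand rather than learned; all function names, hyperparameters and the toy data are hypothetical.

```python
import numpy as np

def nca_value_and_grad(A, X, y):
    """Return f(A) = sum_i p_i and its gradient with respect to A."""
    Z = X @ A.T                                        # rows are A x_i
    d = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d, np.inf)                        # forces p_ii = 0
    w = np.exp(d.min(axis=1, keepdims=True) - d)       # row shift for stability
    p = w / w.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :]).astype(float)
    p_i = np.sum(p * same, axis=1)
    # Gradient 2A * sum_ij M_ij (x_i - x_j)(x_i - x_j)^T with
    # M_ij = p_ij * (p_i - [c_i == c_j]), evaluated through a graph-Laplacian form.
    M = p * (p_i[:, None] - same)
    S = M + M.T
    laplacian = np.diag(S.sum(axis=1)) - S
    grad = 2.0 * Z.T @ laplacian @ X
    return float(p_i.sum()), grad

def fit_nca(X, y, n_components, steps=200, lr=0.1, seed=0):
    """Fixed-step gradient ascent on f(A); step count and learning rate are illustrative."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
    for _ in range(steps):
        _, grad = nca_value_and_grad(A, X, y)
        A = A + lr * grad / len(X)
    return A

def knn_predict(Z_train, y_train, Z_test, k):
    """Majority vote among the k nearest training points; labels must be non-negative ints."""
    d = np.sum((Z_test[:, None, :] - Z_train[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in y_train[nearest]])

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rng = np.random.default_rng(1)
y_all = np.repeat([0, 1], 30)
X_all = np.column_stack([2.0 * y_all + 0.3 * rng.normal(size=60),
                         3.0 * rng.normal(size=60)])
order = rng.permutation(60)
train, test = order[:40], order[40:]

A = fit_nca(X_all[train], y_all[train], n_components=1)    # step 1: learn A on D_train
Z_train = X_all[train] @ A.T                                # A D_train
Z_test = X_all[test] @ A.T                                  # step 2: project D_test
y_pred = knn_predict(Z_train, y_all[train], Z_test, k=3)    # step 3: KNN in the new space
```

Setting `n_components` below the input dimension gives the dimension-reducing variant; setting it equal to the input dimension learns a full-rank metric.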

Procedure of NCA (2/2)
● Functions used for optimization
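
For reference, the objective and its gradient in the formulation of Goldberger et al. can be written as below, with $x_{ij} = x_i - x_j$ and $C_i$ the set of points sharing the class of $x_i$. This is the standard form from the NCA paper; the slide may have shown an equivalent rearrangement.

```latex
f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i

\frac{\partial f}{\partial A}
  = 2A \sum_i \left( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^{T}
                     - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^{T} \right)
```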

Experiments – Datasets (1/2)
● 4 from UCI ML Repository, 2 self-made

Experiments – Datasets (2/2)

n2d is a mixture of two bivariate normal distributions with different means and
covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of
uniform random noise.
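
The two self-made datasets could be generated along these lines, assuming NumPy. Every numeric value (means, covariances, radii, jitter, noise range) and the choice of one class per ring are assumptions for illustration; the slides do not give the actual parameters.

```python
import numpy as np

def make_n2d(n_per_class, rng):
    """Two bivariate normals with different means and covariances (values are hypothetical)."""
    a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n_per_class)
    b = rng.multivariate_normal([2.0, 2.0], [[2.0, -0.8], [-0.8, 0.5]], size=n_per_class)
    X = np.vstack([a, b])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def make_ring(n_per_class, rng, radii=(1.0, 2.0), jitter=0.1, n_noise=8):
    """Concentric 2-d rings (assumed: one class per ring) plus uniform noise dimensions."""
    blocks, labels = [], []
    for label, r in enumerate(radii):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_per_class)
        radius = r + jitter * rng.normal(size=n_per_class)
        ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
        noise = rng.uniform(-1.0, 1.0, size=(n_per_class, n_noise))
        blocks.append(np.hstack([ring, noise]))
        labels.append(np.full(n_per_class, label))
    return np.vstack(blocks), np.concatenate(labels)

rng = np.random.default_rng(0)
X_n2d, y_n2d = make_n2d(100, rng)
X_ring, y_ring = make_ring(100, rng)
```

In the ring data only the first two of the ten dimensions carry class information, which makes it a natural test for a rank 2 transformation.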

Experiments – Results (1/4)

Error rates of KNN and NCA with the same K.
In general, NCA improves the performance of KNN.

● Compare with other classifiers

● Rank 2 dimension reduction

Discussions (1/8)
● Rank 2 transformation for wine

Discussions (2/8)
● Rank 1 transformation for n2d

Discussions (3/8)
● Results of Goldberger et al. (40 realizations of 30%/70% splits)

Discussions (4/8)
● Results of Goldberger et al. (rank 2 transformation)

Discussions (5/8)
● The experimental results suggest that KNN
classification can be improved by using the
distance metric learned by the NCA algorithm.

● NCA also outperforms traditional dimension
reduction methods for several datasets.

Discussions (6/8)
● Compared with other classification methods (e.g.
LDA and QDA), NCA usually does not give the
best accuracy.

● Some odd results on dimension reduction
suggest that further investigation of the
optimization algorithm is necessary.

Discussions (7/8)
● Optimize a matrix
● Can we Optimize these Functions? (Michael L. Overton)
– Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski)
– Locally, yes.
● But not by standard methods for nonconvex, smooth optimization
● Steepest descent, BFGS or nonlinear conjugate gradient will typically jam because of nonsmoothness

Discussions (8/8)
● Other methods that learn a distance metric from data
– Discriminant Common Vectors (DCV)
● Similar to NCA, DCV focuses on optimizing the distance
metric on certain objective functions

– Laplacianfaces (LAP)
● Puts more emphasis on dimension reduction

J. Liu and S. Chen, Discriminant Common Vectors Versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem. Image and Vision Computing

Derive the Objective Function (1/5)
● From the assumptions, we have :
