Distance Metric Learning:
A Comprehensive Survey
Liu Yang
Advisor: Rong Jin
May 8th, 2006
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
 Conclusions
Introduction
- Definition
  - Distance metric learning is the task of learning a distance metric for the input space of the data from a given collection of pairs of similar/dissimilar points, such that the learned metric preserves the distance relations among the training data pairs.
- Importance
  - Many machine learning algorithms rely heavily on the distance metric over the input data patterns, e.g. kNN.
  - A learned metric can significantly improve performance in classification, clustering and retrieval tasks, e.g. the kNN classifier, spectral clustering, content-based image retrieval (CBIR).
Contributions of this Survey
- Review distance metric learning under different learning conditions
  - supervised learning vs. unsupervised learning
  - learning in a global sense vs. in a local sense
  - distance metric based on a linear kernel vs. a nonlinear kernel
 Discuss central techniques of distance metric learning
 K nearest neighbor
 dimension reduction
 semidefinite programming
 kernel learning
 large margin classification
Taxonomy of the approaches covered in this survey:
- Supervised Distance Metric Learning
  - Global: Global Distance Metric Learning by Convex Programming
  - Local: Local Adaptive Distance Metric Learning; Neighborhood Components Analysis; Relevant Component Analysis
- Unsupervised Distance Metric Learning
  - Linear embedding: PCA, MDS
  - Nonlinear embedding: LLE, ISOMAP, Laplacian Eigenmaps
- Distance Metric Learning based on SVM
  - Large Margin Nearest Neighbor Based Distance Metric Learning
  - Cast Kernel Margin Maximization into a SDP problem
- Kernel Methods for Distance Metrics Learning
  - Kernel Alignment with SDP
  - Learning with Idealized Kernel
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
Supervised Global Distance Metric Learning (Xing et al. 2003)
- Goal: keep all the data points within the same classes close, while separating all the data points from different classes.
- Formulated as a constrained convex programming problem:
  - minimize the distance between the data pairs in S
  - subject to the data pairs in D being well separated
- Equivalence constraints: $S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to the same class}\}$
- Inequivalence constraints: $D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to different classes}\}$
- $d_A^2(x, y) = \|x - y\|_A^2 = (x - y)^T A (x - y)$, where $A \in \mathbb{R}^{m \times m}$, $A \succeq 0$, is the distance metric
Global Distance Metric Learning (Cont'd)
- A is positive semi-definite
  - Ensures the non-negativity and the triangle inequality of the metric
- The number of parameters is quadratic in the number of features
  - Difficult to scale to a large number of features
- Simplify the computation:
  $\min_{A \in \mathbb{R}^{m \times m}} \ \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad A \succeq 0, \ \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \geq 1$
Global Distance Metric Learning: Example I
- Keep all the data points within the same classes close
- Separate all the data points from different classes
- (a) Data distribution of the original dataset; (b) data scaled by the global metric
Global Distance Metric Learning: Example II
- Diagonalizing the distance metric A can simplify the computation, but may lead to disastrous results
- (a) Original data; (b) rescaling by the learned full A; (c) rescaling by the learned diagonal A
Problems with Global Distance Metric Learning
- Multimodal data distributions prevent global distance metrics from simultaneously satisfying constraints on within-class compactness and between-class separability.
- (a) Data distribution of the original dataset; (b) data scaled by the global metric
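To make the convex program from the preceding slides concrete, here is a minimal sketch of one way to attack it numerically: a penalty term for the separation constraint plus projected gradient descent onto the PSD cone. This is an illustrative assumption, not the iterative-projection algorithm used by Xing et al.; `lr`, `lam` and `n_iter` are arbitrary illustrative parameters.

```python
import numpy as np

def psd_project(A):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learn_global_metric(X, S, D, n_iter=500, lr=0.01, lam=10.0):
    """Penalty + projected-gradient sketch of the global formulation.

    X : (n, m) data matrix; S, D : lists of (i, j) index pairs (similar / dissimilar).
    Minimizes sum_{S} d_A^2(x_i, x_j) while penalizing sum_{D} d_A(x_i, x_j) < 1.
    """
    m = X.shape[1]
    A = np.eye(m)
    for _ in range(n_iter):
        grad = np.zeros((m, m))
        for i, j in S:                      # gradient of the similar-pair term
            d = (X[i] - X[j])[:, None]
            grad += d @ d.T
        dist_D = sum(np.sqrt(max((X[i] - X[j]) @ A @ (X[i] - X[j]), 1e-12)) for i, j in D)
        if dist_D < 1.0:                    # separation constraint violated: push D pairs apart
            for i, j in D:
                d = (X[i] - X[j])[:, None]
                denom = 2.0 * np.sqrt(max((X[i] - X[j]) @ A @ (X[i] - X[j]), 1e-12))
                grad -= lam * (d @ d.T) / denom
        A = psd_project(A - lr * grad)      # gradient step, then back to the PSD cone
    return A
```

With a handful of features this is workable; the quadratic growth in the number of parameters noted above is exactly what makes the global approach hard to scale.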
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
 Conclusions
Supervised Local Distance Metric
Learning
 Local Adaptive Distance Metric Learning
 Local Feature Relevance
 Locally Adaptive Feature Relevance Analysis
 Local Linear Discriminative Analysis
 Neighborhood Components Analysis
 Relevant Component Analysis
Local Adaptive Distance Metric Learning
- K Nearest Neighbor classifier
  - $N(x_0)$: the nearest neighbors of $x_0$; $(x_1, y_1), \ldots, (x_n, y_n)$: training examples
  - $\delta(y_i = j) = 1$ if $y_i = j$, and $0$ otherwise
  - $\Pr(y = j \mid x_0) = \frac{1}{|N(x_0)|} \sum_{x_i \in N(x_0)} \delta(y_i = j)$
- Modify the local neighborhood by a distance metric (see the sketch below) that
  - elongates the distance along the dimensions where the class labels change rapidly
  - squeezes the distance along the dimensions that are almost independent of the class labels
- Assumption of kNN
  - Pr(y|x) in the local neighborhood is constant or smooth
  - However, this is not necessarily true!
    - near class boundaries
    - along irrelevant dimensions
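A minimal numpy sketch of the kNN class-posterior estimate above, with the Euclidean distance replaced by a Mahalanobis distance $d_A$; the diagonal matrix A used in the toy example (squeezing an irrelevant dimension) is purely illustrative.

```python
import numpy as np

def knn_posterior(x0, X, y, A, k=5):
    """Estimate Pr(y = j | x0) by a vote among the k nearest neighbors,
    where nearness is measured by the Mahalanobis distance d_A(x, x0)."""
    diff = X - x0                                   # (n, m)
    d2 = np.einsum('ij,jk,ik->i', diff, A, diff)    # squared distances under A
    nn = np.argsort(d2)[:k]
    classes = np.unique(y)
    return {c: np.mean(y[nn] == c) for c in classes}

# Example: squeezing an irrelevant second dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)          # labels depend only on dimension 0
A = np.diag([1.0, 0.01])               # down-weight the irrelevant dimension
print(knn_posterior(np.array([0.5, -2.0]), X, y, A, k=7))
```

With A = I this reduces to the ordinary kNN vote; the local methods that follow differ in how A is chosen around each query point.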
Local Feature Relevance
[J. Friedman,1994]
- Assume the least-squares estimate for predicting $f(x)$ is
  $E[f] = \int f(x)\, p(x)\, dx$
- Conditioned on $x_i = z$, the least-squares estimate of $f(x)$ becomes
  $E[f \mid x_i = z] = \int f(x)\, p(x \mid x_i = z)\, dx$, where $p(x \mid x_i = z) = \dfrac{p(x)\,\delta(x_i - z)}{\int p(x')\,\delta(x'_i - z)\, dx'}$
- The improvement in prediction error from knowing $x_i = z$:
  $I_i(z) = E[(f(x) - E f)^2 \mid x_i = z] - E[(f(x) - E(f(x) \mid x_i = z))^2 \mid x_i = z] = (E f - E[f \mid x_i = z])^2$
- For $z = (z_1, \ldots, z_m)$, a measure of the relative influence of the ith input variable on the variation of $f(x)$ at $x = z$ is
  $r_i^2(z) = I_i^2(z_i) \Big/ \sum_{k=1}^{m} I_k^2(z_k)$
Locally Adaptive Feature Relevance
Analysis [C. Domeniconi, 2002]
- Use a Chi-squared distance analysis to compute a metric that produces a neighborhood in which
  - the posterior probabilities are approximately constant
  - the metric is highly adaptive to query locations
- Chi-squared distance between the true and estimated posteriors at the test point $x_0$:
  $r(X, x_0) = \sum_{j=1}^{J} \dfrac{[\,p(j \mid X) - p(j \mid x_0)\,]^2}{p(j \mid x_0)}$
- Use the Chi-squared distance for feature relevance, i.e. to tell to what extent the ith dimension can be relied on for predicting $p(j \mid x_0)$
Local Relevance Measure
in ith Dimension
- The local relevance measure in the ith dimension:
  $r_i(z) = \sum_{j=1}^{J} \dfrac{[\,\Pr(j \mid z) - \overline{\Pr}(j \mid x_i = z_i)\,]^2}{\overline{\Pr}(j \mid x_i = z_i)}$
- $\overline{\Pr}(j \mid x_i = z_i)$ is a conditional expectation of $p(j \mid x)$:
  $\overline{\Pr}(j \mid x_i = z_i) = E(\Pr(j \mid x) \mid x_i = z_i)$
- The closer $\overline{\Pr}(j \mid x_i = z_i)$ is to $\Pr(j \mid z)$, the more information the ith dimension provides for predicting $\Pr(j \mid z)$
- $r_i(z)$ measures the distance between $\Pr(j \mid z)$ and the conditional expectation of $\Pr(j \mid x)$ at location $z$
- Calculate $r_i(z)$ for each point $z$ in the neighborhood of $x_0$
Locally Adaptive Feature Relevance Analysis (Cont'd)
- A local relevance measure in dimension i, averaged over the neighborhood $N(x_0)$ of $x_0$ (K points):
  $\bar{r}_i(x_0) = \frac{1}{K} \sum_{z \in N(x_0)} r_i(z)$
- Relative relevance:
  $w_i(x_0) = \dfrac{R_i(x_0)^t}{\sum_{l=1}^{q} R_l(x_0)^t}$, where $R_i(x_0) = \max_{j} \bar{r}_j(x_0) - \bar{r}_i(x_0)$
  (t = 1 or 2 corresponds to linear or quadratic weighting)
- Weighted distance (see the sketch below):
  $D(x, y) = \sqrt{\sum_{i=1}^{q} w_i (x_i - y_i)^2}$
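A small numpy sketch of the relevance weighting and weighted distance above; the relevance values in the example are made-up numbers, and the small epsilon guarding against a zero denominator is an added assumption.

```python
import numpy as np

def relevance_weights(r_bar, t=1, eps=1e-12):
    """w_i = R_i^t / sum_l R_l^t with R_i = max_j r_bar_j - r_bar_i.
    Dimensions whose r_bar is small (locally more informative) receive larger weight;
    t = 1 or 2 gives linear or quadratic weighting."""
    R = r_bar.max() - r_bar
    return R**t / (np.sum(R**t) + eps)

def weighted_distance(x, y, w):
    """D(x, y) = sqrt(sum_i w_i (x_i - y_i)^2)."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

r_bar = np.array([0.2, 0.9, 0.5])        # hypothetical averaged local relevances r_bar_i(x0)
w = relevance_weights(r_bar, t=2)
print(w, weighted_distance(np.zeros(3), np.ones(3), w))
```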
Local Linear Discriminant Analysis [T. Hastie et al. 1996]
- $S_b$: the between-class covariance matrix; $S_w$: the within-class covariance matrix
- LDA finds the principal eigenvectors of the matrix $T = S_w^{-1} S_b$
  - to keep patterns from the same class close
  - while separating patterns from different classes
- LDA metric: stack the principal eigenvectors of T together
Local Linear Discriminant Analysis (Cont'd)
- The nearest-neighbor metric needs local adaptation
- Initialize $\Sigma$ as the identity matrix
- Given a test point $x_0$, iterate the two steps below (a rough sketch follows this slide):
  - Estimate $S_b$ and $S_w$ from the local neighborhood of $x_0$ measured by $\Sigma$
  - Form a local metric that behaves like the LDA metric:
    $\Sigma = S_w^{-1/2} \left[ S_w^{-1/2} S_b S_w^{-1/2} + \epsilon I \right] S_w^{-1/2}$
  ($\epsilon$ is a small tuning parameter that prevents neighborhoods from extending to infinity)
- The local $S_b$ captures the inconsistency of the class centroids
- The estimated metric
  - shrinks the neighborhood in directions in which the local class centroids differ, producing a neighborhood in which the class centroids coincide
  - shrinks neighborhoods in directions orthogonal to these local decision boundaries, and elongates them parallel to the boundaries
- Problems: overfitting and scalability; the number of parameters is quadratic in the number of features
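A rough numpy sketch of this iteration (a simplified reading of the procedure above, not the authors' exact estimator); the neighborhood size, epsilon and iteration count are illustrative choices.

```python
import numpy as np

def local_lda_metric(x0, X, y, n_neighbors=50, eps=1.0, n_iter=3):
    """Iterate between choosing a neighborhood of x0 under the current metric Sigma
    and re-estimating the local between/within-class scatter (Sb, Sw) from it."""
    d = X.shape[1]
    Sigma = np.eye(d)                           # start from the identity metric
    for _ in range(n_iter):
        diff = X - x0
        dist = np.einsum('ij,jk,ik->i', diff, Sigma, diff)
        nn = np.argsort(dist)[:n_neighbors]
        Xn, yn = X[nn], y[nn]
        mean_all = Xn.mean(axis=0)
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in np.unique(yn):
            Xc = Xn[yn == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)
            Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
        # Sigma = Sw^{-1/2} [Sw^{-1/2} Sb Sw^{-1/2} + eps I] Sw^{-1/2}
        w, V = np.linalg.eigh(Sw + 1e-8 * np.eye(d))
        Sw_inv_sqrt = (V / np.sqrt(w)) @ V.T
        inner = Sw_inv_sqrt @ Sb @ Sw_inv_sqrt + eps * np.eye(d)
        Sigma = Sw_inv_sqrt @ inner @ Sw_inv_sqrt
    return Sigma
```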
Neighborhood Components Analysis
[J. Goldberger et al. 2005]
- NCA learns a Mahalanobis distance metric for the kNN classifier by maximizing the leave-one-out cross-validation performance (sketched below).
- The probability of classifying $x_i$ correctly is a weighted count involving pairwise distances:
  $p_{ij} = \dfrac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}$, and $p_i = \sum_{j \in C_i} p_{ij}$, where $C_i = \{ j \mid c_j = c_i \}$
- The expected number of correctly classified points:
  $f(A) = \sum_{i=1}^{n} p_i$
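A short numpy sketch that evaluates the NCA objective f(A) above for a given linear map A; the gradient-based maximization of f over A (which is how NCA is actually trained) is not shown.

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i p_i, the expected number of correctly classified points,
    with p_ij the softmax over negative squared distances in the projected space."""
    Z = X @ A.T                                     # project: z_i = A x_i
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                    # p_ii = 0
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :])
    p_i = np.sum(P * same, axis=1)                  # p_i = sum_{j in C_i} p_ij
    return p_i.sum()
```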
Relevant Component Analysis (RCA) [N. Shental et al. 2002]
- (Figure: unlabeled data, labeled data, chunklet data)
- Constructs a Mahalanobis distance metric based on a sum of in-chunklet covariance matrices
- Chunklet: a set of data points that share the same but unknown class label; chunklet $j$: $\{x_{ji}\}_{i=1}^{n_j}$, with mean $\hat{m}_j$
- Sum of in-chunklet covariance matrices for p points in k chunklets (see the sketch below):
  $\hat{C} = \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \hat{m}_j)(x_{ji} - \hat{m}_j)^T$
- Apply the linear transformation $y = \hat{C}^{-1/2} x$
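A compact numpy sketch of the RCA computation above: average the within-chunklet covariances and whiten the data with $\hat{C}^{-1/2}$.

```python
import numpy as np

def rca_transform(chunklets):
    """RCA sketch: average the in-chunklet covariances and whiten with C^{-1/2}.

    chunklets : list of (n_j, d) arrays, each holding points known to share a label.
    Returns (C_hat, W) where y = W x is the RCA transformation."""
    p = sum(len(ch) for ch in chunklets)
    d = chunklets[0].shape[1]
    C = np.zeros((d, d))
    for ch in chunklets:
        centered = ch - ch.mean(axis=0)             # subtract the chunklet mean m_j
        C += centered.T @ centered
    C /= p
    w, V = np.linalg.eigh(C)
    W = (V / np.sqrt(np.maximum(w, 1e-12))) @ V.T   # C^{-1/2}
    return C, W
```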
Information Maximization under Chunklet Constraints [A. Bar-Hillel et al., 2003]
- Maximizes the mutual information I(X, Y) subject to within-chunklet compactness constraints:
  $\max_{f \in F} I(X, Y) \quad \text{s.t.} \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \| y_{ji} - m_j^y \|^2 \leq K \qquad (*)$
  where $m_j^y$ is the transformed mean of the jth chunklet and K is a threshold constant.
- Let $B = A^T A$; (*) can be further written as
  $\max_{B} |B| \quad \text{s.t.} \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \| x_{ji} - m_j \|_B^2 \leq K, \quad B \succ 0$
RCA algorithm applied to
synthetic Gaussian data
 (a) The fully labeled data set with 3 classes.
 (b) Same data unlabeled; classes' structure is less evident.
 (c) The set of chunklets
 (d) The centered chunklets, and their empirical covariance.
 (e) The RCA transformation applied to the chunklets. (centered)
 (f) The original data after applying the RCA transformation.
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
 Conclusions
Unsupervised Distance Metric Learning
- Most dimension reduction approaches learn a distance metric without label information, e.g. PCA
- I will present five methods for dimensionality reduction:

             linear    |  nonlinear
    Global | PCA, MDS  |  ISOMAP
    Local  |           |  LLE, Laplacian Eigenmap

- A Unified Framework for Dimension Reduction
  - Solution 1
  - Solution 2
Dimensionality Reduction Algorithms
 PCA finds the subspace that best preserves the variance of the data.
 MDS finds the subspace that best preserves the interpoint distances.
 Isomap finds the subspace that best preserves the geodesic
interpoint distances. [Tenenbaum et al, 2000].
 LLE finds the subspace that best preserves the local linear structure
of the data [Roweis and Saul, 2000].
 Laplacian Eigenmap finds the subspace that best preserves local
neighborhood information in the adjacency graph [M. Belkin and P.
Niyogi,2003].
Multidimensional Scaling (MDS)
- MDS finds the rank-m projection that best preserves the inter-point distances given by the matrix D
- Converts distances to inner products:
  $B = \psi(D) = X^T X$
- Calculate X:
  $[V_{MDS}, \Lambda_{MDS}] = \mathrm{eig}(B)$, $\quad X = V_{MDS} \Lambda_{MDS}^{1/2}$
- Rank-m projection Y closest to X:
  $Y = V_{MDS}^{m} (\Lambda_{MDS}^{m})^{1/2}$
- Given the distance matrix among cities, MDS produces a map of their relative locations.
PCA (Principal Component Analysis)
- PCA finds the subspace that best preserves the data variance:
  $\Sigma = \mathrm{Var}(X)$, $\quad [V_{PCA}, \Lambda_{PCA}] = \mathrm{eig}(\Sigma)$
- PCA projection of X with rank m:
  $Y = (V_{PCA}^{m})^T X$
- PCA vs. MDS (checked numerically below):
  $V_{PCA} = X V_{MDS} \Lambda_{MDS}^{-1/2}$, $\quad Y_{PCA} = \Lambda_{MDS}^{1/2} V_{MDS}^T = Y_{MDS}^T$
  - In the Euclidean case, MDS only differs from PCA by starting with D and calculating X.
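A small numpy check of this PCA/MDS duality; here ψ is taken to be the usual double-centering of the squared-distance matrix (an assumption, since the slide leaves it implicit), and the random data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 points in 3-D, rows are samples
Xc = X - X.mean(axis=0)                 # center the data
m = 2

# PCA: eigenvectors of the covariance matrix, project onto the top-m directions.
C = Xc.T @ Xc / len(Xc)
wP, VP = np.linalg.eigh(C)
Y_pca = Xc @ VP[:, ::-1][:, :m]

# Classical MDS: double-center the squared distance matrix to obtain the Gram
# matrix B, then embed with the top-m eigenvectors scaled by sqrt(eigenvalue).
D2 = np.sum((Xc[:, None] - Xc[None, :]) ** 2, axis=-1)
J = np.eye(len(Xc)) - np.ones((len(Xc), len(Xc))) / len(Xc)
B = -0.5 * J @ D2 @ J
wM, VM = np.linalg.eigh(B)
Y_mds = VM[:, ::-1][:, :m] * np.sqrt(wM[::-1][:m])

# In the Euclidean case the two embeddings agree up to per-axis sign flips.
print(np.allclose(np.abs(Y_pca), np.abs(Y_mds), atol=1e-6))
```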
Isometric Feature Mapping (ISOMAP) [Tenenbaum et al, 2000]
- Geodesic: the shortest curve on a manifold that connects two points on the manifold
  (e.g. on a sphere, geodesics are great circles)
- Geodesic distance: the length of the geodesic
- Points that are far apart measured by geodesic distance may appear close measured by Euclidean distance
ISOMAP
 Take a distance matrix as input
 Construct a weighted graph G based on neighborhood relations
 Estimate pairwise geodesic distance by
“a sequence of short hops” on G
 Apply MDS to the geodesic distance matrix
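A self-contained sketch of these steps with numpy and scipy (k-nearest-neighbor graph, shortest paths as geodesic estimates, then classical MDS); the neighborhood size is an arbitrary illustrative choice, and the kNN graph is assumed to be connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=8, m=2):
    """ISOMAP sketch: kNN graph -> graph shortest paths as geodesic distances
    -> classical MDS on the geodesic distance matrix (graph assumed connected)."""
    n = len(X)
    D = np.sqrt(np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
    G = np.full((n, n), np.inf)                  # weighted graph G; inf = no edge
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                       # symmetrize the neighborhood relation
    geo = shortest_path(G, method='D', directed=False)   # "a sequence of short hops"
    # classical MDS on the geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    w, V = np.linalg.eigh(B)
    return V[:, ::-1][:, :m] * np.sqrt(np.maximum(w[::-1][:m], 0))
```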
Locally Linear Embedding (LLE)
[Roweis and Saul, 2000]
 LLE finds the subspace that best preserves the local
linear structure of the data
 Assumption: manifold is locally “linear”
Each sample in the input space is a linearly weighted
average of its neighbors.
 A good projection should best preserve this geometric
locality property
LLE (Cont'd)
- W: a linear representation of every data point by its neighbors (a sketch follows below)
- Choose W by minimizing the reconstruction error:
  $\min_W \sum_{i=1}^{n} \Big\| x_i - \sum_{j} W_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_{j} W_{ij} = 1 \ \forall x_i; \ W_{ij} = 0 \text{ if } x_j \text{ is not a neighbor of } x_i$
- Calculate a neighborhood-preserving mapping Y by minimizing the reconstruction error
  $\Phi(Y) = \sum_{i} \Big\| y_i - \sum_{j} W^*_{ij} y_j \Big\|^2, \quad \text{where } W^* = \arg\min_W \varepsilon(W)$
- Y is given by the eigenvectors corresponding to the m lowest nonzero eigenvalues of the matrix $(I - W)^T (I - W)$
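A compact numpy sketch of both LLE steps above (reconstruction weights via a regularized local least-squares solve, then the bottom eigenvectors); the neighborhood size and regularization constant are illustrative choices.

```python
import numpy as np

def lle(X, n_neighbors=10, m=2, reg=1e-3):
    """LLE sketch: reconstruction weights W from local least squares, then the
    embedding Y from the bottom eigenvectors of (I - W)^T (I - W)."""
    n = len(X)
    D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                         # neighbors relative to x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(len(nbrs)) # regularize the local Gram matrix
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                   # enforce sum_j W_ij = 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    # discard the constant eigenvector (eigenvalue ~ 0); keep the next m
    return vecs[:, 1:m + 1]
```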
 Laplacian Eigenmap finds the subspace that best preserves local
neighborhood information in the adjacency graph [M. Belkin and P. Niyogi, 2003]
- Graph Laplacian: given a graph G with weight matrix W,
  D is the diagonal matrix with $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the graph Laplacian
- Detailed steps (sketched below):
  - Construct the adjacency graph G
  - Weight the edges: $W_{ij} = 1$ if nodes i and j are connected, and 0 otherwise
  - Solve the generalized eigen-decomposition $L f = \lambda D f$
  - Embedding: the eigenvectors corresponding to the m smallest nonzero eigenvalues
A Unified Framework for Dimension Reduction Algorithms
- All of these methods use an eigendecomposition to obtain a lower-dimensional embedding of data lying on a non-linear manifold.
- Normalize the affinity matrix H to obtain $\hat{H}$
- Compute $\mathrm{eig}(\hat{H})$: the m largest positive eigenvalues $\lambda_t$ and eigenvectors $v_t$
- The embedding of $x_i$ has two alternative solutions
  - Solution 1 (MDS & Isomap): $y_i$ with $y_{it} = \sqrt{\lambda_t}\, v_{ti}$;
    $\langle y_i, y_j \rangle$ is the best approximation of $\hat{H}_{ij}$ in the squared-error sense.
  - Solution 2 (LLE & Laplacian Eigenmap): $e_i$ with $e_{it} = v_{ti}$
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
 Conclusions
Distance Metric Learning based on SVM
- Large Margin Nearest Neighbor Based Distance Metric Learning
  - Objective Function
  - Reformulation as SDP
- Cast Kernel Margin Maximization into a SDP Problem
  - Maximum Margin
  - Cast into SDP problem
  - Apply to Hard Margin and Soft Margin

Large Margin Nearest Neighbor Based Distance Metric Learning [K. Weinberger et al., 2006]
- Learns a Mahalanobis distance metric in the kNN classification setting by SDP, such that
  - the k nearest neighbors of each input belong to the same class
  - examples from different classes are separated by a large margin
- After training
  - the k = 3 target neighbors lie within a smaller radius
  - differently labeled inputs lie outside this smaller radius with a margin of at least one unit distance
Large Margin Nearest Neighbor Based Distance Metric Learning (Cont'd)
- Cost function (see the sketch below):
  $\varepsilon(L) = \sum_{ij} \eta_{ij} \|L(x_i - x_j)\|^2 + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \big[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \big]_+$
  where $[z]_+ = \max(z, 0)$ denotes the standard hinge loss and the constant $C > 0$;
  $y_{ij} \in \{0, 1\}$ indicates whether the labels $y_i$ and $y_j$ match, and $\eta_{ij} \in \{0, 1\}$ indicates whether $x_j$ is a target neighbor of $x_i$.
- The first term penalizes large distances between each input and its target neighbors
- The hinge loss is incurred by differently labeled inputs whose distance to $x_i$ does not exceed, by at least one unit, the distance from $x_i$ to any of its target neighbors; penalizing them keeps such inputs from invading each other's neighborhoods
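A minimal numpy sketch that evaluates the cost above for a given linear map L (so that M = LᵀL); the dictionary of precomputed target neighbors and the loop-based impostor search are illustrative simplifications, and the SDP/gradient optimization itself is not shown.

```python
import numpy as np

def lmnn_loss(L, X, y, target_neighbors, C=1.0):
    """Evaluate the large-margin nearest-neighbor cost for a linear map L.

    target_neighbors : dict {i: list of indices j with eta_ij = 1}.
    """
    Z = X @ L.T
    pull, push = 0.0, 0.0
    for i, targets in target_neighbors.items():
        for j in targets:
            d_ij = np.sum((Z[i] - Z[j]) ** 2)
            pull += d_ij                                   # pull target neighbors close
            for l in np.where(y != y[i])[0]:               # differently labeled inputs
                d_il = np.sum((Z[i] - Z[l]) ** 2)
                push += max(0.0, 1.0 + d_ij - d_il)        # hinge: margin of one unit
    return pull + C * push
```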
Reformulation as SDP
- Let $\|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j)$ and introduce slack variables $\xi_{ijl}$. The resulting SDP is:
  $\min_{M} \ \sum_{ij} \eta_{ij} (x_i - x_j)^T M (x_i - x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il})\, \xi_{ijl}$
  $\text{s.t.} \ (x_i - x_l)^T M (x_i - x_l) - (x_i - x_j)^T M (x_i - x_j) \geq 1 - \xi_{ijl}, \quad \xi_{ijl} \geq 0, \quad M \succeq 0$
Cast Kernel Margin Maximization into a SDP Problem [G. R. G. Lanckriet et al., 2004]
- Maximum margin: the decision boundary has the maximum minimum distance from the closest training point
  - Hard margin: linearly separable
  - Soft margin: not linearly separable
- The performance measure, generalized from the dual solutions of the different margin-maximization problems:
  $\omega_{C,\tau}(K) = \max_{\alpha} \ 2\alpha^T e - \alpha^T (G(K) + \tau I)\alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C, \ \alpha^T y = 0$
  evaluated on the training data w.r.t. K, where $G(K)$ is the Gram matrix with $G(K)_{ij} = y_i y_j K_{ij}$.
Cast into SDP Problem
- General form:
  $\min_{K \succeq 0} \ \omega_{C}(K) \quad \text{s.t.} \quad \mathrm{trace}(K) = c$
- Hard margin: $\omega_H(K_{tr}) = \omega_{\infty, 0}(K_{tr})$
- 1-norm soft margin: $\omega_{S1}(K_{tr}) = \omega_{C, 0}(K_{tr})$
- 2-norm soft margin: $\omega_{S2}(K_{tr}) = \omega_{\infty, \tau}(K_{tr})$
- For the 1-norm soft margin, the problem can be written as the SDP
  $\min_{K, t, \lambda, \nu, \delta} \ t$
  $\text{s.t.} \ \mathrm{trace}(K) = c, \ K \succeq 0, \ \nu \geq 0, \ \delta \geq 0,$
  $\begin{pmatrix} G(K_{tr}) & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2 C \delta^T e \end{pmatrix} \succeq 0$
Outline
 Introduction
 Supervised Global Distance Metric Learning
 Supervised Local Distance Metric Learning
 Unsupervised Distance Metric Learning
 Distance Metric Learning based on SVM
 Kernel Methods for Distance Metrics Learning
 Conclusions
Kernel Methods for
Distance Metrics Learning
 Learning a good kernel is equivalent to distance metric
learning
 Kernel Alignment
 Kernel Alignment with SDP
 Learning with Idealized Kernel
 Ideal Kernel
 The Idealized Kernel
Kernel Alignment
[N. Cristianini,2001]
- A measure of similarity between two kernel functions, or between a kernel and a target function
- The Frobenius inner product between two kernel matrices based on kernels $k_1$ and $k_2$:
  $\langle K_1, K_2 \rangle_F = \sum_{i,j=1}^{n} K_1(x_i, x_j)\, K_2(x_i, x_j)$
- The (empirical) alignment of $K_1$ and $K_2$ w.r.t. the sample S:
  $\hat{A}(S, k_1, k_2) = \dfrac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}$
- It measures the degree of agreement between a kernel and a given learning task; with the target $k_2 = y y^T$, $y \in \{\pm 1\}^m$:
  $\hat{A}(S, k_1, y y^T) = \dfrac{\langle K_1, y y^T \rangle_F}{m \sqrt{\langle K_1, K_1 \rangle_F}}$
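A few lines of numpy implementing the alignment above; `target_alignment` specializes it to the label kernel yyᵀ.

```python
import numpy as np

def alignment(K1, K2):
    """Empirical alignment A(K1, K2) = <K1,K2>_F / sqrt(<K1,K1>_F <K2,K2>_F)."""
    num = np.sum(K1 * K2)
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def target_alignment(K, y):
    """Alignment with the target yy^T for labels y in {-1, +1}^m."""
    Y = np.outer(y, y)
    return alignment(K, Y)      # equals <K, yy^T>_F / (m * sqrt(<K,K>_F))
```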

Kernel Alignment with SDP [G. R. G. Lanckriet et al., 2004]
- Optimize the alignment between a set of labels and a kernel matrix using SDP, in a transductive setting:
  $K = \begin{pmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^T & K_{t} \end{pmatrix}$, where $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, $i, j = 1, \ldots, n_{tr} + n_t$
- Optimizing an objective function over the training-data block leads to automatic tuning of the testing-data block:
  $\max_{K} \ \hat{A}(S, K_{tr}, y y^T) \quad \text{s.t.} \quad K \succeq 0, \ \mathrm{trace}(K) = 1$
- Introducing A with $K^T K \preceq A$ and $\mathrm{trace}(A) \leq 1$, this reduces to
  $\max_{A, K} \ \langle K_{tr}, y y^T \rangle_F \quad \text{s.t.} \quad \mathrm{trace}(A) \leq 1, \ \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0, \ K \succeq 0$
Learning with Idealized Kernel [J. T. Kwok and I. W. Tsang, 2003]
- Idealize a given kernel by making it more similar to the ideal kernel matrix.
- Ideal kernel:
  $k^*(x_i, x_j) = 1$ if $y(x_i) = y(x_j)$, and $0$ if $y(x_i) \neq y(x_j)$
- Idealized kernel:
  $\tilde{k} = k + \frac{\gamma}{2} k^*$
- The alignment of $\tilde{k}$ is greater than that of $k$ once $\gamma$ exceeds a threshold determined by $\langle K, K^* \rangle_F$ and $n_+^2 + n_-^2$, where $n_+$ and $n_-$ are the numbers of positive and negative samples.
- Under the original distance metric M: $k(x_i, x_j) = x_i^T M x_j$ with $M \succeq 0$, and $d_{ij}^2 = (x_i - x_j)^T M (x_i - x_j)$
- The idealized kernel modifies the pairwise distances:
  $\tilde{d}_{ij}^2 = \tilde{K}_{ii} + \tilde{K}_{jj} - 2\tilde{K}_{ij} = d_{ij}^2$ if $y_i = y_j$, and $d_{ij}^2 + \gamma$ if $y_i \neq y_j$
The Idealized Kernel (Cont'd)
- We modify the distance: under a transformation A,
  $\tilde{d}_{ij}^2 = (x_i - x_j)^T A A^T (x_i - x_j)$
  with the target $\tilde{d}_{ij}^2 = d_{ij}^2$ if $y_i = y_j$, and $d_{ij}^2 + \gamma$ if $y_i \neq y_j$
- Search for a matrix A under which
  - different classes are pulled apart by an amount of at least $\gamma$
  - the same class gets closer together
- Introduce slack variables $\xi_{ij}$ for error tolerance; with $B = A A^T$:
  $\min_{B, \xi_{ij}} \ \frac{1}{2}\|B\|_F^2 + \frac{C_S}{N_S} \sum_{(x_i, x_j) \in S} \xi_{ij} + \frac{C_D}{N_D} \sum_{(x_i, x_j) \in D} \xi_{ij}$
  $\text{s.t.} \ d_{ij}^2 - \tilde{d}_{ij}^2 \geq -\xi_{ij} \ \text{for} \ (x_i, x_j) \in S, \quad \tilde{d}_{ij}^2 - d_{ij}^2 \geq \gamma - \xi_{ij} \ \text{for} \ (x_i, x_j) \in D, \quad \xi_{ij} \geq 0$
Conclusions
- A comprehensive review covering:
  - supervised distance metric learning
  - unsupervised distance metric learning
  - maximum margin based distance metric learning approaches
  - kernel methods for distance metrics
- Challenges:
  - unsupervised distance metric learning
  - going local in a principled manner
  - learning an explicit nonlinear distance metric in the local sense
  - efficiency issues
