Dimensionality Reduction Techniques
Dr Yogeshwar Singh Dadwhal
Acknowledgements
Slides from : Pattern Recognition Dr. George Bebis & Duda et al. University of Nevada (UNR) , Statquest
Images and Content from : Feature Selection Jain, A.K.; Duin , P.W.;
Jianchang Mao, “Statistical pattern recognition: a review”, IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
Curse of Dimensionality
• Increasing the number of features
will not always improve
classification accuracy.
• In practice, the inclusion of more
features might actually lead to
worse performance.
• The number of training examples
required increases exponentially
with dimensionality d (i.e., k^d),
where k is the number of bins per feature.
For example, with k=3 bins per feature there are
3^1, 3^2, and 3^3 bins for d = 1, 2, and 3, respectively.
Dimensionality Reduction
• What is the objective?
− Choose an optimum set of features of lower
dimensionality to improve classification accuracy.
• Different methods can be used to reduce
dimensionality:
− Feature extraction
− Feature selection
4
Dimensionality Reduction (cont’d)
Feature extraction: finds a
set of new features (i.e.,
through some mapping f())
from the existing features.
$$\mathbf{x}=\begin{bmatrix}x_1\\x_2\\ \vdots \\x_N\end{bmatrix}
\;\xrightarrow{\;f(\cdot)\;}\;
\mathbf{y}=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}$$

Feature selection:
chooses a subset of the
original features.

$$\mathbf{x}=\begin{bmatrix}x_1\\x_2\\ \vdots \\x_N\end{bmatrix}
\;\longrightarrow\;
\mathbf{y}=\begin{bmatrix}x_{i_1}\\x_{i_2}\\ \vdots \\x_{i_K}\end{bmatrix}$$

The mapping f()
could be linear or
non-linear.
In both cases, K<<N.
Feature Extraction
• Linear combinations are particularly attractive because
they are simpler to compute and analytically tractable.
• Given x ∈ RN, find a K x N matrix T such that:
y = Tx ∈ RK, where K<<N
This is a projection from
the N-dimensional space
to a K-dimensional space.
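As a quick illustration of this projection, here is a minimal NumPy sketch (the matrix T is random here purely for illustration; PCA and LDA choose T in specific ways discussed below):

```python
import numpy as np

N, K = 5, 2                       # original and reduced dimensionality (K << N in practice)
rng = np.random.default_rng(0)

T = rng.standard_normal((K, N))   # a K x N projection matrix (random here, for illustration only)
x = rng.standard_normal(N)        # an N-dimensional feature vector

y = T @ x                         # y = Tx lives in R^K
print(y.shape)                    # (2,)
```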
Feature Extraction (cont’d)
• From a mathematical point of view, finding an optimum
mapping y=𝑓(x) is equivalent to optimizing an objective
criterion.
• Different methods use different objective criteria, e.g.,
− Minimize Information Loss: represent the data as accurately as
possible in the lower-dimensional space.
− Maximize Discriminatory Information: enhance the class-
discriminatory information in the lower-dimensional space.
6
Feature Extraction (cont’d)
• Popular linear feature extraction methods:
− Principal Components Analysis (PCA): Seeks a projection that
preserves as much information in the data as possible.
− Linear Discriminant Analysis (LDA): Seeks a projection that best
discriminates the data.
• Many other methods:
− Making features as independent as possible (Independent
Component Analysis or ICA).
− Retaining interesting directions (Projection Pursuit).
− Embedding to lower dimensional manifolds (Isomap, Locally Linear
Embedding or LLE).
7
Vector Representation
• A vector x ϵ Rn can be
represented by n components:
• Assuming the standard base
<v1, v2, …, vN> (i.e., unit vectors
in each dimension), xi can be
obtained by projecting x along
the direction of vi:
• x can be “reconstructed” from
its projections as follows:
8
$$\mathbf{x}=\begin{bmatrix}x_1\\x_2\\ \vdots \\x_N\end{bmatrix}
\qquad
x_i=\frac{\mathbf{v}_i^T\mathbf{x}}{\mathbf{v}_i^T\mathbf{v}_i}=\mathbf{v}_i^T\mathbf{x}
\qquad
\mathbf{x}=\sum_{i=1}^{N}x_i\mathbf{v}_i=x_1\mathbf{v}_1+x_2\mathbf{v}_2+\dots+x_N\mathbf{v}_N$$
• Since the basis vectors are the same for all x ∈ Rn
(standard basis), we typically represent x simply by
its n components.
Vector Representation (cont’d)
• Example assuming n=2:
• Assuming the standard base
<v1=i, v2=j>, xi can be obtained
by projecting x along the
direction of vi:
• x can be “reconstructed” from
its projections as follows:
$$\mathbf{x}=\begin{bmatrix}3\\4\end{bmatrix}$$

$$x_1=\mathbf{i}^T\mathbf{x}=\begin{bmatrix}1&0\end{bmatrix}\begin{bmatrix}3\\4\end{bmatrix}=3
\qquad
x_2=\mathbf{j}^T\mathbf{x}=\begin{bmatrix}0&1\end{bmatrix}\begin{bmatrix}3\\4\end{bmatrix}=4$$

$$\mathbf{x}=3\,\mathbf{i}+4\,\mathbf{j}$$
10
Principal Component Analysis (PCA)
• If x∈RN, then it can be written as a linear combination of an
orthonormal set of N basis vectors <v1,v2,…,vN> in RN
(e.g., using the standard base):

$$\mathbf{x}=\sum_{i=1}^{N}x_i\mathbf{v}_i=x_1\mathbf{v}_1+x_2\mathbf{v}_2+\dots+x_N\mathbf{v}_N,
\qquad
\mathbf{v}_i^T\mathbf{v}_j=\begin{cases}1 & \text{if } i=j\\ 0 & \text{otherwise}\end{cases},
\qquad
x_i=\frac{\mathbf{v}_i^T\mathbf{x}}{\mathbf{v}_i^T\mathbf{v}_i}=\mathbf{v}_i^T\mathbf{x}$$

• PCA seeks to approximate x in a subspace of RN using a
new set of K<<N basis vectors <u1, u2, …,uK> in RN:

$$\hat{\mathbf{x}}=\sum_{i=1}^{K}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_K\mathbf{u}_K
\qquad\text{where}\qquad
y_i=\frac{\mathbf{u}_i^T\mathbf{x}}{\mathbf{u}_i^T\mathbf{u}_i}=\mathbf{u}_i^T\mathbf{x}
\qquad\text{(reconstruction)}$$

such that $\|\mathbf{x}-\hat{\mathbf{x}}\|$ is minimized
(i.e., minimize information loss).
11
Principal Component Analysis (PCA)
• The “optimal” set of basis vectors <u1, u2, …,uK> can be
found as follows (we will see why):
(1) Find the eigenvectors u𝑖 of the covariance matrix of the
(training) data Σx
Σx u𝑖= 𝜆𝑖 u𝑖
(2) Choose the K “largest” eigenvectors u𝑖 (i.e., corresponding
to the K “largest” eigenvalues 𝜆𝑖)
<u1, u2, …,uK> correspond to the “optimal” basis!
We refer to the “largest” eigenvectors u𝑖 as principal components.
Data Set
12
Adding another dimension
13
Adding another dimension
14
How do I
plot?
From 4-D to 2-D
15
Consider the 2D set
Calculating Centre of Data
17
Average measurement for Gene 1 and Gene 2
Data Shifting
18
Fitting a line passing the origin
19
20
21
Deciding if a fit is good or not?
22
23
Measuring distances to optimize
24
Squaring and adding all distances
25
Rotate the red line until we find the maximum sum of squared distances (SS)
of the projected points from the origin.
The line with the maximum SS is PC1; here its slope = 0.25.
26
Slope of PC1 = 0.25:
for every 4 units that we go out along the Gene 1 axis,
we go up 1 unit along the Gene 2 axis.
Conclusion: the data are mostly spread out along the Gene 1 axis
and only a little spread out along the Gene 2 axis.
PC1 is a linear combination of Gene 1 and Gene 2.
Length of Red Line is 4.12 !
27
Scaling the red line
28
The loading scores of PC1 tell us
that, in terms of PC1, Gene 1 is 4
times as important as Gene 2.
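The numbers above follow from simple arithmetic. A small sketch, assuming the slope of 0.25 quoted on the slides, reproduces the 4.12 length and the PC1 loading scores:

```python
import numpy as np

gene1, gene2 = 4.0, 1.0                  # slope 0.25: 4 units along Gene 1 for every 1 unit along Gene 2
length = np.hypot(gene1, gene2)          # sqrt(4^2 + 1^2) ~ 4.12
pc1 = np.array([gene1, gene2]) / length  # scale to unit length: the eigenvector (singular vector) for PC1
print(round(length, 2), pc1.round(3))    # 4.12 [0.97  0.243]
```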
Eigen Vector
29
Eigen Value
30
31
PC2
32
Scaling
This is the singular vector (eigenvector) for PC2.
The loading scores of PC2 tell us
that, in terms of PC2, Gene 2 is 4
times as important as Gene 1.
Eigen Value of PC2
33
Final PCA plot
34
Rotate the plot
Plot the corresponding points
Variation around a PC
35
Considering 3 Dimensions
36
Reducing to 2 Dimensions
37
To convert the 3D graph
into a 2D PCA graph, we
strip away everything but
the data and PC1 and
PC2
Project the samples
onto PC1 and PC2
Rotate the
samples
SOME BACKGROUND
MATHEMATICS ON
PCA
38
• Suppose we are given x1, x2, ..., xM (N x 1) vectors
Step 1: compute sample mean
Step 2: subtract sample mean (i.e., center data at zero)
Step 3: compute the sample covariance matrix Σx
39
PCA - Steps
N: # of features
M: # data
$$\bar{\mathbf{x}}=\frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_i$$

$$\Phi_i=\mathbf{x}_i-\bar{\mathbf{x}}$$

$$\Sigma_x=\frac{1}{M}\sum_{i=1}^{M}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T
=\frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T
=\frac{1}{M}AA^T$$

where A = [Φ1 Φ2 ... ΦΜ], i.e., the columns of A are the Φi (N x M matrix).
Step 4: compute the eigenvalues/eigenvectors of Σx
Since Σx is symmetric, <u1,u2,…,uN> form an orthogonal basis
in RN and we can represent any x∈RN as:
40
PCA - Steps
$$\Sigma_x\mathbf{u}_i=\lambda_i\mathbf{u}_i
\qquad\text{where we assume}\qquad
\lambda_1\ge\lambda_2\ge\dots\ge\lambda_N$$

$$\mathbf{x}-\bar{\mathbf{x}}=\sum_{i=1}^{N}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_N\mathbf{u}_N
\qquad\text{where}\qquad
y_i=\frac{\mathbf{u}_i^T(\mathbf{x}-\bar{\mathbf{x}})}{\mathbf{u}_i^T\mathbf{u}_i}=\mathbf{u}_i^T(\mathbf{x}-\bar{\mathbf{x}})\ \text{ if }\ \|\mathbf{u}_i\|=1$$

i.e., this is just a "change" of basis:

$$\mathbf{x}-\bar{\mathbf{x}}
\;\longrightarrow\;
\begin{bmatrix}y_1\\y_2\\ \vdots \\y_N\end{bmatrix}$$

Note: most software packages return the eigenvalues (and corresponding eigenvectors)
in decreasing order; if not, you can explicitly put them in this order.
Note: most software packages normalize ui to unit length to simplify calculations; if
not, you can explicitly normalize them.
Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K<<N) (i.e., corresponding to
the K largest eigenvalues where K is a parameter):
41
PCA - Steps
$$\mathbf{x}-\bar{\mathbf{x}}=\sum_{i=1}^{N}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_N\mathbf{u}_N$$

approximate x − x̄ using the first K eigenvectors only:

$$\hat{\mathbf{x}}-\bar{\mathbf{x}}=\sum_{i=1}^{K}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_K\mathbf{u}_K
\qquad\text{(reconstruction)}$$

i.e., x is represented by the K-dimensional vector of coefficients:

$$\mathbf{x}-\bar{\mathbf{x}}
\;\longrightarrow\;
\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}$$

Note that if K=N, then x̂ = x (i.e., zero reconstruction error).
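A minimal NumPy sketch of Steps 1-5, assuming the data are stored as the rows of a matrix X (M samples x N features); the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def pca(X, K):
    """PCA Steps 1-5: center the data, form the covariance matrix, keep the K largest eigenvectors."""
    mean = X.mean(axis=0)                    # Step 1: sample mean
    Phi = X - mean                           # Step 2: center the data (Phi_i = x_i - mean)
    Sigma = Phi.T @ Phi / X.shape[0]         # Step 3: covariance, equal to (1/M) A A^T
    evals, evecs = np.linalg.eigh(Sigma)     # Step 4: eigenvalues/eigenvectors (returned in ascending order)
    order = np.argsort(evals)[::-1]          # sort into decreasing order of eigenvalue
    evals, U = evals[order], evecs[:, order][:, :K]   # Step 5: keep the first K eigenvectors
    Y = Phi @ U                              # projection coefficients y_i = u_i^T (x - mean)
    return mean, evals, U, Y

# Tiny usage example: M = 100 samples, N = 5 features, reduced to K = 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
mean, evals, U, Y = pca(X, K=2)
X_hat = mean + Y @ U.T                       # reconstruction: x_hat = mean + sum_i y_i u_i
print(Y.shape, np.linalg.norm(X - X_hat))    # reconstruction error would be ~0 if K = N
```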
42
What is the Linear Transformation
implied by PCA?
• The linear transformation y = Tx which performs the
dimensionality reduction in PCA is:
$$\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}=U^T(\mathbf{x}-\bar{\mathbf{x}})
\qquad\text{i.e.,}\qquad
T=U^T \ \ (K\times N\ \text{matrix})$$

where U = [u1 u2 ... uK] is an N x K matrix, i.e., the columns of U are the
first K eigenvectors of Σx and the rows of T are the first K eigenvectors of Σx.

$$\hat{\mathbf{x}}-\bar{\mathbf{x}}=U\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}
=\sum_{i=1}^{K}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_K\mathbf{u}_K$$
What is the form of Σy ?
43
Using the diagonalization $\Sigma_x=P\Lambda P^T$, where the columns of P are the
eigenvectors of ΣX and the diagonal elements of Λ are the eigenvalues of ΣX
(i.e., the variances along the eigenvector directions), and taking U = P (K = N):

$$\Sigma_x=\frac{1}{M}\sum_{i=1}^{M}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T
=\frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T,
\qquad
\mathbf{y}_i=U^T(\mathbf{x}_i-\bar{\mathbf{x}})=P^T\Phi_i$$

$$\Sigma_y=\frac{1}{M}\sum_{i=1}^{M}(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T
=\frac{1}{M}\sum_{i=1}^{M}(P^T\Phi_i)(P^T\Phi_i)^T
=P^T\left(\frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T\right)P
=P^T\Sigma_xP
=P^T(P\Lambda P^T)P
=\Lambda$$

$$\Sigma_y=\Lambda$$

PCA de-correlates the data!
Preserves original variances!
44
Interpretation of PCA
• PCA chooses the eigenvectors of
the covariance matrix corresponding
to the largest eigenvalues.
• The eigenvalues correspond to the
variance of the data along the
eigenvector directions.
• Therefore, PCA projects the data
along the directions where the data
varies most.
• PCA preserves as much information in
the data as possible by preserving as
much variance as possible.
u1: direction of max variance
u2: orthogonal to u1
Example
• Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
• The sample covariance matrix is:

$$\hat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T$$

• The eigenvalues can be computed by finding the roots of the
characteristic polynomial, $\det(\hat{\Sigma}-\lambda I)=0$.
Example (cont’d)
• The eigenvectors are the solutions of the systems:
Note: if ui is a solution, then cui is also a solution where c≠0.
Eigenvectors can be normalized to unit-length using:
$$\Sigma_x\mathbf{u}_i=\lambda_i\mathbf{u}_i$$

$$\hat{\mathbf{v}}_i=\frac{\mathbf{v}_i}{\|\mathbf{v}_i\|}$$
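A quick numerical check of this example (the dataset is the one listed on the slide; the covariance uses the 1/n normalization from the formula above):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # sample covariance with the 1/n factor
evals, evecs = np.linalg.eigh(Sigma_hat)             # roots of the characteristic polynomial + eigenvectors
evecs = evecs / np.linalg.norm(evecs, axis=0)        # unit-length eigenvectors (eigh already returns them normalized)
print(mu_hat)
print(Sigma_hat)
print(evals)
print(evecs)
```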
47
How do we choose K ?
• K is typically chosen based on how much information
(variance) we want to preserve:
• If T=0.9, for example, we “preserve” 90% of the information
(variance) in the data.
• If K=N, then we "preserve" 100% of the information in the
data (i.e., just a "change" of basis and x̂ = x).
• Choose the smallest K that satisfies the following inequality:

$$\frac{\sum_{i=1}^{K}\lambda_i}{\sum_{i=1}^{N}\lambda_i}\ge T
\qquad\text{where } T \text{ is a threshold (e.g., } T=0.9)$$
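A minimal sketch of this selection rule, assuming evals holds the eigenvalues of Σx sorted in decreasing order:

```python
import numpy as np

def choose_k(evals, T=0.9):
    """Smallest K whose leading eigenvalues carry at least a fraction T of the total variance."""
    ratio = np.cumsum(evals) / np.sum(evals)      # fraction of variance preserved by the first K components
    return int(np.searchsorted(ratio, T) + 1)

evals = np.array([5.0, 2.0, 1.5, 1.0, 0.5])       # example eigenvalues, already in decreasing order
print(choose_k(evals, T=0.9))                     # 4: the first four components preserve >= 90% of the variance
```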
48
Approximation Error
• The approximation error (or reconstruction error) can be
computed by:

$$\|\mathbf{x}-\hat{\mathbf{x}}\|
\qquad\text{where}\qquad
\hat{\mathbf{x}}=\bar{\mathbf{x}}+\sum_{i=1}^{K}y_i\mathbf{u}_i
=\bar{\mathbf{x}}+y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_K\mathbf{u}_K
\quad\text{(reconstruction)}$$

• It can also be shown that the approximation error can be
computed as follows:

$$\|\mathbf{x}-\hat{\mathbf{x}}\|=\frac{1}{2}\sum_{i=K+1}^{N}\lambda_i$$
49
Data Normalization
• The principal components are dependent on the units used
to measure the original variables as well as on the range of
values they assume.
• Data should always be normalized prior to using PCA.
• A common normalization method is to transform all the data
to have zero mean and unit standard deviation:
$$\frac{x_i-\mu_i}{\sigma_i}
\qquad\text{where } \mu_i \text{ and } \sigma_i \text{ are the mean and standard deviation of the $i$-th feature } x_i$$
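The same normalization as a one-liner in NumPy (a sketch; a small epsilon is added only to guard against constant features):

```python
import numpy as np

def standardize(X):
    """Transform each feature (column of X) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # epsilon only guards against constant features
```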
50
Application to Images
• The goal is to represent images in a space of lower
dimensionality using PCA.
− Useful for various applications, e.g., face recognition, image
compression, etc.
• Given M images of size N x N, first represent each image
as a 1D vector (i.e., by stacking the rows together).
− Note that for face recognition, faces must be centered and of the
same size.
Application to Images (cont’d)
• The key challenge is that the covariance matrix Σx is now
very large (i.e., N^2 x N^2) – see Step 3:
Step 3: compute the covariance matrix Σx

$$\Sigma_x=\frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T=\frac{1}{M}AA^T
\qquad\text{where } A=[\Phi_1\ \Phi_2\ \dots\ \Phi_M]\ \ (N^2\times M\ \text{matrix})$$

• Σx is now an N^2 x N^2 matrix – computationally expensive to
compute its eigenvalues/eigenvectors λi, ui:

$$(AA^T)\mathbf{u}_i=\lambda_i\mathbf{u}_i$$
Application to Images (cont’d)
• We will use a simple “trick” to get around this by relating
the eigenvalues/eigenvectors of AAT to those of ATA.
• Let us consider the matrix ATA instead (i.e., M x M matrix)
− Suppose its eigenvalues/eigenvectors are μi, vi
(ATA)vi= μivi
− Multiply both sides by A:
A(ATA)vi=Aμivi or (AAT)(Avi)= μi(Avi)
− Assuming (AAT)ui= λiui
λi=μi and ui=Avi
52
A=[Φ1 Φ2 ... ΦΜ]
(N^2 x M matrix)
Application to Images (cont’d)
• But do AAT and ATA have the same number of
eigenvalues/eigenvectors?
− AAT can have up to N^2 eigenvalues/eigenvectors.
− ATA can have up to M eigenvalues/eigenvectors.
− It can be shown that the M eigenvalues/eigenvectors of ATA are
also the M largest eigenvalues/eigenvectors of AAT
• Steps 3-5 of PCA need to be updated as follows:
53
Application to Images (cont’d)
Step 3 compute ATA (i.e., instead of AAT)
Step 4: compute μi, vi of ATA
Step 4b: compute λi, ui of AAT using λi=μi and ui=Avi, then
normalize ui to unit length.
Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K<M):
$$\hat{\mathbf{x}}-\bar{\mathbf{x}}=\sum_{i=1}^{K}y_i\mathbf{u}_i=y_1\mathbf{u}_1+y_2\mathbf{u}_2+\dots+y_K\mathbf{u}_K$$

i.e., each image can be represented by a K-dimensional vector:

$$\mathbf{x}-\bar{\mathbf{x}}
\;\longrightarrow\;
\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}$$
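A sketch of the modified Steps 3-4b, assuming the centered image vectors are the columns of an N^2 x M matrix A; this avoids ever forming the N^2 x N^2 matrix AA^T:

```python
import numpy as np

def eigenfaces(A, K):
    """A: N^2 x M matrix whose columns are the centered images. Returns the top-K eigenpairs of (1/M) A A^T."""
    M = A.shape[1]
    small = A.T @ A / M                       # Step 3: the M x M matrix, instead of the N^2 x N^2 one
    mu, V = np.linalg.eigh(small)             # Step 4: eigenvalues/eigenvectors of A^T A / M (ascending)
    order = np.argsort(mu)[::-1][:K]          # keep the K largest
    U = A @ V[:, order]                       # Step 4b: u_i = A v_i has the same eigenvalue
    U = U / np.linalg.norm(U, axis=0)         # normalize each u_i to unit length
    return mu[order], U

# Usage: 20 images of size 32 x 32, flattened, centered, and stacked as columns.
rng = np.random.default_rng(0)
imgs = rng.standard_normal((20, 32 * 32))
A = (imgs - imgs.mean(axis=0)).T              # N^2 x M = 1024 x 20
lams, U = eigenfaces(A, K=5)
print(U.shape)                                # (1024, 5)
```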
55
Example
Dataset
56
Example (cont’d)
Top eigenvectors: u1,…uk
(visualized as an image - eigenfaces)
Mean face: x̄
u1
u2 u3
57
Example (cont’d)
u1
u2 u3
• How can you visualize the eigenvectors (eigenfaces)
as an image?
− Their values must be first mapped to integer values in
the interval [0, 255] (required by PGM format).
− Suppose fmin and fmax are the min/max values of a given
eigenface (could be negative).
− If xϵ[fmin, fmax] is the original value, then the new value
yϵ[0,255] can be computed as follows:
y=(int)255(x - fmin)/(fmax - fmin)
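The same mapping in NumPy (a sketch; eigenface is any real-valued array):

```python
import numpy as np

def to_uint8(eigenface):
    """Linearly map a real-valued eigenface to integer grey levels in [0, 255] (e.g., for PGM output)."""
    fmin, fmax = eigenface.min(), eigenface.max()
    return (255 * (eigenface - fmin) / (fmax - fmin)).astype(np.uint8)
```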
58
Application to Images (cont’d)
• Interpretation: represent a face in terms of eigenfaces
$$\hat{\mathbf{x}}=\sum_{i=1}^{K}y_i\mathbf{u}_i+\bar{\mathbf{x}}
=y_1\mathbf{u}_1+y_2\mathbf{u}_2+y_3\mathbf{u}_3+\dots+y_K\mathbf{u}_K+\bar{\mathbf{x}}$$

i.e., the face is represented by its vector of coefficients:

$$\mathbf{x}-\bar{\mathbf{x}}
\;\longrightarrow\;
\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}$$
59
Case Study: Eigenfaces for Face
Detection/Recognition
− M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of
Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
• Face Recognition
− The simplest approach is to think of it as a template matching
problem.
− Problems arise when performing recognition in a high-dimensional
space.
− Use dimensionality reduction!
• Process the image database (i.e., set of images with
labels) – typically referred to as “training” phase:
− Compute PCA space using image database (i.e., training data)
− Represent each image in the database with K coefficients Ωi
Face Recognition Using Eigenfaces
$$\Omega_i=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}$$
Given an unknown face x, follow these steps:
Step 1: Subtract mean face (computed from training data)
Step 2: Project unknown face in the eigenspace:
Step 3: Find closest match Ωi from training set using:
Step 4: Recognize x as person “k” where k is the ID linked to Ωi
Note: for intruder rejection, we need er<Tr, for some threshold Tr
61
Face Recognition Using Eigenfaces
$$\Phi=\mathbf{x}-\bar{\mathbf{x}}
\qquad
\Omega=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_K\end{bmatrix}
\quad\text{where } y_i=\mathbf{u}_i^T\Phi
\qquad
\hat{\Phi}=\sum_{i=1}^{K}y_i\mathbf{u}_i$$

$$e_r=\min_i\|\Omega-\Omega_i\|^2=\min_i\sum_{j=1}^{K}(y_j-y_j^i)^2
\quad\text{(Euclidean distance)}
\qquad\text{or}\qquad
e_r=\min_i\sum_{j=1}^{K}\frac{1}{\lambda_j}(y_j-y_j^i)^2
\quad\text{(Mahalanobis distance)}$$

The distance er is called the distance in face space (difs).
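A sketch of the four recognition steps, assuming U holds the K eigenfaces as columns, mean_face is the training mean, Omegas stacks the training coefficient vectors Ωi as rows, labels holds the corresponding IDs, and lams holds the eigenvalues for the Mahalanobis variant; all names are illustrative:

```python
import numpy as np

def recognize(x, mean_face, U, Omegas, labels, lams=None, Tr=np.inf):
    """Return the label of the closest training coefficient vector, or None for an intruder."""
    Omega = U.T @ (x - mean_face)                 # Steps 1-2: subtract the mean face, project to the eigenspace
    diffs = Omegas - Omega                        # one row per training image
    if lams is None:
        dists = np.sum(diffs**2, axis=1)          # squared Euclidean distance in face space
    else:
        dists = np.sum(diffs**2 / lams, axis=1)   # Mahalanobis distance: weight each term by 1/lambda_j
    i = int(np.argmin(dists))                     # Step 3: closest match, e_r = dists[i]
    return labels[i] if dists[i] < Tr else None   # Step 4, with intruder rejection when e_r >= Tr
```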
62
Face detection vs recognition
Detection Recognition “Sally”
Given an unknown image x, follow these steps:
Step 1: Subtract mean face (computed from training data):
Step 2: Project unknown face in the eigenspace:
Step 3: Compute
Step 4: if ed<Td, then x is a face.
63
Face Detection Using Eigenfaces
The distance ed is called the distance from face space (dffs):

$$\Phi=\mathbf{x}-\bar{\mathbf{x}}
\qquad
y_i=\mathbf{u}_i^T\Phi
\qquad
\hat{\Phi}=\sum_{i=1}^{K}y_i\mathbf{u}_i
\qquad
e_d=\|\Phi-\hat{\Phi}\|$$
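A matching sketch of the detection test, reusing the same assumed mean_face and U as above:

```python
import numpy as np

def is_face(x, mean_face, U, Td):
    """Distance-from-face-space (dffs) test: reconstruct from the eigenspace and threshold the residual."""
    Phi = x - mean_face                  # Step 1: subtract the mean face
    y = U.T @ Phi                        # Step 2: project onto the eigenspace
    Phi_hat = U @ y                      # reconstruction from the first K eigenfaces
    e_d = np.linalg.norm(Phi - Phi_hat)  # Step 3: dffs
    return e_d < Td                      # Step 4
```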
64
Eigenfaces
Reconstructed image looks
like a face.
Reconstructed image looks
like a face.
Reconstructed image
looks like a face again!
Input Reconstructed
65
Reconstruction from partial information
• Robust to partial face occlusion.
Input Reconstructed
66
Eigenfaces
• Can be used for face detection, tracking, and recognition!
Visualize dffs as an image: $e_d=\|\Phi-\hat{\Phi}\|$
Dark: small distance
Bright: large distance
67
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation
(but not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.
68
Limitations (cont’d)
• Not robust to misalignment.
69
Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.
70
Linear Discriminant Analysis (LDA)
• What is the goal of LDA?
− Seeks to find directions along which the classes are best
separated (i.e., increase discriminatory information).
− It takes into consideration the scatter (i.e., variance) within-
classes and between-classes.
Bad separability Good separability
projection direction
projection direction
• Let us assume C classes with each class containing Mi samples,
i=1,2,..,C and M the total number of samples:
• Let μi be the mean of the i-th class, i=1,2,…,C, and μ be the mean of the
whole dataset:
71
Linear Discriminant Analysis (LDA) (cont’d)

$$M=\sum_{i=1}^{C}M_i
\qquad\qquad
\boldsymbol{\mu}=\frac{1}{C}\sum_{i=1}^{C}\boldsymbol{\mu}_i$$

Within-class scatter matrix:
$$S_w=\sum_{i=1}^{C}\sum_{j=1}^{M_i}(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T$$

Between-class scatter matrix:
$$S_b=\sum_{i=1}^{C}(\boldsymbol{\mu}_i-\boldsymbol{\mu})(\boldsymbol{\mu}_i-\boldsymbol{\mu})^T$$
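A sketch of these two scatter matrices in NumPy, assuming X holds the data as rows and y holds integer class labels; note that the overall mean is computed as the mean of the class means, as in the formula above:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices for LDA."""
    classes = np.unique(y)
    mus = np.array([X[y == c].mean(axis=0) for c in classes])   # class means mu_i
    mu = mus.mean(axis=0)                                        # mean of the whole dataset (mean of class means)
    N = X.shape[1]
    Sw, Sb = np.zeros((N, N)), np.zeros((N, N))
    for c, mu_i in zip(classes, mus):
        D = X[y == c] - mu_i
        Sw += D.T @ D                                            # sum over the samples of class i
        Sb += np.outer(mu_i - mu, mu_i - mu)                     # one term per class
    return Sw, Sb
```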
72
Linear Discriminant Analysis (LDA) (cont’d)
• LDA seeks transformations that maximize the between-
class scatter and minimize the within-class scatter:
$$\max\frac{|\tilde{S}_b|}{|\tilde{S}_w|}$$

• Suppose the desired projection transformation is $\mathbf{y}=U^T\mathbf{x}$,
and the scatter matrices of the projected data y are $\tilde{S}_b$ and $\tilde{S}_w$;
then, equivalently:

$$\max\frac{|U^TS_bU|}{|U^TS_wU|}$$
73
Linear Discriminant Analysis (LDA) (cont’d)
• It can be shown that the columns of the matrix U are the
eigenvectors (i.e., called Fisherfaces) corresponding to the
largest eigenvalues of the following generalized eigen-
problem:
• It can be shown that Sb has at most rank C-1; therefore,
the max number of eigenvectors with non-zero
eigenvalues is C-1, that is:
max dimensionality of LDA sub-space is C-1
$$S_b\mathbf{u}_k=\lambda_kS_w\mathbf{u}_k$$
e.g., when C=2, we always end up with one LDA feature
no matter what the original number of features was!
Example
74
75
Linear Discriminant Analysis (LDA) (cont’d)
• If Sw is non-singular, we can solve a conventional
eigenvalue problem as follows:

$$S_b\mathbf{u}_k=\lambda_kS_w\mathbf{u}_k
\quad\Longrightarrow\quad
S_w^{-1}S_b\mathbf{u}_k=\lambda_k\mathbf{u}_k$$

• In practice, Sw is singular due to the high dimensionality
of the data (e.g., images) and the much smaller number of
data points (M << N).
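A sketch of this step, assuming Sw is non-singular (scipy.linalg.eigh can also solve the generalized problem S_b u = λ S_w u directly, but the plain S_w^{-1} S_b form below mirrors the slide):

```python
import numpy as np

def lda_directions(Sw, Sb, K):
    """Eigenvectors of Sw^{-1} Sb with the largest eigenvalues (at most C-1 of them are meaningful)."""
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # Sw^{-1} Sb u = lambda u
    order = np.argsort(evals.real)[::-1][:K]                # keep the K largest eigenvalues
    return evecs[:, order].real                             # columns are the LDA projection directions
```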
76
Linear Discriminant Analysis (LDA) (cont’d)
• To alleviate this problem, PCA could be applied first:
1) First, apply PCA to reduce data dimensionality:
2) Then, apply LDA to find the most discriminative directions:
$$\mathbf{x}=\begin{bmatrix}x_1\\x_2\\ \vdots \\x_N\end{bmatrix}
\xrightarrow{\;\text{PCA}\;}
\mathbf{y}=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix}
\qquad\qquad
\mathbf{y}=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix}
\xrightarrow{\;\text{LDA}\;}
\mathbf{z}=\begin{bmatrix}z_1\\z_2\\ \vdots \\z_K\end{bmatrix}$$
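Putting the two stages together: a sketch using scikit-learn purely for illustration (the slides do not prescribe any library; the dimensions below are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 50))                 # M = 60 samples in N = 50 dimensions
y = np.repeat([0, 1, 2], 20)                      # C = 3 classes

X_pca = PCA(n_components=20).fit_transform(X)     # 1) PCA: reduce dimensionality so Sw is no longer singular
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_pca, y)   # 2) LDA: at most C-1 = 2 dims
print(X_lda.shape)                                # (60, 2)
```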
77
Case Study I
− D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image
Retrieval", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 18, no. 8, pp. 831-836, 1996.
• Content-based image retrieval:
− Application: query-by-example content-based image retrieval
− Question: how to select a good set of image features?
78
Case Study I (cont’d)
• Assumptions
− Well-framed images are required as input for training and query-by-
example test probes.
− Only a small variation in the size, position, and orientation of the
objects in the images is allowed.
79
Case Study I (cont’d)
• Terminology
− Most Expressive Features (MEF): features obtained using PCA.
− Most Discriminating Features (MDF): features obtained using LDA.
• Numerical instabilities
− Computing the eigenvalues/eigenvectors of $S_w^{-1}S_B\mathbf{u}_k=\lambda_k\mathbf{u}_k$ could
lead to unstable computations since $S_w^{-1}S_B$ is not always symmetric.
− Check the paper for more details about how to deal with this issue.
80
Case Study I (cont’d)
• Comparing projection directions between MEF with MDF:
− PCA eigenvectors show the tendency of PCA to capture major
variations in the training set such as lighting direction.
− LDA eigenvectors discount those factors unrelated to classification.
81
Case Study I (cont’d)
• Clustering effect
PCA space LDA space
82
Case Study I (cont’d)
1) Represent each training image in terms of MDFs (or MEFs for
comparison).
2) Represent a query image in terms of MDFs (or MEFs for
comparison).
3) Find the k closest neighbors (e.g., using Euclidean distance).
• Methodology
83
Case Study I (cont’d)
• Experiments and results
Face images
− A set of face images was used with 2 expressions, 3 lighting conditions.
− Testing was performed using a disjoint set of images.
84
Case Study I (cont’d)
Top match (k=1)
85
Case Study I (cont’d)
− Examples of correct search probes
86
Case Study I (cont’d)
− Example of a failed search probe
87
Case Study II
− A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-
233, 2001.
• Is LDA always better than PCA?
− There has been a tendency in the computer vision community to
prefer LDA over PCA.
− This is mainly because LDA deals directly with discrimination
between classes while PCA does not pay attention to the underlying
class structure.
88
Case Study II (cont’d)
AR database
89
Case Study II (cont’d)
LDA is not always better when the training set is small
PCA w/o 3: not using the
first three principal components
that seem to encode mostly
variations due to lighting
90
Case Study II (cont’d)
LDA outperforms PCA when the training set is large
PCA w/o 3: not using the
first three principal components
that seem to encode mostly
variations due to lighting