Dimensionality Reduction: Principal Component Analysis
CS771: Introduction to Machine Learning
Nisheeth
K-means loss function: recap
• Remember the matrix factorization view of the K-means loss function?
• We approximated the N × D data matrix X by the product of
• an N × K assignment matrix Z, whose row $\mathbf{z}_n = [z_{n1}, \dots, z_{nK}]$ is a length-$K$ one-hot encoding of the cluster assignment of $\mathbf{x}_n$, and
• a K × D matrix of cluster means
• This can be storage efficient if K is much smaller than D (see the sketch below)
[Figure: the N × D matrix X approximated as the product of the N × K matrix Z and the K × D matrix of cluster means.]
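To make the factorization view concrete, here is a minimal numpy sketch; the data, cluster assignments, and sizes N, D, K are made up for illustration and are not taken from the slides.

```python
# Matrix factorization view of K-means: X (N x D) is approximated by Z @ M,
# where Z (N x K) holds one-hot cluster assignments and M (K x D) holds the
# cluster means. All numbers below are illustrative.
import numpy as np

N, D, K = 1000, 50, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))

# hypothetical cluster assignments (e.g., produced by one run of K-means)
assign = rng.integers(0, K, size=N)
Z = np.eye(K)[assign]                       # N x K one-hot assignment matrix
M = np.stack([X[assign == k].mean(axis=0)   # K x D matrix of cluster means
              for k in range(K)])

X_approx = Z @ M                            # the rank-K approximation of X
print(X.size, N * K + K * D)                # N*D values vs N*K + K*D values
```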
Dimensionality Reduction
 A broad class of techniques
 Goal is to compress the original representation of the inputs
 Example: Approximate each input $\mathbf{x}_n \in \mathbb{R}^D$ as a linear combination of $K$ “basis” vectors $\mathbf{w}_1, \dots, \mathbf{w}_K$, each also $D$-dimensional
 We have represented each $\mathbf{x}_n$ by a $K$-dim vector $\mathbf{z}_n = [z_{n1}, \dots, z_{nK}]$ (a new feature representation)
 To store $N$ such inputs, we only need to keep $\mathbf{Z}$ (the $\mathbf{z}_n$'s) and $\mathbf{W}$ (the $\mathbf{w}_k$'s)
 Originally we required $O(ND)$ storage, now $O(NK + KD)$ storage
 If $K \ll \min\{N, D\}$, this yields substantial storage saving, hence good compression (see the sketch below)
$$\mathbf{x}_n \approx \sum_{k=1}^{K} z_{nk}\,\mathbf{w}_k = \mathbf{W}\mathbf{z}_n$$
Here $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is $D \times K$ and $\mathbf{z}_n$ is $K \times 1$.
Note: These “basis” vectors need not necessarily be linearly independent. But for some dim-red techniques, e.g., classic principal component analysis (PCA), they are.
Can think of $\mathbf{W}$ as a linear mapping that transforms the low-dim $\mathbf{z}_n$ into the high-dim $\mathbf{x}_n$.
Some dim-red techniques assume a nonlinear mapping $f$ such that $\mathbf{x}_n \approx f(\mathbf{z}_n)$; for example, $f$ can be modeled by a kernel or a deep neural net.
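A minimal numpy sketch of the representation $\mathbf{x}_n \approx \mathbf{W}\mathbf{z}_n$ and the storage comparison above; the values of N, D, K and the random Z, W are made-up placeholders.

```python
# Linear dim-red representation: each D-dim input is approximated as a linear
# combination of K basis vectors (columns of W) weighted by a K-dim code z_n.
import numpy as np

N, D, K = 10_000, 784, 10
rng = np.random.default_rng(1)
W = rng.normal(size=(D, K))          # K "basis" vectors, each D-dimensional
Z = rng.normal(size=(N, K))          # K-dim code z_n for each of the N inputs

X_approx = Z @ W.T                   # N x D reconstruction of all inputs
original_storage = N * D             # storing the raw inputs
compressed_storage = N * K + K * D   # storing Z and W instead
print(original_storage, compressed_storage)   # 7,840,000 vs 107,840
```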
Dimensionality Reduction
 Dim-red for face images
 In this example, $\mathbf{z}_n = [z_{n1}, z_{n2}, z_{n3}, z_{n4}]$ is a low-dim feature rep. for the face image $\mathbf{x}_n$
 Essentially, each face image in the dataset is now represented by just 4 real numbers
 Different dim-red algos differ in terms of how the basis vectors are defined/learned
[Figure: a face image $\mathbf{x}_n$ written as $z_{n1}\mathbf{w}_1 + z_{n2}\mathbf{w}_2 + z_{n3}\mathbf{w}_3 + z_{n4}\mathbf{w}_4$, a weighted combination of $K = 4$ “basis” face images $\mathbf{w}_1, \dots, \mathbf{w}_4$. Each “basis” image is like a “template” that captures the common properties of the face images in the dataset; the weights $z_{n1}, \dots, z_{n4}$ act like 4 new features.]
Principal Component Analysis (PCA)
 A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1933)
 Can be seen as
 Learning directions (co-ordinate axes) that capture maximum variance in the data
 Learning projection directions that result in the smallest reconstruction error
$$\arg\min_{\mathbf{W},\mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{W}\mathbf{z}_n\|^2 = \arg\min_{\mathbf{W},\mathbf{Z}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}^\top\|^2$$
subject to orthonormality constraints on the projection directions: $\mathbf{w}_k^\top\mathbf{w}_k = 1$ for all $k$, and $\mathbf{w}_k^\top\mathbf{w}_{k'} = 0$ for $k \neq k'$.
[Figure: 2-D data plotted in the standard co-ordinate axes ($\mathbf{e}_1$, $\mathbf{e}_2$) and in the new co-ordinate axes ($\mathbf{w}_1$, $\mathbf{w}_2$) found by PCA; the data has large variance along $\mathbf{w}_1$ and small variance along $\mathbf{w}_2$.]
PCA is essentially doing a change of axes in which we represent the data. Each input will still have 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin. To reduce dimension, we can keep only the co-ordinates along the directions that have the largest variances (e.g., in this example, if we want to reduce to one dimension, we can keep each point's co-ordinate along $\mathbf{w}_1$ and throw away its co-ordinate along $\mathbf{w}_2$); we won't lose much information. A small numerical sketch of this idea follows below.
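A minimal numpy sketch, with made-up correlated 2-D data, of this change of axes: rotate into the largest/smallest-variance directions found from the covariance matrix and keep only the co-ordinate along the largest-variance direction.

```python
# PCA as a change of axes on 2-D data: find the two variance directions,
# then keep only the co-ordinate along the largest-variance one.
import numpy as np

rng = np.random.default_rng(2)
# correlated 2-D data: large variance along one direction, small along the other
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.3]])
X = X - X.mean(axis=0)                  # center the data

S = (X.T @ X) / X.shape[0]              # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
w1 = eigvecs[:, -1]                     # largest-variance direction
z = X @ w1                              # 1-D co-ordinate of each point along w1
print(eigvals, z.var())                 # z keeps most of the total variance
```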
Principal Component Analysis: the algorithm
 Center the data (subtract the mean from each data point)
 Compute the $D \times D$ covariance matrix $\mathbf{S}$ using the centered data matrix
 Do an eigendecomposition of the covariance matrix (many methods exist)
 Take the top $K$ leading eigenvectors $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$ with eigenvalues $\{\lambda_1, \dots, \lambda_K\}$
 The $K$-dimensional projection/embedding of each input is $\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$ (a numerical sketch follows at the end of this slide)
$$\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$$
(assuming the centered data matrix $\mathbf{X}$ is arranged as $N \times D$)
$$\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$$
where $\mathbf{W}_K = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the “projection matrix” of size $D \times K$.
Note: We can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance captured along the direction $\mathbf{w}_k$, and their sum is the total variance).
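A minimal numpy sketch of the algorithm above; the data is randomly generated for illustration and only numpy's eigendecomposition is used.

```python
# PCA via eigendecomposition of the covariance matrix: center, build S,
# eigendecompose, keep the top-K eigenvectors, and project.
import numpy as np

N, D, K = 500, 20, 3
rng = np.random.default_rng(3)
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # made-up correlated data

X = X - X.mean(axis=0)                  # 1. center the data
S = (X.T @ X) / N                       # 2. D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # 3. eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]       #    sort eigenvalues in decreasing order
W_K = eigvecs[:, order[:K]]             # 4. D x K projection matrix (top-K eigvecs)
Z = X @ W_K                             # 5. N x K embeddings, z_n = W_K^T x_n

explained = eigvals[order[:K]].sum() / eigvals.sum()
print(Z.shape, round(explained, 3))     # fraction of total variance captured
```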
Understanding PCA: The variance perspective
Solving PCA by Finding Max. Variance Directions
 Consider projecting an input $\mathbf{x}_n$ along a direction $\mathbf{w}_1$ (assumed to be a unit vector)
 The projection/embedding of $\mathbf{x}_n$ (red points below) will be $\mathbf{w}_1^\top\mathbf{x}_n$ (green points below)
 Want $\mathbf{w}_1$ such that the variance of the projections is maximized
[Figure: inputs $\mathbf{x}_n$ (red points) and their projections $\mathbf{w}_1^\top\mathbf{x}_n$ onto the direction $\mathbf{w}_1$ (green points).]
Mean of the projections of all inputs: $\frac{1}{N}\sum_{n=1}^{N}\mathbf{w}_1^\top\mathbf{x}_n = \mathbf{w}_1^\top\boldsymbol{\mu}$, where $\boldsymbol{\mu}$ is the mean of the data.
Variance of the projections: $\frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{w}_1^\top\mathbf{x}_n - \mathbf{w}_1^\top\boldsymbol{\mu}\big)^2 = \mathbf{w}_1^\top\mathbf{S}\,\mathbf{w}_1$, where $\mathbf{S}$ is the cov matrix of the data: $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$.
For already centered data, $\boldsymbol{\mu} = \mathbf{0}$ and $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$.
So the objective is $\max_{\mathbf{w}_1}\;\mathbf{w}_1^\top\mathbf{S}\,\mathbf{w}_1$ s.t. $\mathbf{w}_1^\top\mathbf{w}_1 = 1$. We need this constraint, otherwise the objective's max would be unbounded (just scale $\mathbf{w}_1$ arbitrarily). This is verified numerically in the sketch below.
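A minimal numpy check, on made-up data, that the variance of the projections $\mathbf{w}^\top\mathbf{x}_n$ equals $\mathbf{w}^\top\mathbf{S}\mathbf{w}$ for centered data and any unit-norm direction $\mathbf{w}$.

```python
# The variance of the 1-D projections w^T x_n equals w^T S w (centered data).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                     # center the data
S = (X.T @ X) / X.shape[0]                 # covariance matrix

w = rng.normal(size=5)
w = w / np.linalg.norm(w)                  # a unit-norm direction
proj = X @ w                               # projections w^T x_n
print(np.allclose(proj.var(), w @ S @ w))  # True
```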
Max. Variance Direction
 Our objective function was $\max_{\mathbf{w}_1}\;\mathbf{w}_1^\top\mathbf{S}\,\mathbf{w}_1$ s.t. $\mathbf{w}_1^\top\mathbf{w}_1 = 1$
 Can construct a Lagrangian for this problem: $\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^\top\mathbf{S}\,\mathbf{w}_1 + \lambda_1(1 - \mathbf{w}_1^\top\mathbf{w}_1)$
 Taking the derivative w.r.t. $\mathbf{w}_1$ and setting it to zero gives $\mathbf{S}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$
 Therefore $\mathbf{w}_1$ is an eigenvector of the cov matrix $\mathbf{S}$ with eigenvalue $\lambda_1$
 Claim: $\mathbf{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue $\lambda_1$. Note that $\mathbf{w}_1^\top\mathbf{S}\,\mathbf{w}_1 = \lambda_1\mathbf{w}_1^\top\mathbf{w}_1 = \lambda_1$, which is exactly the variance along the direction $\mathbf{w}_1$
 Thus the variance will be maximized if $\lambda_1$ is the largest eigenvalue (and $\mathbf{w}_1$ is the corresponding top eigenvector, also known as the first Principal Component); a numerical check appears in the sketch below
 Other large-variance directions can also be found likewise (with each new direction constrained to be orthonormal to the previously found ones)
Note: In general, the $D \times D$ matrix $\mathbf{S}$ will have $D$ eigenvectors.
Note: The total variance of the data is equal to the sum of the eigenvalues of $\mathbf{S}$, i.e., $\sum_{d=1}^{D}\lambda_d$. PCA would keep the top $K$ such directions of largest variance.
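A minimal numerical check, on made-up data and randomly sampled directions, that the top eigenvector of $\mathbf{S}$ attains a larger projected variance $\mathbf{w}^\top\mathbf{S}\mathbf{w}$ than other unit-norm directions.

```python
# The top eigenvector of S maximizes w^T S w over unit-norm directions.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 10)) @ rng.normal(size=(10, 10))
X = X - X.mean(axis=0)
S = (X.T @ X) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)
w1 = eigvecs[:, -1]                                    # top eigenvector
dirs = rng.normal(size=(1000, 10))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # random unit directions
best_random = max(w @ S @ w for w in dirs)
print(w1 @ S @ w1 >= best_random,                      # True
      np.isclose(w1 @ S @ w1, eigvals[-1]))            # variance equals lambda_1
```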
Understanding PCA: The reconstruction perspective
Alternate Basis and Reconstruction
 Representing a data point $\mathbf{x}_n \in \mathbb{R}^D$ in the standard orthonormal basis $\{\mathbf{e}_1, \dots, \mathbf{e}_D\}$
 Let's represent the same data point in a new orthonormal basis $\{\mathbf{w}_1, \dots, \mathbf{w}_D\}$
 Ignoring directions along which the projection $z_{nd}$ is small, we can approximate $\mathbf{x}_n$ as $\hat{\mathbf{x}}_n$
 Now $\mathbf{x}_n$ is represented by the $K$-dim rep. $\mathbf{z}_n = [z_{n1}, \dots, z_{nK}]^\top$, and $\hat{\mathbf{x}}_n = \mathbf{W}_K\mathbf{z}_n$ (verify)
$$\mathbf{x}_n = \sum_{d=1}^{D} x_{nd}\,\mathbf{e}_d$$
($\mathbf{e}_d$ is a vector of all zeros except a single 1 at the $d$-th position. Also, $\mathbf{e}_d^\top\mathbf{e}_{d'} = 0$ for $d \neq d'$.)
$$\mathbf{x}_n = \sum_{d=1}^{D} z_{nd}\,\mathbf{w}_d$$
($\mathbf{z}_n = [z_{n1}, \dots, z_{nD}]$ denotes the co-ordinates of $\mathbf{x}_n$ in the new basis; $z_{nd}$ is the projection of $\mathbf{x}_n$ along the direction $\mathbf{w}_d$, since $z_{nd} = \mathbf{w}_d^\top\mathbf{x}_n$ (verify).)
$$\mathbf{x}_n \approx \hat{\mathbf{x}}_n = \sum_{d=1}^{K} z_{nd}\,\mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{x}_n^\top\mathbf{w}_d)\,\mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{w}_d\mathbf{w}_d^\top)\,\mathbf{x}_n$$
$$\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$$
where $\mathbf{W}_K = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the “projection matrix” of size $D \times K$.
Note that $\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2$ is the reconstruction error on $\mathbf{x}_n$; we would like to minimize it w.r.t. the $\mathbf{w}_d$'s. Also, $\hat{\mathbf{x}}_n = \mathbf{W}_K\mathbf{z}_n = \mathbf{W}_K\mathbf{W}_K^\top\mathbf{x}_n$ (see the sketch below).
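A minimal numpy sketch, on made-up data, of reconstruction in the new basis: $\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$ and $\hat{\mathbf{x}}_n = \mathbf{W}_K\mathbf{W}_K^\top\mathbf{x}_n$, with the reconstruction error shrinking as more directions are kept.

```python
# Reconstruction from the top-K directions: x_hat_n = W_K W_K^T x_n.
import numpy as np

rng = np.random.default_rng(6)
D = 8
X = rng.normal(size=(300, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)
S = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, ::-1]                 # orthonormal basis, sorted by variance

for K in (1, 2, 4, 8):
    W_K = W[:, :K]                   # D x K projection matrix
    X_hat = X @ W_K @ W_K.T          # rows are x_hat_n = W_K W_K^T x_n
    err = np.sum((X - X_hat) ** 2)
    print(K, round(err, 2))          # error decreases; exactly 0 once K = D
```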
Minimizing Reconstruction Error
 We plan to use only $K < D$ directions $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$, so we would like them to be such that the total reconstruction error is minimized
 Each optimal $\mathbf{w}_d$ can be found by solving $\arg\max_{\mathbf{w}_d}\;\mathbf{w}_d^\top\mathbf{S}\,\mathbf{w}_d$ s.t. $\mathbf{w}_d^\top\mathbf{w}_d = 1$
 Thus minimizing the reconstruction error is equivalent to maximizing the variance
 The $K$ directions can be found by solving the eigendecomposition of $\mathbf{S}$
 Note: minimizing the total reconstruction error s.t. orthonormality on the columns of $\mathbf{W}_K$ is the same as solving the eigendecomposition of $\mathbf{X}^\top\mathbf{X}$ (equivalently, of $\mathbf{S}$); recall that Spectral Clustering also required solving such a problem
$$\mathcal{L}(\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K) = \sum_{n=1}^{N}\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{n=1}^{N}\|\mathbf{x}_n\|^2 \;-\; \sum_{d=1}^{K}\mathbf{w}_d^\top\Big(\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^\top\Big)\mathbf{w}_d \quad \text{(verify)}$$
The first term is a constant that doesn't depend on the $\mathbf{w}_d$'s; each term $\mathbf{w}_d^\top\big(\sum_{n}\mathbf{x}_n\mathbf{x}_n^\top\big)\mathbf{w}_d = N\,\mathbf{w}_d^\top\mathbf{S}\,\mathbf{w}_d$ is ($N$ times) the variance along $\mathbf{w}_d$, so minimizing $\mathcal{L}$ means maximizing the captured variance. (This identity is checked numerically in the sketch below.)
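A minimal numerical check of the identity above, using made-up data and arbitrary orthonormal directions.

```python
# Checks: sum_n ||x_n - x_hat_n||^2 = sum_n ||x_n||^2 - sum_d w_d^T (X^T X) w_d
# for any K orthonormal directions (columns of W_K).
import numpy as np

rng = np.random.default_rng(7)
N, D, K = 200, 6, 3
X = rng.normal(size=(N, D))
W_K, _ = np.linalg.qr(rng.normal(size=(D, K)))   # arbitrary orthonormal columns

X_hat = X @ W_K @ W_K.T
lhs = np.sum((X - X_hat) ** 2)
rhs = np.sum(X ** 2) - np.trace(W_K.T @ (X.T @ X) @ W_K)
print(np.isclose(lhs, rhs))                      # True
```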
Principal Component Analysis
 Center the data (subtract the mean from each data point)
 Compute the $D \times D$ covariance matrix $\mathbf{S}$ using the centered data matrix
 Do an eigendecomposition of the covariance matrix (many methods exist)
 Take the top $K$ leading eigenvectors $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$ with eigenvalues $\{\lambda_1, \dots, \lambda_K\}$
 The $K$-dimensional projection/embedding of each input is $\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$
$$\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$$
(assuming the centered data matrix $\mathbf{X}$ is arranged as $N \times D$)
$$\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$$
where $\mathbf{W}_K = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the “projection matrix” of size $D \times K$.
Note: We can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance captured along the direction $\mathbf{w}_k$, and their sum is the total variance).
Singular Value Decomposition (SVD)
 Any matrix $\mathbf{X}$ of size $N \times D$ can be represented using the following decomposition
 $\mathbf{U}$ is the $N \times N$ matrix of left singular vectors, each $\mathbf{u}_k \in \mathbb{R}^N$
 $\mathbf{U}$ is also orthonormal
 $\mathbf{V}$ is the $D \times D$ matrix of right singular vectors, each $\mathbf{v}_k \in \mathbb{R}^D$
 $\mathbf{V}$ is also orthonormal
 $\boldsymbol{\Lambda}$ is $N \times D$ with only diagonal entries, the singular values $\lambda_1, \lambda_2, \dots$
 Note: If $\mathbf{X}$ is symmetric, then this is known as the eigenvalue decomposition (with $\mathbf{U} = \mathbf{V}$)
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{k=1}^{\min\{N, D\}} \lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$$
($\boldsymbol{\Lambda}$ is a diagonal matrix; if $N > D$, its last $N - D$ rows are all zeros, and if $D > N$, its last $D - N$ columns are all zeros.)
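A minimal numpy sketch, on a made-up matrix, of the decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top$ and the orthonormality of $\mathbf{U}$ and $\mathbf{V}$.

```python
# Full SVD of an N x D matrix with numpy; Lambda is rebuilt as an N x D
# diagonal matrix so that X = U @ Lambda @ V^T exactly.
import numpy as np

rng = np.random.default_rng(8)
N, D = 6, 4
X = rng.normal(size=(N, D))

U, svals, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, Vt: D x D
Lam = np.zeros((N, D))
Lam[:len(svals), :len(svals)] = np.diag(svals)         # diagonal singular values
print(np.allclose(X, U @ Lam @ Vt))                    # True
print(np.allclose(U.T @ U, np.eye(N)),                 # U orthonormal
      np.allclose(Vt @ Vt.T, np.eye(D)))               # V orthonormal
```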
Low-Rank Approximation via SVD
 If we just use the top $K$ singular values, we get a rank-$K$ SVD approximation $\mathbf{X} \approx \hat{\mathbf{X}} = \sum_{k=1}^{K}\lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$
 The above SVD approximation can be shown to minimize the reconstruction error $\|\mathbf{X} - \hat{\mathbf{X}}\|$
 Fact: SVD gives the best rank-$K$ approximation of a matrix (a truncation sketch follows below)
 PCA is done by doing SVD on the covariance matrix (the left and right singular vectors are the same and become the eigenvectors, and the singular values become the eigenvalues)
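A minimal sketch of a rank-K approximation via truncated SVD; the matrix and the choice of K are made up for illustration.

```python
# Keep only the top-K singular values/vectors to get a rank-K approximation of X.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 30))
U, svals, Vt = np.linalg.svd(X, full_matrices=False)

K = 10
X_K = U[:, :K] @ np.diag(svals[:K]) @ Vt[:K, :]   # rank-K approximation
print(np.linalg.matrix_rank(X_K),                 # 10
      round(float(np.linalg.norm(X - X_K)), 3))   # remaining reconstruction error
```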
Dim-Red as Matrix Factorization
 If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix
 Can solve such problems using ALT-OPT (alternating optimization); a sketch follows below
[Figure: the $N \times D$ data matrix $\mathbf{X}$ approximated as the product of $\mathbf{Z}$ ($N \times K$, the matrix containing the low-dim reps $\mathbf{z}_n$ of the inputs) and $\mathbf{W}$ ($K \times D$).]
$$\{\hat{\mathbf{Z}}, \hat{\mathbf{W}}\} = \arg\min_{\mathbf{Z},\mathbf{W}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}\|^2$$
If $K < \min\{N, D\}$, such a factorization gives a low-rank approximation of the data matrix $\mathbf{X}$.
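A minimal ALT-OPT (alternating least squares) sketch for this objective; the data, K, and the number of iterations are made-up illustrative choices.

```python
# Alternating optimization for min_{Z,W} ||X - Z W||^2: fix W and solve for Z
# in closed form (least squares), then fix Z and solve for W, and repeat.
import numpy as np

rng = np.random.default_rng(10)
N, D, K = 200, 30, 5
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D))   # data with low-rank structure

Z = rng.normal(size=(N, K))                             # random initialization
W = rng.normal(size=(K, D))
for it in range(50):
    Z = np.linalg.solve(W @ W.T, W @ X.T).T             # least-squares update for Z
    W = np.linalg.solve(Z.T @ Z, Z.T @ X)               # least-squares update for W
print(round(float(np.linalg.norm(X - Z @ W)), 6))       # near 0 for this low-rank X
```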