K-means loss function: recap
[Figure: the $N \times D$ data matrix $\mathbf{X}$ approximated as the product of an $N \times K$ assignment matrix $\mathbf{Z}$ and a $K \times D$ matrix of cluster means. $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]$ denotes a length-$K$ one-hot encoding of $\mathbf{x}_n$]
▪ Remember the matrix factorization view of the k-means loss function?
▪ We approximated an $N \times D$ matrix with an $N \times K$ matrix and a $K \times D$ matrix
▪ This could be storage-efficient if $K$ is much smaller than $D$
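To make the storage argument concrete, here is a minimal NumPy sketch of this factorization view; the sizes, the variable names (X, Z, M), and the random data are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 50, 5                    # illustrative sizes

X = rng.normal(size=(N, D))              # N x D data matrix
labels = rng.integers(0, K, size=N)      # cluster assignments (as if produced by k-means)

Z = np.eye(K)[labels]                    # N x K one-hot assignment matrix
M = np.stack([X[labels == k].mean(axis=0) for k in range(K)])  # K x D matrix of cluster means

X_hat = Z @ M                            # N x D approximation: each row replaced by its cluster's mean
kmeans_loss = np.sum((X - X_hat) ** 2)   # the k-means loss for this assignment

# Storage: N*D numbers for X vs. N assignments + K*D means in the factorized form
print(kmeans_loss, N * D, N + K * D)
```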
Dimensionality Reduction
▪ A broad class of techniques
▪ Goal is to compress the original representation of the inputs
▪ Example: Approximate each input $\mathbf{x}_n \in \mathbb{R}^D$, $n = 1, 2, \ldots, N$, as a linear combination of $K < \min\{D, N\}$ "basis" vectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K$, each also $\in \mathbb{R}^D$:
$$\mathbf{x}_n \approx \sum_{k=1}^{K} z_{nk}\,\mathbf{w}_k = \mathbf{W}\mathbf{z}_n$$
where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is $D \times K$ and $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]^\top$ is $K \times 1$
▪ We have represented each $\mathbf{x}_n \in \mathbb{R}^D$ by a $K$-dim vector $\mathbf{z}_n$ (a new feature representation)
▪ To store $N$ such inputs $\{\mathbf{x}_n\}_{n=1}^{N}$, we need to keep $\mathbf{W}$ and $\{\mathbf{z}_n\}_{n=1}^{N}$
▪ Originally we required $N \times D$ storage; now we need $N \times K + D \times K = (N + D) \times K$ storage (see the sketch below)
Note: These "basis" vectors need not necessarily be linearly independent, but for some dim-red techniques, e.g., classic principal component analysis (PCA), they are.
Can think of $\mathbf{W}$ as a linear mapping that transforms the low-dim $\mathbf{z}_n$ to the high-dim $\mathbf{x}_n$.
Some dim-red techniques assume a nonlinear mapping function $f$ such that $\mathbf{x}_n = f(\mathbf{z}_n)$; for example, $f$ can be modeled by a kernel or a deep neural net.
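As a quick illustration of the linear case above (not from the slides; the sizes and the random W and Z are arbitrary), a NumPy sketch of reconstructing inputs from a basis and comparing storage costs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 100, 10                   # illustrative sizes with K << D

W = rng.normal(size=(D, K))              # D x K matrix whose columns are the K "basis" vectors
Z = rng.normal(size=(N, K))              # N x K matrix whose rows are the codes z_n

X_approx = Z @ W.T                       # row n is W @ z_n, i.e. sum_k z_nk * w_k

# Storage comparison: N*D originally vs. (N + D)*K for the compressed representation
print(N * D, (N + D) * K)
```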
Dimensionality Reduction
▪ Dim-red for face images
▪ In this example, $\mathbf{z}_n \in \mathbb{R}^K$ ($K = 4$) is a low-dim feature rep. for $\mathbf{x}_n \in \mathbb{R}^D$
▪ Essentially, each face image in the dataset is now represented by just 4 real numbers
▪ Different dim-red algos differ in terms of how the basis vectors are defined/learned
[Figure: a face image $\mathbf{x}_n \in \mathbb{R}^D$ approximated as $z_{n1}\mathbf{w}_1 + z_{n2}\mathbf{w}_2 + z_{n3}\mathbf{w}_3 + z_{n4}\mathbf{w}_4$ using $K = 4$ "basis" face images. Each "basis" image is like a "template" that captures the common properties of face images in the dataset, and the coefficients $z_{n1}, z_{n2}, z_{n3}, z_{n4}$ act like 4 new features]
Principal Component Analysis (PCA)
▪ A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1933)
▪ Can be seen as
  ▪ Learning directions (co-ordinate axes) that capture the maximum variance in the data
  ▪ Learning projection directions that result in the smallest reconstruction error
▪ PCA also assumes that the projection directions are orthonormal
$$\operatorname*{argmin}_{\mathbf{W},\mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{W}\mathbf{z}_n\|^2 = \operatorname*{argmin}_{\mathbf{W},\mathbf{Z}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}^\top\|^2$$
subject to orthonormality constraints: $\mathbf{w}_i^\top \mathbf{w}_j = 0$ for $i \neq j$, and $\|\mathbf{w}_i\|^2 = 1$
[Figure: 2D data shown with the standard axes $\mathbf{e}_1, \mathbf{e}_2$ and the new (rotated) axes $\mathbf{w}_1, \mathbf{w}_2$]
PCA is essentially doing a change of axes in which we are representing the data.
$\mathbf{e}_1, \mathbf{e}_2$: standard co-ordinate axes ($\mathbf{x} = [x_1, x_2]$); $\mathbf{w}_1, \mathbf{w}_2$: new co-ordinate axes ($\mathbf{z} = [z_1, z_2]$).
Each input will still have 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin.
To reduce dimension, we can keep only the co-ordinates along the directions that have the largest variances (e.g., in this example, if we want to reduce to one dimension, we can keep the co-ordinate $z_1$ of each point along $\mathbf{w}_1$ and throw away $z_2$); we won't lose much information.
Principal Component Analysis: the algorithm
▪ Center the data (subtract the mean $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$ from each data point)
▪ Compute the $D \times D$ covariance matrix $\mathbf{S}$ using the centered data matrix $\mathbf{X}$ as $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$ (assuming $\mathbf{X}$ is arranged as $N \times D$)
▪ Do an eigendecomposition of the covariance matrix $\mathbf{S}$ (many methods exist)
▪ Take the top $K < D$ leading eigenvectors $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K\}$ with eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_K\}$
▪ The $K$-dimensional projection/embedding of each input is $\mathbf{z}_n = \mathbf{W}_K^\top \mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size $D \times K$
Note: We can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance in the $k$-th direction, and their sum is the total variance)
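A minimal NumPy sketch of these steps on synthetic data (the data, sizes, and variable names are made up; the $1/N$ convention follows the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 10, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic correlated data, N x D

mu = X.mean(axis=0)                       # mean of the data
Xc = X - mu                               # 1. center the data
S = Xc.T @ Xc / N                         # 2. D x D covariance matrix S = (1/N) X^T X

eigvals, eigvecs = np.linalg.eigh(S)      # 3. eigendecomposition (eigh since S is symmetric)
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
W_K = eigvecs[:, order[:K]]               # 4. top-K eigenvectors -> D x K projection matrix

Z = Xc @ W_K                              # 5. embeddings z_n = W_K^T (x_n - mu), one per row

# Fraction of the total variance captured by the top K directions
print(eigvals[order[:K]].sum() / eigvals.sum())
```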
Understanding PCA: The variance perspective
Solving PCA by Finding Max. Variance Directions
▪ Consider projecting an input $\mathbf{x}_n \in \mathbb{R}^D$ along a direction $\mathbf{w}_1 \in \mathbb{R}^D$
▪ The projection/embedding of $\mathbf{x}_n$ (red points in the figure) will be $\mathbf{w}_1^\top \mathbf{x}_n$ (green points)
▪ Want $\mathbf{w}_1$ such that the variance $\mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1$ is maximized
[Figure: inputs $\mathbf{x}_n$ (red points) and their scalar projections onto the direction $\mathbf{w}_1$ (green points)]
Mean of the projections of all inputs:
$$\frac{1}{N}\sum_{n=1}^{N} \mathbf{w}_1^\top \mathbf{x}_n = \mathbf{w}_1^\top \Big(\frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n\Big) = \mathbf{w}_1^\top \boldsymbol{\mu}$$
Variance of the projections:
$$\frac{1}{N}\sum_{n=1}^{N} (\mathbf{w}_1^\top \mathbf{x}_n - \mathbf{w}_1^\top \boldsymbol{\mu})^2 = \frac{1}{N}\sum_{n=1}^{N} \{\mathbf{w}_1^\top(\mathbf{x}_n - \boldsymbol{\mu})\}^2 = \mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1$$
where $\mathbf{S}$ is the $D \times D$ covariance matrix of the data:
$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$$
For already centered data, $\boldsymbol{\mu} = \mathbf{0}$ and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n\mathbf{x}_n^\top = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$ (with $\mathbf{X}$ arranged as $N \times D$)
This gives the optimization problem
$$\operatorname*{argmax}_{\mathbf{w}_1}\; \mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1 \quad \text{s.t.} \quad \mathbf{w}_1^\top \mathbf{w}_1 = 1$$
We need this constraint, otherwise the objective's maximum would be unbounded (we could make the variance arbitrarily large just by scaling $\mathbf{w}_1$)
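A quick numerical check of the variance identity above, on synthetic data (all names and sizes are my own): the empirical variance of the scalar projections matches $\mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic data

w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)                  # unit-norm direction (the constraint w1^T w1 = 1)

mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / N             # covariance matrix

proj = X @ w1                             # scalar projections w1^T x_n
var_proj = np.mean((proj - proj.mean()) ** 2)

print(var_proj, w1 @ S @ w1)              # the two values agree up to floating-point error
```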
Max. Variance Direction
▪ Our objective function was $\operatorname*{argmax}_{\mathbf{w}_1} \mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1$ s.t. $\mathbf{w}_1^\top \mathbf{w}_1 = 1$
▪ Can construct a Lagrangian for this problem: $\operatorname*{argmax}_{\mathbf{w}_1} \mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1 + \lambda_1(1 - \mathbf{w}_1^\top \mathbf{w}_1)$
▪ Taking the derivative w.r.t. $\mathbf{w}_1$ and setting it to zero gives $\mathbf{S}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1$
▪ Therefore $\mathbf{w}_1$ is an eigenvector of the cov matrix $\mathbf{S}$ with eigenvalue $\lambda_1$ (note: in general, $\mathbf{S}$ will have $D$ eigenvectors)
▪ Claim: $\mathbf{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue $\lambda_1$. Note that $\mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1 = \lambda_1 \mathbf{w}_1^\top \mathbf{w}_1 = \lambda_1$, which is precisely the variance along the direction $\mathbf{w}_1$
▪ Thus the variance $\mathbf{w}_1^\top \mathbf{S}\,\mathbf{w}_1$ will be maximized if $\lambda_1$ is the largest eigenvalue (and $\mathbf{w}_1$ is the corresponding top eigenvector, also known as the first Principal Component)
▪ Other large-variance directions can be found likewise (with each being orthogonal to the previously found directions); PCA would keep the top $K < D$ such directions of largest variance
Note: The total variance of the data is equal to the sum of the eigenvalues of $\mathbf{S}$, i.e., $\sum_{d=1}^{D} \lambda_d$
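A small numerical check of these claims (synthetic data; everything named here is illustrative): the variance along the top eigenvector equals the largest eigenvalue, no random unit direction does better, and the eigenvalues of $\mathbf{S}$ sum to the total variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 6
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N

eigvals, eigvecs = np.linalg.eigh(S)
w1 = eigvecs[:, -1]                               # eigenvector with the largest eigenvalue

print(w1 @ S @ w1, eigvals[-1])                   # variance along w1 = top eigenvalue

# No random unit-norm direction captures more variance than w1
dirs = rng.normal(size=(1000, D))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(np.max(np.einsum('kd,de,ke->k', dirs, S, dirs)) <= eigvals[-1] + 1e-12)

print(np.trace(S), eigvals.sum())                 # total variance = sum of eigenvalues
```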
Understanding PCA: The reconstruction perspective
Alternate Basis and Reconstruction
▪ Representing a data point $\mathbf{x}_n = [x_{n1}, x_{n2}, \ldots, x_{nD}]^\top$ in the standard orthonormal basis $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_D$:
$$\mathbf{x}_n = \sum_{d=1}^{D} x_{nd}\,\mathbf{e}_d$$
($\mathbf{e}_d$ is a vector of all zeros except a single 1 at the $d$-th position; also, $\mathbf{e}_d^\top \mathbf{e}_{d'} = 0$ for $d \neq d'$)
▪ Let's represent the same data point in a new orthonormal basis $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_D$:
$$\mathbf{x}_n = \sum_{d=1}^{D} z_{nd}\,\mathbf{w}_d$$
where $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nD}]^\top$ denotes the co-ordinates of $\mathbf{x}_n$ in the new basis, and $z_{nd}$ is the projection of $\mathbf{x}_n$ along the direction $\mathbf{w}_d$ since $z_{nd} = \mathbf{w}_d^\top \mathbf{x}_n = \mathbf{x}_n^\top \mathbf{w}_d$ (verify)
▪ Ignoring the directions along which the projection $z_{nd}$ is small, we can approximate $\mathbf{x}_n$ as
$$\mathbf{x}_n \approx \hat{\mathbf{x}}_n = \sum_{d=1}^{K} z_{nd}\,\mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{w}_d^\top \mathbf{x}_n)\,\mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{w}_d \mathbf{w}_d^\top)\,\mathbf{x}_n$$
▪ Now $\mathbf{x}_n$ is represented by the $K < D$ dim. rep. $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]^\top$ with $\mathbf{z}_n = \mathbf{W}_K^\top \mathbf{x}_n$ (verify), where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size $D \times K$. Note that $\mathbf{z}_n \in \mathbb{R}^K$, and $\mathbf{x}_n \approx \mathbf{W}_K \mathbf{z}_n$
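A minimal NumPy sketch of this reconstruction (assuming an orthonormal $\mathbf{W}_K$ obtained from PCA on synthetic, centered data; all names are mine): $\mathbf{z}_n = \mathbf{W}_K^\top \mathbf{x}_n$ and $\hat{\mathbf{x}}_n = \mathbf{W}_K \mathbf{z}_n = \mathbf{W}_K \mathbf{W}_K^\top \mathbf{x}_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 300, 8, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)                        # centered data

S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)
W_K = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # D x K orthonormal basis (top-K directions)

Z = Xc @ W_K                                   # rows are z_n = W_K^T x_n
X_hat = Z @ W_K.T                              # rows are x_hat_n = W_K z_n

print(np.allclose(X_hat, Xc @ W_K @ W_K.T))    # same reconstruction written as W_K W_K^T x_n
```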
Minimizing Reconstruction Error
▪ We plan to use only $K$ directions $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K$, so we would like them to be such that the total reconstruction error is minimized:
$$\mathcal{L}(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{n=1}^{N} \Big\|\mathbf{x}_n - \sum_{k=1}^{K}(\mathbf{w}_k \mathbf{w}_k^\top)\,\mathbf{x}_n\Big\|^2 = C - \sum_{k=1}^{K} \mathbf{w}_k^\top \mathbf{S}\,\mathbf{w}_k \quad \text{(verify)}$$
where $C$ is a constant that doesn't depend on the $\mathbf{w}_k$'s, and $\mathbf{w}_k^\top \mathbf{S}\,\mathbf{w}_k$ is the variance along $\mathbf{w}_k$
▪ Each optimal $\mathbf{w}_k$ can be found by solving
$$\operatorname*{argmin}_{\mathbf{w}_k}\; \mathcal{L}(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \operatorname*{argmax}_{\mathbf{w}_k}\; \mathbf{w}_k^\top \mathbf{S}\,\mathbf{w}_k$$
▪ Thus minimizing the reconstruction error is equivalent to maximizing the variance
▪ The $K$ directions can be found by solving the eigendecomposition of $\mathbf{S}$
▪ Note: $\sum_{k=1}^{K} \mathbf{w}_k^\top \mathbf{S}\,\mathbf{w}_k = \mathrm{trace}(\mathbf{W}_K^\top \mathbf{S}\,\mathbf{W}_K)$
▪ Thus $\operatorname*{argmax}_{\mathbf{W}_K} \mathrm{trace}(\mathbf{W}_K^\top \mathbf{S}\,\mathbf{W}_K)$ s.t. orthonormality constraints on the columns of $\mathbf{W}_K$ is the same as solving the eigendecomposition of $\mathbf{S}$ (recall that spectral clustering also required solving a trace-optimization problem of this form)
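A numerical check of this equivalence (synthetic data, my own names, and the slide's $1/N$ covariance convention): choosing the top-variance directions makes the reconstruction error exactly $N$ times the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 10, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)

S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
W_K = eigvecs[:, order[:K]]                       # top-K variance directions

X_hat = Xc @ W_K @ W_K.T                          # rank-K reconstruction of the centered data
recon_error = np.sum((Xc - X_hat) ** 2)

# Reconstruction error = (total variance - captured variance) * N = N * sum of discarded eigenvalues
print(recon_error, N * eigvals[order[K:]].sum())
```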
Principal Component Analysis
▪ Center the data (subtract the mean $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$ from each data point)
▪ Compute the $D \times D$ covariance matrix $\mathbf{S}$ using the centered data matrix $\mathbf{X}$ as $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$ (assuming $\mathbf{X}$ is arranged as $N \times D$)
▪ Do an eigendecomposition of the covariance matrix $\mathbf{S}$ (many methods exist)
▪ Take the top $K < D$ leading eigenvectors $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K\}$ with eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_K\}$
▪ The $K$-dimensional projection/embedding of each input is $\mathbf{z}_n = \mathbf{W}_K^\top \mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size $D \times K$
Note: We can decide how many eigenvectors to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance in the $k$-th direction, and their sum is the total variance)
Singular Value Decomposition (SVD)
▪ Any matrix $\mathbf{X}$ of size $N \times D$ can be represented as the following decomposition
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{k=1}^{\min\{N, D\}} \lambda_k \mathbf{u}_k \mathbf{v}_k^\top$$
▪ $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N]$ is the $N \times N$ matrix of left singular vectors, each $\mathbf{u}_k \in \mathbb{R}^N$; $\mathbf{U}$ is also orthonormal
▪ $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_D]$ is the $D \times D$ matrix of right singular vectors, each $\mathbf{v}_k \in \mathbb{R}^D$; $\mathbf{V}$ is also orthonormal
▪ $\boldsymbol{\Lambda}$ is an $N \times D$ diagonal matrix of singular values $\lambda_k$. If $N > D$, its last $N - D$ rows are all zeros; if $D > N$, its last $D - N$ columns are all zeros
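A minimal NumPy illustration of these shapes (sizes arbitrary); `np.linalg.svd` with `full_matrices=True` returns the full $\mathbf{U}$, the singular values, and $\mathbf{V}^\top$ as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 4
X = rng.normal(size=(N, D))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, s: min(N, D) values, Vt: D x D

Lam = np.zeros((N, D))                            # N x D "diagonal" matrix of singular values
Lam[:len(s), :len(s)] = np.diag(s)                # remaining rows/columns stay zero

print(np.allclose(X, U @ Lam @ Vt))               # X = U Lambda V^T
print(np.allclose(U.T @ U, np.eye(N)),            # orthonormal left singular vectors
      np.allclose(Vt @ Vt.T, np.eye(D)))          # orthonormal right singular vectors
```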
Low-Rank Approximation via SVD
▪ If we just use the top $K < \min\{N, D\}$ singular values, we get a rank-$K$ SVD approximation $\hat{\mathbf{X}} = \sum_{k=1}^{K} \lambda_k \mathbf{u}_k \mathbf{v}_k^\top$
▪ The above SVD approximation can be shown to minimize the reconstruction error $\|\mathbf{X} - \hat{\mathbf{X}}\|$
▪ Fact: SVD gives the best rank-$K$ approximation of a matrix
▪ PCA is done by doing SVD on the covariance matrix $\mathbf{S}$ (since $\mathbf{S}$ is symmetric, its left and right singular vectors coincide and are just its eigenvectors)
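A short sketch of the rank-$K$ truncation (names mine): keep the top $K$ singular values/vectors, check that the squared error equals the sum of squared discarded singular values, and that an arbitrary rank-$K$ factorization does no better.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 50, 30, 5
X = rng.normal(size=(N, D))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]       # rank-K SVD approximation

svd_err = np.linalg.norm(X - X_K) ** 2            # squared Frobenius reconstruction error
print(svd_err, np.sum(s[K:] ** 2))                # equals the sum of squared discarded singular values

# Compare against an arbitrary rank-K factorization: the SVD error is never larger (Eckart-Young)
A, B = rng.normal(size=(N, K)), rng.normal(size=(K, D))
print(svd_err <= np.linalg.norm(X - A @ B) ** 2)
```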
Dim-Red as Matrix Factorization
▪ If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix $\mathbf{X}$
▪ Can solve such problems using ALT-OPT (see the sketch below)
[Figure: $\mathbf{X}$ ($N \times D$) $\approx$ $\mathbf{Z}$ ($N \times K$) times $\mathbf{W}$ ($K \times D$), where $\mathbf{Z}$ is the matrix containing the low-dim rep of $\mathbf{X}$]
$$\hat{\mathbf{Z}}, \hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{Z},\mathbf{W}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}\|^2$$
If $K < \min\{D, N\}$, such a factorization gives a low-rank approximation of the data matrix $\mathbf{X}$
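A minimal ALT-OPT (alternating least squares) sketch for this objective, assuming no constraints on $\mathbf{Z}$ or $\mathbf{W}$; the data, sizes, initialization, and number of iterations are arbitrary choices for illustration, not the course's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 30, 5
# Synthetic data that is approximately low-rank
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.01 * rng.normal(size=(N, D))

Z = rng.normal(size=(N, K))                        # N x K low-dim representations
W = rng.normal(size=(K, D))                        # K x D factor ("basis") matrix

for _ in range(50):
    # Fix W, solve min_Z ||X - ZW||^2 (a least-squares problem)
    Z = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
    # Fix Z, solve min_W ||X - ZW||^2
    W = np.linalg.lstsq(Z, X, rcond=None)[0]

print(np.linalg.norm(X - Z @ W) ** 2)              # reconstruction error after the alternating updates
```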