Dimensionality Reduction: Principal Component Analysis
CS771: Introduction to Machine Learning
Nisheeth
K-means loss function: recap (slide 2)
[Figure: the N × D data matrix X is approximated by the product of an N × K matrix Z and a K × D matrix; z_n = [z_n1, z_n2, …, z_nK] denotes a length-K one-hot encoding of x_n]
• Remember the matrix factorization view of the k-means loss function?
• We approximated an N × D matrix with
  • an N × K matrix, and
  • a K × D matrix
• This could be storage efficient if K is much smaller than D
Dimensionality Reduction (slide 3)
• A broad class of techniques
• Goal is to compress the original representation of the inputs
• Example: approximate each input x_n ∈ ℝ^D, n = 1, 2, …, N, as a linear combination of K < min{D, N} "basis" vectors w_1, w_2, …, w_K, each also ∈ ℝ^D:
  $\mathbf{x}_n \approx \sum_{k=1}^{K} z_{nk}\,\mathbf{w}_k = \mathbf{W}\mathbf{z}_n$
  where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is D × K and $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]^\top$ is K × 1
• We have thus represented each x_n ∈ ℝ^D by a K-dim vector z_n (a new feature representation)
• To store N such inputs {x_n}_{n=1}^N, we only need to keep W and {z_n}_{n=1}^N
• Originally we required N × D storage; now we need N × K + D × K = (N + D) × K storage (see the sketch below)
• Note: these "basis" vectors need not be linearly independent, but for some dim. red. techniques, e.g., classic principal component analysis (PCA), they are
• Can think of W as a linear mapping that transforms the low-dim z_n to the high-dim x_n
• Some dim-red techniques instead assume a nonlinear mapping function f such that x_n = f(z_n); for example, f can be modeled by a kernel or a deep neural net
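To make the storage comparison above concrete, here is a minimal NumPy sketch (not from the slides; the sizes N, D, K and the random W and codes z_n are made up purely for illustration):

```python
# A minimal sketch of the storage view above, using made-up sizes and random data.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 100, 10

W = rng.standard_normal((D, K))          # K "basis" vectors, one per column
Z = rng.standard_normal((N, K))          # row n holds the K-dim code z_n

X_approx = Z @ W.T                       # x_n ≈ W z_n for every n, i.e. X ≈ Z W^T

original_storage   = N * D               # storing X directly
compressed_storage = N * K + D * K       # storing Z and W instead, = (N + D) * K
print(original_storage, compressed_storage)   # 100000 vs 11000
```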
Dimensionality Reduction (slide 4)
• Dim-red for face images
• In this example, z_n ∈ ℝ^K (K = 4) is a low-dim feature representation for x_n ∈ ℝ^D
• Essentially, each face image in the dataset is now represented by just 4 real numbers
• Different dim-red algos differ in terms of how the basis vectors are defined/learned
[Figure: a face image x_n ∈ ℝ^D is approximated as the weighted sum z_n1 w_1 + z_n2 w_2 + z_n3 w_3 + z_n4 w_4 of K = 4 "basis" face images; each "basis" image is like a "template" that captures the common properties of face images in the dataset, and the weights (z_n1, z_n2, z_n3, z_n4) act like 4 new features]
Principal Component Analysis (PCA) (slide 5)
• A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1930)
• Can be seen as
  • learning directions (co-ordinate axes) that capture the maximum variance in the data
  • learning projection directions that result in the smallest reconstruction error
• PCA also assumes that the projection directions are orthonormal:
  $\operatorname*{argmin}_{\mathbf{W},\mathbf{Z}}\ \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{W}\mathbf{z}_n\|^2 \;=\; \operatorname*{argmin}_{\mathbf{W},\mathbf{Z}}\ \|\mathbf{X} - \mathbf{Z}\mathbf{W}^\top\|^2$
  subject to the orthonormality constraints $\mathbf{w}_i^\top\mathbf{w}_j = 0$ for $i \neq j$ and $\|\mathbf{w}_i\|^2 = 1$
• PCA is essentially doing a change of axes in which we represent the data
  [Figure: a 2-D point cloud shown with the standard co-ordinate axes e_1, e_2 (x = [x_1, x_2]) and the new co-ordinate axes w_1, w_2 (z = [z_1, z_2])]
• Each input will still have 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin
• To reduce dimension, we can keep only the co-ordinates along those directions that have the largest variances (e.g., in this example, if we want to reduce to one dimension, we can keep the co-ordinate z_1 of each point along w_1 and throw away z_2); we won't lose much information (a sketch follows below)
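Below is a small sketch of this change-of-axes picture on synthetic 2-D data (the data, and the use of numpy.linalg.eigh to obtain the new axes, are my choices for illustration): the columns of W are the new axes w_1, w_2, and z = W^T x gives each point's co-ordinates in the new system.

```python
# Change of axes on synthetic 2-D data: project onto the new orthonormal axes,
# then keep only the co-ordinate along the largest-variance axis.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.4], [1.4, 1.0]], size=500)

Xc = X - X.mean(axis=0)                 # center the data
S = Xc.T @ Xc / len(Xc)                 # 2x2 covariance matrix
evals, evecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
W = evecs[:, order]                     # new axes: w_1 (max variance), then w_2

Z = Xc @ W                              # co-ordinates in the new axes
print(Z.var(axis=0), evals[order])      # variance along w_1 >> variance along w_2

Z_1d = Z[:, :1]                         # keep only z_1: the 1-D reduction
```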
Principal Component Analysis: the algorithm (slide 6)
• Center the data (subtract the mean $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ from each data point)
• Compute the D × D covariance matrix S using the centered data matrix X as
  $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$ (assuming X is arranged as N × D)
• Do an eigendecomposition of the covariance matrix S (many methods exist)
• Take the top K < D leading eigenvectors {w_1, w_2, …, w_K} with eigenvalues {λ_1, λ_2, …, λ_K}
• The K-dimensional projection/embedding of each input is
  $\mathbf{z}_n \approx \mathbf{W}_K^\top\mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size D × K
• Note: we can decide how many eigenvectors to use based on how much variance we want to capture (recall that each λ_k gives the variance in the k-th direction, and their sum is the total variance); a code sketch follows below
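Here is a compact sketch of the algorithm above (the function name pca and the synthetic data are mine, not from the slides):

```python
# Center, form the covariance matrix, eigendecompose, keep the top-K directions.
import numpy as np

def pca(X, K):
    """X is N x D; returns (W_K of shape D x K, Z of shape N x K, top-K eigenvalues)."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # center the data
    S = Xc.T @ Xc / X.shape[0]           # D x D covariance matrix
    evals, evecs = np.linalg.eigh(S)     # eigh, since S is symmetric
    order = np.argsort(evals)[::-1][:K]  # indices of the top-K eigenvalues
    W_K = evecs[:, order]                # D x K projection matrix
    Z = Xc @ W_K                         # z_n = W_K^T (x_n - mu), stacked as rows
    return W_K, Z, evals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
W_K, Z, lam = pca(X, K=2)
print(Z.shape, lam)                      # (200, 2) and the two largest variances
```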
Understanding PCA: the variance perspective (slide 7)
Solving PCA by Finding Max. Variance Directions (slide 8)
• Consider projecting an input x_n ∈ ℝ^D along a direction w_1 ∈ ℝ^D
• The projection/embedding of x_n will be $\mathbf{w}_1^\top\mathbf{x}_n$
  [Figure: inputs x_n (red points) and their projections w_1^T x_n onto the direction w_1 (green points)]
• Mean of the projections of all inputs:
  $\frac{1}{N}\sum_{n=1}^{N}\mathbf{w}_1^\top\mathbf{x}_n = \mathbf{w}_1^\top\Big(\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\Big) = \mathbf{w}_1^\top\boldsymbol{\mu}$
• Variance of the projections:
  $\frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{w}_1^\top\mathbf{x}_n - \mathbf{w}_1^\top\boldsymbol{\mu}\big)^2 = \frac{1}{N}\sum_{n=1}^{N}\big\{\mathbf{w}_1^\top(\mathbf{x}_n - \boldsymbol{\mu})\big\}^2 = \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$
  where S is the D × D covariance matrix of the data: $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$; for already centered data, μ = 0 and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^\top = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$
• We want the w_1 for which the variance $\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$ is maximized:
  $\operatorname*{argmax}_{\mathbf{w}_1}\ \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1 \quad \text{s.t.}\quad \mathbf{w}_1^\top\mathbf{w}_1 = 1$
• We need this constraint, otherwise the objective's maximum will be unbounded (it can be made arbitrarily large just by scaling up w_1); a numerical check of the variance identity follows below
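A quick numerical check of the variance identity above, using synthetic data and a random unit-norm direction (both my choices):

```python
# Check that the variance of the scalar projections w_1^T x_n equals w_1^T S w_1.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # N x D data
mu = X.mean(axis=0)

w1 = rng.standard_normal(4)
w1 /= np.linalg.norm(w1)                 # enforce the unit-norm constraint

proj = X @ w1                            # scalar projections w_1^T x_n
S = (X - mu).T @ (X - mu) / len(X)       # covariance matrix

print(proj.var(), w1 @ S @ w1)           # the two numbers agree
```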
Max. Variance Direction (slide 9)
• Our objective function was $\operatorname*{argmax}_{\mathbf{w}_1}\ \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$ s.t. $\mathbf{w}_1^\top\mathbf{w}_1 = 1$
• Can construct a Lagrangian for this problem:
  $\operatorname*{argmax}_{\mathbf{w}_1}\ \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1 + \lambda_1\big(1 - \mathbf{w}_1^\top\mathbf{w}_1\big)$
• Taking the derivative w.r.t. w_1 and setting it to zero gives $\mathbf{S}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$
• Therefore w_1 is an eigenvector of the cov matrix S with eigenvalue λ_1 (note: in general, S will have D eigenvectors)
• Claim: w_1 is the eigenvector of S with the largest eigenvalue λ_1. Note that
  $\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1 = \lambda_1\mathbf{w}_1^\top\mathbf{w}_1 = \lambda_1$ (the variance along the direction w_1)
• Thus the variance $\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$ will be maximized if λ_1 is the largest eigenvalue (and w_1 is the corresponding top eigenvector, also known as the first Principal Component)
• Other large-variance directions can also be found likewise (with each being orthogonal to the directions found before it); PCA would keep the top K < D such directions of largest variance
• Note: the total variance of the data is equal to the sum of the eigenvalues of S, i.e., $\sum_{d=1}^{D}\lambda_d$
• A sketch of computing the top eigenvector follows below
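A sketch of the claim above that the max-variance direction is the top eigenvector of S. Power iteration is my choice here simply to make "top eigenvector" concrete; any eigensolver would do:

```python
# Find the first principal component as the top eigenvector of S via power iteration,
# and compare against a standard eigensolver.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

w = rng.standard_normal(6)
for _ in range(200):                     # power iteration: w <- S w / ||S w||
    w = S @ w
    w /= np.linalg.norm(w)

evals, evecs = np.linalg.eigh(S)
w_top = evecs[:, -1]                     # eigenvector with the largest eigenvalue
print(abs(w @ w_top))                    # ~1.0: same direction (up to sign)
print(w @ S @ w, evals[-1])              # variance along w equals lambda_1
```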
Understanding PCA: the reconstruction perspective (slide 10)
Alternate Basis and Reconstruction (slide 11)
• Representing a data point $\mathbf{x}_n = [x_{n1}, x_{n2}, \ldots, x_{nD}]^\top$ in the standard orthonormal basis e_1, e_2, …, e_D:
  $\mathbf{x}_n = \sum_{d=1}^{D} x_{nd}\,\mathbf{e}_d$
  (e_d is a vector of all zeros except a single 1 at the d-th position; also, $\mathbf{e}_d^\top\mathbf{e}_{d'} = 0$ for d ≠ d′)
• Let's represent the same data point in a new orthonormal basis w_1, w_2, …, w_D:
  $\mathbf{x}_n = \sum_{d=1}^{D} z_{nd}\,\mathbf{w}_d$
  where $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nD}]^\top$ denotes the co-ordinates of x_n in the new basis, and z_{nd} is the projection of x_n along the direction w_d since $z_{nd} = \mathbf{w}_d^\top\mathbf{x}_n = \mathbf{x}_n^\top\mathbf{w}_d$ (verify)
• Ignoring directions along which the projection z_{nd} is small, we can approximate x_n as
  $\mathbf{x}_n \approx \widehat{\mathbf{x}}_n = \sum_{d=1}^{K} z_{nd}\,\mathbf{w}_d = \sum_{d=1}^{K}(\mathbf{x}_n^\top\mathbf{w}_d)\,\mathbf{w}_d = \sum_{d=1}^{K}(\mathbf{w}_d\mathbf{w}_d^\top)\,\mathbf{x}_n$
• Now x_n is represented by the K < D dimensional rep. $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]^\top$, and (verify)
  $\mathbf{z}_n \approx \mathbf{W}_K^\top\mathbf{x}_n$ and $\mathbf{x}_n \approx \mathbf{W}_K\mathbf{z}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size D × K
• Note that $\mathbf{x}_n - \widehat{\mathbf{x}}_n$ is the reconstruction error for x_n (a sketch of these identities follows below)
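A sketch of these identities, taking the orthonormal basis W_K to be the top-K eigenvectors of the covariance matrix (as PCA does); the synthetic data is made up:

```python
# With orthonormal W_K: z_n = W_K^T x_n and x_hat_n = W_K z_n = sum_d (w_d w_d^T) x_n.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 8))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

evals, evecs = np.linalg.eigh(S)
W_K = evecs[:, np.argsort(evals)[::-1][:3]]     # D x K with K = 3, orthonormal columns

x = Xc[0]                                       # one (centered) data point
z = W_K.T @ x                                   # K-dim code
x_hat = W_K @ z                                 # reconstruction W_K z
x_hat2 = sum(np.outer(w, w) @ x for w in W_K.T) # same thing, term by term
print(np.allclose(x_hat, x_hat2))               # True
print(np.linalg.norm(x - x_hat))                # reconstruction error for this x
```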
Minimizing Reconstruction Error (slide 12)
• We plan to use only K directions w_1, w_2, …, w_K, so we would like them to be such that the total reconstruction error is minimized:
  $\mathcal{L}(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N}\|\mathbf{x}_n - \widehat{\mathbf{x}}_n\|^2 = \sum_{n=1}^{N}\Big\|\mathbf{x}_n - \sum_{d=1}^{K}(\mathbf{w}_d\mathbf{w}_d^\top)\mathbf{x}_n\Big\|^2 = C - \sum_{d=1}^{K}\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d \quad \text{(verify)}$
  where C is a constant that doesn't depend on the w_d's, and each $\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d$ is the variance along w_d
• Each optimal w_d can be found by solving
  $\operatorname*{argmin}_{\mathbf{w}_d}\ \mathcal{L}(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \operatorname*{argmax}_{\mathbf{w}_d}\ \mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d$
• Thus minimizing the reconstruction error is equivalent to maximizing the variance
• The K directions can be found by solving the eigendecomposition of S
• Note: $\sum_{d=1}^{K}\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d = \operatorname{trace}(\mathbf{W}_K^\top\mathbf{S}\mathbf{W}_K)$
• Thus $\operatorname*{argmax}_{\mathbf{W}_K}\ \operatorname{trace}(\mathbf{W}_K^\top\mathbf{S}\mathbf{W}_K)$ s.t. orthonormality on the columns of W_K is the same as solving the eigendecomposition of S (recall that Spectral Clustering also required solving a trace-maximization problem of this form)
• A numerical check of this equivalence follows below
Principal Component Analysis (slide 13)
• Center the data (subtract the mean $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ from each data point)
• Compute the D × D covariance matrix S using the centered data matrix X as
  $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$ (assuming X is arranged as N × D)
• Do an eigendecomposition of the covariance matrix S (many methods exist)
• Take the top K < D leading eigenvectors {w_1, w_2, …, w_K} with eigenvalues {λ_1, λ_2, …, λ_K}
• The K-dimensional projection/embedding of each input is
  $\mathbf{z}_n \approx \mathbf{W}_K^\top\mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ is the "projection matrix" of size D × K
• Note: we can decide how many eigenvectors to use based on how much variance we want to capture (recall that each λ_k gives the variance in the k-th direction, and their sum is the total variance)
Singular Value Decomposition (SVD) (slide 14)
• Any matrix X of size N × D can be represented as the following decomposition:
  $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{k=1}^{\min\{N,D\}}\lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$
• $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N]$ is the N × N matrix of left singular vectors, each u_n ∈ ℝ^N; U is orthonormal
• $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_D]$ is the D × D matrix of right singular vectors, each v_d ∈ ℝ^D; V is also orthonormal
• Λ is an N × D diagonal matrix of singular values: if N > D, its last N − D rows are all zeros; if D > N, its last D − N columns are all zeros (a NumPy sketch follows below)
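A short sketch of the decomposition above using NumPy (note that numpy.linalg.svd returns the singular values as a vector and V transposed, so the padded Λ below is my own bookkeeping):

```python
# X = U Λ V^T, with U and V orthogonal and Λ the N x D "diagonal" matrix.
import numpy as np

rng = np.random.default_rng(6)
N, D = 7, 4
X = rng.standard_normal((N, D))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, s: min(N,D), Vt: D x D
Lam = np.zeros((N, D))
Lam[:len(s), :len(s)] = np.diag(s)                # pad singular values into N x D form

print(np.allclose(X, U @ Lam @ Vt))               # True: X = U Λ V^T
print(np.allclose(U.T @ U, np.eye(N)),            # U and V are orthonormal
      np.allclose(Vt @ Vt.T, np.eye(D)))

# equivalently, X = sum_k λ_k u_k v_k^T
X_sum = sum(s[k] * np.outer(U[:, k], Vt[k]) for k in range(min(N, D)))
print(np.allclose(X, X_sum))                      # True
```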
Low-Rank Approximation via SVD (slide 15)
• If we just use the top K < min{N, D} singular values, we get a rank-K SVD approximation:
  $\widehat{\mathbf{X}} = \sum_{k=1}^{K}\lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$
• The above SVD approximation can be shown to minimize the reconstruction error $\|\mathbf{X} - \widehat{\mathbf{X}}\|$
• Fact: SVD gives the best rank-K approximation of a matrix
• PCA can be done by doing SVD on the covariance matrix S (since S is symmetric, its left and right singular vectors coincide with each other and with its eigenvectors); a sketch of the rank-K truncation follows below
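A sketch of the rank-K truncation above; the check that the squared Frobenius error equals the sum of the squared discarded singular values is the "best rank-K approximation" (Eckart-Young) fact mentioned above:

```python
# Keep only the top K singular values/vectors to get the best rank-K approximation.
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 20))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

K = 5
X_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K]        # rank-K approximation

err = np.linalg.norm(X - X_hat, "fro") ** 2
print(err, np.sum(s[K:] ** 2))                    # the two agree
```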
Dim-Red as Matrix Factorization (slide 16)
• If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix X:
  $\{\widehat{\mathbf{Z}}, \widehat{\mathbf{W}}\} = \operatorname*{argmin}_{\mathbf{Z},\mathbf{W}}\ \|\mathbf{X} - \mathbf{Z}\mathbf{W}\|^2$
  [Figure: X (N × D) ≈ Z (N × K) times W (K × D), where Z is the matrix containing the low-dim rep of X]
• If K < min{D, N}, such a factorization gives a low-rank approximation of the data matrix X
• Can solve such problems using ALT-OPT (a sketch follows below)
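A sketch of ALT-OPT (alternating least squares) for this factorization, with W taken as K × D as on this slide; the iteration count and the small ridge term added for numerical stability are my choices:

```python
# Alternately solve the least squares problem for Z with W fixed, then for W with Z fixed.
import numpy as np

rng = np.random.default_rng(8)
N, D, K = 200, 30, 5
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, D))   # a rank-K data matrix

Z = rng.standard_normal((N, K))
W = rng.standard_normal((K, D))
eps = 1e-8 * np.eye(K)                   # tiny ridge term for stable inverses

for it in range(50):
    # Z-step: for fixed W, each row z_n solves min ||x_n - W^T z_n||^2
    Z = X @ W.T @ np.linalg.inv(W @ W.T + eps)
    # W-step: for fixed Z, W solves min ||X - Z W||^2
    W = np.linalg.inv(Z.T @ Z + eps) @ Z.T @ X

print(np.linalg.norm(X - Z @ W) ** 2)    # close to 0 here, since X has rank K
```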