18.650 – Fundamentals of Statistics
8. Principal Component Analysis (PCA)

Multivariate statistics

▶ Let X be a d-dimensional random vector and X_1, . . . , X_n be n independent copies of X.
▶ Write X_i = (X_i^(1), . . . , X_i^(d))^⊤, i = 1, . . . , n.
▶ Denote by 𝕏 the random n × d matrix

        ( ··· X_1^⊤ ··· )
    𝕏 = (       ⋮       )
        ( ··· X_n^⊤ ··· )

Multivariate statistics

▶ Assume that E[‖X‖_2²] < ∞.
▶ Mean of X:

    E[X] = (E[X^(1)], . . . , E[X^(d)])^⊤.

▶ Covariance matrix of X: the matrix Σ = (σ_{j,k})_{j,k=1,...,d}, where σ_{j,k} = cov(X^(j), X^(k)).
▶ It is easy to see that

    Σ = E[XX^⊤] − E[X]E[X]^⊤ = E[(X − E[X])(X − E[X])^⊤].

Multivariate statistics

▶ Empirical mean of X_1, . . . , X_n:

    X̄ = (1/n) Σ_{i=1}^n X_i = (X̄^(1), . . . , X̄^(d))^⊤.

▶ Empirical covariance of X_1, . . . , X_n: the matrix S = (s_{j,k})_{j,k=1,...,d}, where s_{j,k} is the empirical covariance of the pairs X_i^(j), X_i^(k), i = 1, . . . , n.
▶ It is easy to see that

    S = (1/n) Σ_{i=1}^n X_i X_i^⊤ − X̄X̄^⊤ = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^⊤.
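
The two formulas for S above can be checked numerically. A minimal sketch (not part of the original slides; variable names are ours), assuming numpy and a synthetic cloud of points:

```python
import numpy as np

# Compute the empirical covariance matrix S of a cloud X_1, ..., X_n
# in R^d in both ways stated on the slide and check that they agree.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))          # rows are the observations X_i

Xbar = X.mean(axis=0)                # empirical mean

# S = (1/n) * sum_i X_i X_i^T  -  Xbar Xbar^T
S1 = (X.T @ X) / n - np.outer(Xbar, Xbar)

# S = (1/n) * sum_i (X_i - Xbar)(X_i - Xbar)^T
Xc = X - Xbar
S2 = (Xc.T @ Xc) / n

assert np.allclose(S1, S2)
```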

Multivariate statistics

▶ Note that X̄ = (1/n) 𝕏^⊤ 𝟙, where 𝟙 = (1, . . . , 1)^⊤ ∈ ℝⁿ.
▶ Note also that

    S = (1/n) 𝕏^⊤𝕏 − (1/n²) 𝕏^⊤𝟙𝟙^⊤𝕏 = (1/n) 𝕏^⊤H𝕏,

  where H = I_n − (1/n) 𝟙𝟙^⊤.
▶ H is an orthogonal projector: H² = H, H^⊤ = H. (Onto what subspace?)
▶ If u ∈ ℝᵈ:
  ▶ u^⊤Σu = var(u^⊤X);
  ▶ u^⊤Su is the sample variance of u^⊤X_1, . . . , u^⊤X_n.
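
The projector identity can be verified directly. A sketch (ours, not from the slides), assuming numpy; note that H centers each column, i.e., it projects onto the subspace orthogonal to the all-ones vector:

```python
import numpy as np

# Check that H = I_n - (1/n) 1 1^T is an orthogonal projector
# and that S = (1/n) X^T H X matches the usual empirical covariance.
rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))

ones = np.ones(n)
H = np.eye(n) - np.outer(ones, ones) / n

assert np.allclose(H @ H, H)     # idempotent: H^2 = H
assert np.allclose(H, H.T)       # symmetric: H^T = H

S = (X.T @ H @ X) / n                           # (1/n) X^T H X
S_direct = np.cov(X, rowvar=False, bias=True)   # same matrix, computed directly
assert np.allclose(S, S_direct)
```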

Multivariate statistics

▶ In particular, u^⊤Su measures how spread out (i.e., diverse) the points are in the direction u.
▶ If u^⊤Su = 0, then all the X_i's lie in an affine subspace orthogonal to u.
▶ If u^⊤Σu = 0, then X lies almost surely in an affine subspace orthogonal to u.
▶ If u^⊤Su is large with ‖u‖_2 = 1, then the direction u explains well the spread (i.e., diversity) of the sample.

Review of linear algebra

▶ In particular, Σ and S are symmetric and positive semi-definite.
▶ Any real symmetric matrix A ∈ ℝ^{d×d} has the spectral decomposition

    A = PDP^⊤,

  where:
  ▶ P is a d × d orthogonal matrix, i.e., PP^⊤ = P^⊤P = I_d;
  ▶ D is diagonal.
▶ The diagonal elements of D are the eigenvalues of A, and the columns of P are the corresponding eigenvectors of A.
▶ A is positive semi-definite iff all its eigenvalues are nonnegative.
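
A sketch of the spectral decomposition in numpy (ours, not from the slides), using `numpy.linalg.eigh`, which is designed for symmetric matrices:

```python
import numpy as np

# Spectral decomposition A = P D P^T of a real symmetric matrix.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                       # symmetrize to get a symmetric matrix

eigvals, P = np.linalg.eigh(A)          # eigh returns eigenvalues in ascending order
D = np.diag(eigvals)

assert np.allclose(P @ D @ P.T, A)      # A = P D P^T
assert np.allclose(P.T @ P, np.eye(4))  # P is orthogonal
```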

Principal Component Analysis

▶ The sample X_1, . . . , X_n forms a cloud of points in ℝᵈ.
▶ In practice, d is large. If d > 3, it becomes impossible to represent the cloud in a picture.
▶ Question: Is it possible to project the cloud onto a linear subspace of dimension d′ ≪ d while keeping as much information as possible?
▶ Answer: PCA does this by keeping as much of the covariance structure as possible, via orthogonal directions that discriminate well between the points of the cloud.

Variances

▶ Idea: Write S = PDP^⊤, where:
  ▶ P = (v_1, . . . , v_d) is an orthogonal matrix, i.e., ‖v_j‖_2 = 1 and v_j^⊤v_k = 0 for all j ≠ k;
  ▶ D = diag(λ_1, . . . , λ_d), with λ_1 ≥ . . . ≥ λ_d ≥ 0.
▶ Note that D is the empirical covariance matrix of the P^⊤X_i's, i = 1, . . . , n.
▶ In particular, λ_1 is the empirical variance of the v_1^⊤X_i's; λ_2 is the empirical variance of the v_2^⊤X_i's, etc.

Projection

▶ So each λ_j measures the spread of the cloud in the direction v_j.
▶ In particular, v_1 is the direction of maximal spread.
▶ Indeed, v_1 maximizes the empirical variance of a^⊤X_1, . . . , a^⊤X_n over all a ∈ ℝᵈ such that ‖a‖_2 = 1.
▶ Proof: For any unit vector a, show that

    a^⊤Sa = (P^⊤a)^⊤ D (P^⊤a) ≤ λ_1,

  with equality if a = v_1.
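
The maximality of v_1 can be checked numerically. A sketch (ours), assuming numpy: no random unit vector achieves a larger Rayleigh quotient a^⊤Sa than the top eigenvector.

```python
import numpy as np

# Check that the top eigenvector v_1 of S maximizes a^T S a over unit vectors.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated cloud
S = np.cov(X, rowvar=False, bias=True)

eigvals, P = np.linalg.eigh(S)
lam1, v1 = eigvals[-1], P[:, -1]       # largest eigenvalue and its eigenvector

assert np.isclose(v1 @ S @ v1, lam1)   # v_1 attains the maximum lambda_1
for _ in range(1000):                  # random unit vectors never beat v_1
    a = rng.normal(size=5)
    a /= np.linalg.norm(a)
    assert a @ S @ a <= lam1 + 1e-12
```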

Principal Component Analysis: Main principle

▶ Idea of PCA: Find the collection of orthogonal directions in which the cloud is most spread out.

Theorem

    v_1 ∈ argmax_{‖u‖=1} u^⊤Su,
    v_2 ∈ argmax_{‖u‖=1, u⊥v_1} u^⊤Su,
    · · ·
    v_d ∈ argmax_{‖u‖=1, u⊥v_j, j=1,...,d−1} u^⊤Su.

Hence, the k orthogonal directions in which the cloud is most spread out correspond exactly to the eigenvectors associated with the k largest eigenvalues of S. They are called principal directions.

Principal Component Analysis: Algorithm

1. Input: X_1, . . . , X_n: cloud of n points in dimension d.
2. Step 1: Compute the empirical covariance matrix S.
3. Step 2: Compute the spectral decomposition S = PDP^⊤, where D = diag(λ_1, . . . , λ_d), with λ_1 ≥ λ_2 ≥ . . . ≥ λ_d, and P = (v_1, . . . , v_d) is an orthogonal matrix.
4. Step 3: Choose k < d and set P_k = (v_1, . . . , v_k) ∈ ℝ^{d×k}.
5. Output: Y_1, . . . , Y_n, where Y_i = P_k^⊤ X_i ∈ ℝᵏ, i = 1, . . . , n.

Question: How to choose k?
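
The algorithm above can be sketched in a few lines of numpy (function and variable names are ours, not from the slides):

```python
import numpy as np

# Minimal sketch of the slide's algorithm:
# covariance -> spectral decomposition -> keep top-k directions -> project.
def pca(X, k):
    """X: (n, d) array of observations; returns the (n, k) scores Y_i = P_k^T X_i."""
    S = np.cov(X, rowvar=False, bias=True)   # Step 1: empirical covariance
    eigvals, P = np.linalg.eigh(S)           # Step 2: S = P D P^T (ascending order)
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    P_k = P[:, order[:k]]                    # Step 3: P_k = (v_1, ..., v_k)
    return X @ P_k                           # Output: rows are Y_i^T = X_i^T P_k

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
Y = pca(X, k=2)
assert Y.shape == (100, 2)
```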

How to choose the number of principal components k?

▶ Experimental rule: Take k where there is an inflection point in the sequence λ_1, . . . , λ_d (scree plot).
▶ Define a criterion: Take k such that

    proportion of explained variance = (λ_1 + . . . + λ_k) / (λ_1 + . . . + λ_d) ≥ 1 − α,

  for some α ∈ (0, 1) that determines the approximation error that the practitioner wants to achieve.
▶ Remark: λ_1 + . . . + λ_k is called the variance explained by the PCA, and λ_1 + . . . + λ_d = tr(S) is the total variance.
▶ Data visualization: Take k = 2 or 3.
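
The explained-variance criterion can be sketched as follows (a toy example of ours, assuming numpy), taking the smallest k whose cumulative eigenvalue share reaches 1 − α:

```python
import numpy as np

# Smallest k with (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d) >= 1 - alpha.
def choose_k(eigvals, alpha=0.1):
    lam = np.sort(eigvals)[::-1]         # lambda_1 >= ... >= lambda_d
    ratio = np.cumsum(lam) / lam.sum()   # proportions of explained variance
    return int(np.argmax(ratio >= 1 - alpha)) + 1

lam = np.array([5.0, 3.0, 1.0, 0.5, 0.5])   # toy spectrum, total variance = 10
assert choose_k(lam, alpha=0.2) == 2        # (5 + 3) / 10 = 0.8 >= 0.8
assert choose_k(lam, alpha=0.05) == 4       # (5 + 3 + 1 + 0.5) / 10 = 0.95
```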

Example: Expression of 500,000 genes among 1400 Europeans

[Figure not reproduced in this transcript.]

Principal Component Analysis – Beyond practice

▶ PCA is an algorithm that reduces the dimension of a cloud of points while keeping as much of its covariance structure as possible.
▶ In practice, this algorithm is used for clouds of points that are not necessarily random.
▶ In statistics, PCA can be used for estimation.
▶ If X_1, . . . , X_n are i.i.d. random vectors in ℝᵈ, how to estimate their population covariance matrix Σ?
▶ If n ≫ d, then the empirical covariance matrix S is a consistent estimator.
▶ In many applications, n ≪ d (e.g., gene expression). Solution: sparse PCA.

Principal Component Analysis – Beyond practice

▶ It may be known beforehand that Σ has (almost) low rank.
▶ Then, run PCA on S: Write S ≈ S′, where

    S′ = P diag(λ_1, . . . , λ_k, 0, . . . , 0) P^⊤.

▶ S′ will be a better estimator of S under the low-rank assumption.
▶ A theoretical analysis would lead to an optimal choice of the tuning parameter k.
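
The truncation S′ can be sketched as follows (ours, not from the slides), assuming numpy: zero out all but the k largest eigenvalues in the spectral decomposition of S.

```python
import numpy as np

# S' = P diag(lambda_1, ..., lambda_k, 0, ..., 0) P^T:
# keep only the k leading terms of the spectral decomposition of S.
def truncate(S, k):
    eigvals, P = np.linalg.eigh(S)   # eigenvalues in ascending order
    lam = eigvals.copy()
    lam[:-k] = 0.0                   # zero out all but the k largest
    return P @ np.diag(lam) @ P.T

rng = np.random.default_rng(5)
B = rng.normal(size=(6, 3))
S = B @ B.T / 6                      # PSD matrix of rank <= 3
S_trunc = truncate(S, k=3)
assert np.allclose(S_trunc, S)       # rank-3 truncation recovers a rank-3 S
```

When S already has rank at most k, truncation changes nothing; otherwise it replaces S with its best rank-k approximation in the spectral sense.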
