CS229 Lecture notes
Andrew Ng
Part XI
Principal components analysis
In our discussion of factor analysis, we gave a way to model data x ∈ R^d as
“approximately” lying in some k-dimensional subspace, where k ≪ d. Specifically,
we imagined that each point x^(i) was created by first generating some z^(i)
lying in the k-dimensional affine space {Λz + µ; z ∈ R^k}, and then adding
Ψ-covariance noise. Factor analysis is based on a probabilistic model, and
parameter estimation used the iterative EM algorithm.
In this set of notes, we will develop a method, Principal Components
Analysis (PCA), that also tries to identify the subspace in which the data
approximately lies. However, PCA will do so more directly, and will require
only an eigenvector calculation (easily done with the eig function in Matlab),
and does not need to resort to EM.
Suppose we are given a dataset {x^(i); i = 1, . . . , n} of attributes of n different
types of automobiles, such as their maximum speed, turn radius, and so on.
Let x^(i) ∈ R^d for each i (d ≪ n). But unknown to us, two different
attributes—some x_i and x_j—respectively give a car’s maximum speed measured
in miles per hour, and the maximum speed measured in kilometers per hour.
These two attributes are therefore almost linearly dependent, up to
only small differences introduced by rounding off to the nearest mph or kph.
Thus, the data really lies approximately on a (d − 1)-dimensional subspace.
How can we automatically detect, and perhaps remove, this redundancy?
For a less contrived example, consider a dataset resulting from a survey of
pilots for radio-controlled helicopters, where x^(i)_1 is a measure of the piloting
skill of pilot i, and x^(i)_2 captures how much he/she enjoys flying. Because
RC helicopters are very difficult to fly, only the most committed students,
ones that truly enjoy flying, become good pilots. So, the two attributes
x1 and x2 are strongly correlated. Indeed, we might posit that the
data actually lies along some diagonal axis (the u1 direction) capturing the
intrinsic piloting “karma” of a person, with only a small amount of noise
lying off this axis. (See figure.) How can we automatically compute this u1
direction?
[Figure: scatter of the pilot data with x1 (skill) and x2 (enjoyment) as the axes; the direction u1 runs along the diagonal of the data and u2 is orthogonal to it.]
We will shortly develop the PCA algorithm. But prior to running PCA
per se, typically we first preprocess the data by normalizing each feature
to have mean 0 and variance 1. We do this by subtracting the mean and
dividing by the empirical standard deviation:
    x^(i)_j ← (x^(i)_j − µj) / σj

where µj = (1/n) ∑_{i=1}^{n} x^(i)_j and σj^2 = (1/n) ∑_{i=1}^{n} (x^(i)_j − µj)^2 are the mean and
variance of feature j, respectively.
Subtracting µj zeros out the mean and may be omitted for data known
to have zero mean (for instance, time series corresponding to speech or other
acoustic signals). Dividing by the standard deviation σj rescales each coor-
dinate to have unit variance, which ensures that different attributes are all
treated on the same “scale.” For instance, if x1 was cars’ maximum speed in
mph (taking values in the high tens or low hundreds) and x2 were the num-
ber of seats (taking values around 2-4), then this renormalization rescales
the different attributes to make them more comparable. This rescaling may
be omitted if we had a priori knowledge that the different attributes are all
on the same scale. One example of this is if each data point represented a
grayscale image, and each x^(i)_j took a value in {0, 1, . . . , 255} corresponding
to the intensity value of pixel j in image i.
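
To make this preprocessing step concrete, here is a minimal sketch in Python/NumPy (not part of the original notes; the function name and the assumption that X holds one example per row are ours):

```python
import numpy as np

def normalize_features(X):
    """Normalize each column (feature) of the (n, d) data matrix X to have
    mean 0 and variance 1. Returns the normalized data together with the
    per-feature means and standard deviations, so the same transformation
    can be reused on new data."""
    mu = X.mean(axis=0)        # mu_j = (1/n) * sum_i x_j^(i)
    sigma = X.std(axis=0)      # sigma_j, the empirical standard deviation of feature j
    return (X - mu) / sigma, mu, sigma

# Example usage: X_norm, mu, sigma = normalize_features(X)
```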
Now, having normalized our data, how do we compute the “major axis
of variation” u—that is, the direction on which the data approximately lies?
One way is to pose this problem as finding the unit vector u so that when
the data is projected onto the direction corresponding to u, the variance of
the projected data is maximized. Intuitively, the data starts off with some
amount of variance/information in it. We would like to choose a direction u
so that if we were to approximate the data as lying in the direction/subspace
corresponding to u, as much as possible of this variance is still retained.
Consider the following dataset, on which we have already carried out the
normalization steps:

[Figure: scatter plot of the normalized data.]

Now, suppose we pick u to correspond to the direction shown in the
figure below. The circles denote the projections of the original data onto this
line.
[Figure: the data points (crosses) and their projections (circles) onto the chosen direction u; the projected points are spread far from the origin.]
We see that the projected data still has a fairly large variance, and the
points tend to be far from zero. In contrast, suppose we had instead picked the
following direction:
[Figure: the same data projected onto a different direction u; here the projections lie close to the origin and have much smaller variance.]
Here, the projections have a significantly smaller variance, and are much
closer to the origin.
We would like to automatically select the direction u corresponding to
the first of the two figures shown above. To formalize this, note that given a
unit vector u and a point x, the length of the projection of x onto u is given
by x^T u. I.e., if x^(i) is a point in our dataset (one of the crosses in the plot),
then its projection onto u (the corresponding circle in the figure) is distance
x^T u from the origin. Hence, to maximize the variance of the projections, we
would like to choose a unit-length u so as to maximize:

    (1/n) ∑_{i=1}^{n} (x^(i)T u)^2 = (1/n) ∑_{i=1}^{n} u^T x^(i) x^(i)T u
                                   = u^T ( (1/n) ∑_{i=1}^{n} x^(i) x^(i)T ) u.
We easily recognize that maximizing this subject to ∥u∥_2 = 1 gives the
principal eigenvector of Σ = (1/n) ∑_{i=1}^{n} x^(i) x^(i)T, which is just the empirical
covariance matrix of the data (assuming it has zero mean).[1]
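
For completeness, here is the Lagrange-multiplier argument that footnote [1] asks you to work out. Form the Lagrangian

    L(u, λ) = u^T Σ u − λ(u^T u − 1).

Setting its gradient with respect to u to zero gives 2Σu − 2λu = 0, i.e. Σu = λu, so any stationary point u is an eigenvector of Σ with some eigenvalue λ. At such a point the objective equals u^T Σ u = λ u^T u = λ, so the maximum over unit vectors is attained by an eigenvector with the largest eigenvalue, i.e. the principal eigenvector.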
To summarize, we have found that if we wish to find a 1-dimensional
subspace with which to approximate the data, we should choose u to be the
principal eigenvector of Σ. More generally, if we wish to project our data
into a k-dimensional subspace (k < d), we should choose u1, . . . , uk to be the
top k eigenvectors of Σ. The ui’s now form a new, orthogonal basis for the
data.[2]
Then, to represent x^(i) in this basis, we need only compute the corresponding
vector

    y^(i) = ( u1^T x^(i), u2^T x^(i), . . . , uk^T x^(i) )^T ∈ R^k.
Thus, whereas x^(i) ∈ R^d, the vector y^(i) now gives a lower-dimensional
(k-dimensional) approximation/representation for x^(i). PCA is therefore also
referred to as a dimensionality reduction algorithm. The vectors u1, . . . , uk
are called the first k principal components of the data.
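
As an illustrative summary of the whole procedure (a sketch only, not code from the notes; the function name pca and its interface are made up), one could implement PCA on already-normalized data as:

```python
import numpy as np

def pca(X, k):
    """PCA on an (n, d) matrix X whose features are already normalized.

    Returns U, a (d, k) matrix whose columns are the top-k principal
    components (eigenvectors of the empirical covariance matrix), and
    Y, the (n, k) matrix of projected representations y^(i) = U^T x^(i)."""
    n, d = X.shape
    Sigma = (X.T @ X) / n                      # Sigma = (1/n) * sum_i x^(i) x^(i)^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:k]]                  # top-k eigenvectors as columns
    Y = X @ U                                  # row i of Y is y^(i)
    return U, Y
```

In practice the same directions are often obtained from a singular value decomposition of the data matrix rather than by forming Σ explicitly, which tends to be more numerically stable when d is large; the recovered subspace is the same.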
Remark. Although we have shown it formally only for the case of k = 1,
using well-known properties of eigenvectors it is straightforward to show that,
of all possible orthogonal bases u1, . . . , uk, the one that we have chosen
maximizes ∑_i ∥y^(i)∥_2^2. Thus, our choice of a basis preserves as much variability
as possible in the original data.

[1] If you haven’t seen this before, try using the method of Lagrange multipliers to
maximize u^T Σ u subject to u^T u = 1. You should be able to show that Σu = λu, for some
λ, which implies u is an eigenvector of Σ, with eigenvalue λ.

[2] Because Σ is symmetric, the ui’s will be (or can always be chosen to be) orthogonal to
each other.
In problem set 4, you will see that PCA can also be derived by picking
the basis that minimizes the approximation error arising from projecting the
data onto the k-dimensional subspace spanned by them.
PCA has many applications; we will close our discussion with a few exam-
ples. First, compression—representing the x^(i)’s with lower-dimensional y^(i)’s—is
an obvious application. If we reduce high dimensional data to k = 2 or 3 di-
mensions, then we can also plot the y^(i)’s to visualize the data. For instance,
if we were to reduce our automobiles data to 2 dimensions, then we can plot
it (one point in our plot would correspond to one car type, say) to see what
cars are similar to each other and what groups of cars may cluster together.
Another standard application is to preprocess a dataset to reduce its
dimension before running a supervised learning algorithm with the
x^(i)’s as inputs. Apart from computational benefits, reducing the data’s
dimension can also reduce the complexity of the hypothesis class considered
and help avoid overfitting (e.g., linear classifiers over lower dimensional input
spaces will have smaller VC dimension).
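
As a hypothetical usage sketch of this preprocessing idea (it assumes arrays X_train, X_test, and targets t_train exist, and reuses the pca function sketched above; a plain least-squares linear predictor stands in for whatever supervised learner one actually intends to run):

```python
import numpy as np

# X_train, X_test: normalized (n, d) feature matrices; t_train: training targets.
U, Y_train = pca(X_train, k=10)    # learn a 10-dimensional subspace from the training inputs
Y_test = X_test @ U                # project new inputs using the same principal components

# Fit a simple least-squares linear predictor on the reduced inputs.
w, *_ = np.linalg.lstsq(Y_train, t_train, rcond=None)
predictions = Y_test @ w
```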
Lastly, as in our RC pilot example, we can also view PCA as a noise
reduction algorithm. In our example, it estimates the intrinsic “piloting
karma” from the noisy measures of piloting skill and enjoyment. In class, we
also saw the application of this idea to face images, resulting in the eigenfaces
method. Here, each point x^(i) ∈ R^10000 was a 10000-dimensional vector,
with each coordinate corresponding to a pixel intensity value in a 100x100
image of a face. Using PCA, we represent each image x^(i) with a much
lower-dimensional y^(i). In doing so, we hope that the principal components we
found retain the interesting, systematic variations between faces that capture
what a person really looks like, but not the “noise” in the images introduced
by minor lighting variations, slightly different imaging conditions, and so on.
We then measure distances between faces i and j by working in the reduced
dimension, and computing ∥y^(i) − y^(j)∥_2. This resulted in a surprisingly good
face-matching and retrieval algorithm.
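
A minimal sketch of this face-matching idea, under the assumption that a matrix faces of flattened 100x100 images is available and reusing the pca sketch from above (the choice k = 50 is arbitrary and purely illustrative):

```python
import numpy as np

# faces: (n, 10000) array, one flattened 100x100 grayscale image per row.
# Pixel intensities share a common scale, so we only subtract the mean face
# rather than rescaling each coordinate by its standard deviation.
mean_face = faces.mean(axis=0)
U, Y = pca(faces - mean_face, k=50)

def face_distance(i, j):
    """Distance between faces i and j measured in the reduced k-dimensional space."""
    return np.linalg.norm(Y[i] - Y[j])

# Closest match to face 0 among the remaining faces:
best_match = min(range(1, Y.shape[0]), key=lambda j: face_distance(0, j))
```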