Principal Component Analysis
Mason Ziemer
12/2/16
Abstract
One problem that often crops up in data analysis is the presence of a high-dimensional dataset. In this report we will explore a specific dimension-reduction technique called Principal Component Analysis. The aim of this report is to use eigenvectors, eigenvalues, and orthogonality to understand the concept of Principal Component Analysis (PCA) and to show why PCA is useful.
Introduction
The aim of Principal Component Analysis lies in its title: finding the principal components of the data. PCA is used to project data into a new, lower-dimensional coordinate system whose axes correspond to the principal components. What is PCA useful for? It reduces the dimensionality of a dataset, which in turn improves the efficiency of running a machine learning algorithm on it; it also simplifies the dataset, allowing these algorithms to run faster. So, what is a principal component? A principal component is the direction along which the most variance lies in the data. To get a visual, the first principal component of data on the x-y plane is shown below.

[Figure: data on the x-y plane, with the first principal component drawn along the direction of greatest variance.]
As you can see above, the first principal component is the line along which the data varies the most. Let's say we want to project the data onto the first principal component only. This would effectively reduce the dimension of the dataset from two dimensions to one while retaining as much information as possible. Although the projection will discard some information, it still holds information from both x and y. This is what the projection looks like:
[Figure: the data projected onto the first principal component of the x-y plane.]
Now, looking back at the original graph, the second principal component must be orthogonal to the first in order to capture the most remaining variance that the first principal component did not. Here is what the second principal component looks like:

[Figure: the data on the x-y plane with the first principal component and the orthogonal second principal component both drawn.]
If we were to perform PCA and project the data onto the first two principal components, then none of the information would be lost in the transformation. This is because we are transforming the data from the x-y plane, which has two dimensions, into a new two-dimensional space whose axes are the two principal components. Completing this transformation merely rotates the data onto the new axes and looks like this:
[Figure: the same data rotated onto the new axes defined by the first and second principal components.]
As you can see, none of the data has changed; we are just looking at it from a different angle.
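This rotation idea is easy to verify numerically. Below is a minimal sketch in R using simulated two-dimensional data (not from this report): rotating the centered data onto its eigenvector axes and rotating back recovers the original data exactly.

# A minimal sketch with simulated data: rotating centered 2-D data
# onto its principal component axes loses no information.
set.seed(1)                                # arbitrary seed, for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)          # y correlated with x
D <- scale(cbind(x, y), center = TRUE, scale = FALSE)

V <- eigen(cov(D))$vectors                 # columns are the principal components
rotated <- D %*% V                         # the data expressed on the new axes

# The eigenvector matrix is orthogonal, so the rotation is invertible:
max(abs(rotated %*% t(V) - D))             # effectively zero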
Eigenvalues and Eigenvectors
In mathematical terms, the principal components are the eigenvectors of the covariance matrix of the dataset. How to obtain the covariance matrix, along with its eigenvectors and eigenvalues, will be illustrated in the example below. The eigenvectors of the covariance matrix point in the directions along which the most variance in the data lies. Each eigenvector has a corresponding eigenvalue, a scalar that denotes the amount of variance in the data along its corresponding eigenvector. There can only be as many eigenvectors with corresponding eigenvalues as there are variables in the dataset. The larger an eigenvalue, the more variance in the data its eigenvector accounts for. In the previous example, since the data lies on the x-y plane, there are only two eigenvectors with corresponding eigenvalues. For any dataset, the first principal component is the eigenvector that corresponds to the largest eigenvalue. It is also important to note that the matrix formed by the dataset does not have to be square; the variables make up the columns of the matrix, while the observations make up the rows. It does not have to be square because we take the eigenvalues from the covariance matrix, which is always square, as will be explained in the example below.
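As a quick illustration of this point (a small sketch using the iris dataset introduced in the next section), the data matrix has 150 rows but its covariance matrix is square:

dim(iris[-5])       # 150 rows (observations) by 4 columns (variables)
dim(cov(iris[-5]))  # the covariance matrix is 4 by 4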
Example
For an example and implementation of PCA, I will refer to the iris dataset in R. Iris contains measurements, in cm, of 150 iris flowers on four different features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. The dataset also contains the species of each iris flower; the three different species are Setosa, Versicolor, and Virginica. The four features make up the columns of our matrix, while the 150 observations make up the rows. Here is what the first six rows of the data look like.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
The next step is to find the covariance matrix, which can be computed with the following formula:
$$\mathrm{COV}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
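As a small sketch, the formula can be applied by hand in R and checked against the built-in cov() function, using two columns of the iris dataset from above:

# Covariance of two iris features computed directly from the formula
X <- iris$Sepal.Length
Y <- iris$Petal.Length
n <- length(X)
sum((X - mean(X)) * (Y - mean(Y))) / (n - 1)   # 1.274315
cov(X, Y)                                      # agrees with the formula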
Since our dataset has four variables (a four-dimensional dataset), the covariance between all four variables can be measured. It is also important to remember that the covariance of a variable with itself, COV(X, X), just equals its variance, VAR(X). Suppose we use arbitrary variables W, X, Y, and Z to set up the covariance matrix for this example. The resulting 4x4 matrix will look like this:
$$\begin{bmatrix}
\mathrm{VAR}(W)   & \mathrm{COV}(W,X) & \mathrm{COV}(W,Y) & \mathrm{COV}(W,Z) \\
\mathrm{COV}(X,W) & \mathrm{VAR}(X)   & \mathrm{COV}(X,Y) & \mathrm{COV}(X,Z) \\
\mathrm{COV}(Y,W) & \mathrm{COV}(Y,X) & \mathrm{VAR}(Y)   & \mathrm{COV}(Y,Z) \\
\mathrm{COV}(Z,W) & \mathrm{COV}(Z,X) & \mathrm{COV}(Z,Y) & \mathrm{VAR}(Z)
\end{bmatrix}$$
It is also important to note that COV(X, Y) is equal to COV(Y, X); hence the matrix is symmetric about the diagonal, and the diagonal holds the variances of W, X, Y, and Z. The covariance matrix for our dataset can be obtained in R with the following command.
> cov(iris[-5])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
Now that we have obtained the covariance matrix for the iris dataset, we can go ahead and find its eigenvectors and their corresponding eigenvalues. Remember, an eigenvector is a nonzero vector $\vec{x}$ such that $A\vec{x} = \lambda\vec{x}$ for some scalar $\lambda$. The scalar $\lambda$ is the eigenvalue for the corresponding eigenvector. Solving for the eigenvectors and eigenvalues, we get:
> eig <- eigen(cov(iris[-5]))
> eig$vectors
                      v1          v2          v3         v4
Sepal.Length  0.36138659 -0.65658877 -0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143  0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266  0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102  0.54583143  0.7536574
As you can see, our first eigenvector, better known as the first principal component, is dominated by Petal.Length, with a value of 0.85667. This means that Petal.Length captures the most variation in the data along the first dimension. So, if we wanted to reduce our dataset to one variable, Petal.Length would be the best choice.
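Before turning to the eigenvalues, we can sanity-check the definition numerically (a small sketch): the first eigenvector should satisfy $A\vec{v} = \lambda\vec{v}$, where A is the covariance matrix.

A  <- cov(iris[-5])                 # the covariance matrix
v1 <- eig$vectors[, 1]              # first eigenvector
lambda1 <- eig$values[1]            # its eigenvalue
max(abs(A %*% v1 - lambda1 * v1))   # effectively zero, so A v1 = lambda1 v1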
> eig$values
[1] 4.22824171 0.24267075 0.07820950 0.02383509
The eigenvalues of the covariance matrix are able to tell us how much variance is explained by each eigenvector. Note that the first eigenvalue, 4.228, is much larger than the following three. Thus, the proportion of variance explained by the first eigenvector is equal to:
$$\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4} = 0.9246$$
This means that 92.46% of the variance in the data is captured by the first principal component. If we want the proportion of overall variance explained by the first two principal components, we just add $\lambda_2$ to the numerator, and the equation then equals 97.77%. So, 97.77% of the variance in the data containing 4 variables can be explained by the first two principal components.
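The same proportions can be computed directly in R (a small sketch using the eig object from above):

eig$values / sum(eig$values)          # 0.9246 0.0531 0.0171 0.0052
cumsum(eig$values) / sum(eig$values)  # cumulative: 0.9246 0.9777 0.9948 1.0000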
Projection
Now that we have obtained the eigenvectors and eigenvalues, it is time to project the data onto fewer dimensions. Since we computed above that the first two principal components make up almost 98% of the variance in the data, we will project our data onto the first two principal components. To find the coordinates, we solve the equation A = XV, where X is our original matrix with 4 columns and 150 rows (note that this matrix has to be centered so each column has mean 0), V is our matrix of eigenvectors, and A is the matrix of coordinates in the new principal component space spanned by the eigenvectors in V.
X <- scale(iris[1:4], center = TRUE, scale = FALSE)  # center each variable at mean 0
scores <- data.frame(X %*% eig$vectors)              # A = XV: coordinates on the new axes
colnames(scores) <- c("Prin1", "Prin2", "Prin3", "Prin4")
scores[1:10, ]
Prin1 Prin2 Prin3 Prin4
1 -2.684126 -0.31939725 -0.02791483 0.002262437
2 -2.714142 0.17700123 -0.21046427 0.099026550
3 -2.888991 0.14494943 0.01790026 0.019968390
4 -2.745343 0.31829898 0.03155937 -0.075575817
5 -2.728717 -0.32675451 0.09007924 -0.061258593
6 -2.280860 -0.74133045 0.16867766 -0.024200858
7 -2.820538 0.08946138 0.25789216 -0.048143106
8 -2.626145 -0.16338496 -0.02187932 -0.045297871
9 -2.886383 0.57831175 0.02075957 -0.026744736
10 -2.672756 0.11377425 -0.19763272 -0.056295401
The commands above give the coordinates, or scores, for each principal component. Since we know about 98% of the variance in the data is captured by the first two principal components, we will use the first two columns of coordinates from above to plot our dataset in 2 dimensions with the following command in R. The axes of this new two-dimensional projection are the first two principal components.
plot(scores$Prin1, scores$Prin2,
     main = "Data Projected on First 2 Principal Components",
     xlab = "First Principal Component",
     ylab = "Second Principal Component",
     col  = c("green", "red", "blue")[iris$Species])
[Figure: the iris data plotted on the first two principal components, colored by species.]
Note: the three different colors represent the species of iris flower.
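As a cross-check (a sketch, not part of the original workflow), R's built-in prcomp() function performs the same analysis; its scores should match ours, though the sign of any individual component may be flipped, which is expected since an eigenvector is only determined up to sign.

pca <- prcomp(iris[-5])   # prcomp centers the data by default
head(pca$x[, 1:2])        # compare with scores$Prin1 and scores$Prin2
pca$sdev^2                # the squared sdevs reproduce the eigenvalues above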
Conclusion
What was just accomplished was the exact goal of PCA. We were able to effectively reduce our iris dataset from four dimensions down to two dimensions while maintaining nearly 98% of the original variance. We were able to do this by using the concepts of eigenvalues and eigenvectors. To review, we start by setting up the matrix for the data, which has the observations as rows and the variables as columns. The next step is to compute the covariance matrix for the data, which results in an NxN matrix, where N is the number of variables. The next step is to find the eigenvectors and corresponding eigenvalues of the covariance matrix; the eigenvectors make up the principal components. Next, it is important to analyze the eigenvectors and eigenvalues to see how much variability is accounted for by each component and to see which variable contributes the most to each eigenvector. Once it is decided how many dimensions you want your projection to be, the scores, or coordinates, for the new axes need to be obtained. The final step is to plot the data to see what the reduced dimensions look like, and PCA is successfully completed!
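For reference, here is the whole workflow from this report collected into one short R script (a sketch of the steps reviewed above):

# 1. Set up the data matrix: observations as rows, variables as columns,
#    centered so that each column has mean 0.
X <- scale(iris[1:4], center = TRUE, scale = FALSE)

# 2. Compute the covariance matrix and its eigenvectors/eigenvalues.
eig <- eigen(cov(X))

# 3. Check how much variance each component accounts for.
cumsum(eig$values) / sum(eig$values)

# 4. Project the data onto the principal components (the scores).
scores <- X %*% eig$vectors

# 5. Plot the first two dimensions of the reduced data.
plot(scores[, 1], scores[, 2],
     xlab = "First Principal Component",
     ylab = "Second Principal Component",
     col  = c("green", "red", "blue")[iris$Species])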