Principal component analysis (PCA) is a technique used to reduce the dimensionality of data by transforming correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA involves computing the covariance matrix of the data and then determining the eigenvectors with the highest eigenvalues, which become the principal components.
0.1 Prologue
Figure 1: Rotation of Axis
As shown in Figure 1, when the axes are rotated by an angle α, a point with coordinates (x, y) gets new coordinates (x′, y′):

$$x' = x \cos\alpha + y \sin\alpha$$
$$y' = -x \sin\alpha + y \cos\alpha$$
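As a quick numerical check of these rotation formulas, here is a minimal sketch; the point and angle below are arbitrary illustrative values, not taken from the notes.

```python
import numpy as np

# Arbitrary illustrative point and rotation angle
x, y = 2.0, 1.0
alpha = np.deg2rad(30)

# Coordinates of the same point in the rotated axis system
x_new = x * np.cos(alpha) + y * np.sin(alpha)
y_new = -x * np.sin(alpha) + y * np.cos(alpha)

print(x_new, y_new)                         # rotated coordinates
print(x**2 + y**2, x_new**2 + y_new**2)     # squared length is unchanged by the rotation
```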
0.2 Introduction
Principal component analysis is a data reduction technique developed by Hotelling in 1933. It reduces the number of dimensions of the data; the new dimensions are orthogonal to each other and are called the principal components.
0.3 Mathematics of PCA
Let us take a two-dimensional matrix: data with 2 variables and n observations. Say the scatter plot of the data looks like what is shown in Figure 2.

$$X_{n \times 2} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{bmatrix}$$
Figure 2: Principal Components in 2-D
Let us compute the covariance matrix of X and denote it as:

$$\mathrm{Cov}(X_1, X_2) = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{bmatrix}$$

It is a 2 × 2 matrix. The scatter plot shows that there is a relationship between X1 and X2, so if we calculate the correlation matrix we will get a non-zero correlation coefficient:

$$\mathrm{Corr}(X_1, X_2) = \begin{bmatrix} 1 & r_{12} \\ r_{21} & 1 \end{bmatrix}$$

We know that −1 ≤ r12 = r21 ≤ 1; say r12 ≈ 0.9. The variance of X1 is s11 and the variance of X2 is s22; these are sample variances.
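As a minimal sketch of these quantities with simulated data (no data set is given in the notes; all names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations of two correlated variables X1, X2 (illustrative only)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])            # n x 2 data matrix

S = np.cov(X, rowvar=False)              # 2 x 2 sample covariance matrix [s11 s12; s21 s22]
R = np.corrcoef(X, rowvar=False)         # 2 x 2 correlation matrix [1 r12; r21 1]

print(S)
print(R)                                 # off-diagonal entry r12 is large (around 0.9 or more)
```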
Figure 3: Original Data

The corresponding population covariance matrix is denoted as:

$$\text{Population } \mathrm{Cov}(X) = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}$$

The population variance of X1 is σ11 and that of X2 is σ22.
As shown in Figure 3, the variabilities of X1 and X2 (s11 and s22 respectively) are not the same, but both are high.
Now, we rotate the axes X1 and X2 by an angle θ and call the new axes Z1 and Z2. Let us look at the variability along the new axes Z1 and Z2: we draw the data along the new axes and call this the transformed data. Comparing the two diagrams, we see that Variability(Z1) > Variability(X1) and, even more markedly, Variability(Z2) < Variability(X2).

Figure 4: Transformation
And, in the transformed axes, as shown in Figure 5:

• V(Z1) ≫ V(Z2)   (1)
Since the variability along Z2 is much less than that along Z1, the information content along the Z2 dimension is very small. In fact, we can ignore that information and just capture the information along Z1. In the statistical sense, information is nothing but variability (variance). Z1 alone is enough to give the information which was scattered over both X1 and X2. We can, in fact, ignore the dimension Z2, which leaves us with a smaller number of dimensions; that is why PCA is essentially a dimensionality reduction technique (a numerical sketch of this variance comparison follows the list). The dimension reduction can be done for a p-dimensional matrix as well.

Figure 5: Transformed Data
• Orthogonal dimensions. The major and minor axes of the ellipse (of the scatter plot) are parallel to the (transformed) coordinate axes, which was not the case earlier in the X1 & X2 axis system. When the major and minor axes of the ellipse are not parallel to the coordinate axes, it shows a dependency between the axes X1 & X2. That is not the case with the transformed axes Z1 & Z2, which shows that they are independent, that they are orthogonal. Orthogonality is preserved.
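The variance comparison above can be checked numerically. This is a minimal sketch, assuming θ is chosen along the major axis of the point cloud (here taken from the leading eigenvector of the sample covariance matrix); the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated two-dimensional data (illustrative)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Choose theta along the direction of maximum variance (leading eigenvector of S)
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
v1 = eigvecs[:, -1]                             # leading eigenvector
theta = np.arctan2(v1[1], v1[0])

# Rotate the data: each row is transformed as z = A^T x, i.e. Z = X A
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z = X @ A

print(X.var(axis=0, ddof=1))   # variances along X1 and X2: both sizeable
print(Z.var(axis=0, ddof=1))   # Var(Z1) is larger, Var(Z2) is much smaller
```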
So, we are trying to develop a mathematical formulation in which a correlated data structure is transformed into an uncorrelated data structure (the axes of the ellipse are parallel to the transformed coordinate axes) in a reduced dimension.
A reduction from p dimensions to a smaller number of dimensions is what PCA does. By p dimensions we mean an n × p matrix: p denotes the number of variables, the number of columns of data, and n denotes the number of observations, the number of rows of data:
$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1p} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \dots & x_{np} \end{bmatrix} = \begin{bmatrix} X_1^T \\ X_2^T \\ \vdots \\ X_p^T \end{bmatrix}$$
We transform the data from the X space to the Z space: $X_{n \times p} \to Z_{n \times q}$, where q < p. PCA is a dimensionality reduction technique: it transforms the original data matrix into a reduced components matrix and it preserves the orthogonality of the components. Suppose we want to build a prediction model using multivariate regression. The Xs are the independent variables (IVs). If these IVs are correlated, that leads to the problem of multicollinearity: the linear model will not work, as the determinant of the matrix $X^T X$ will be (close to) zero. One solution to this problem is ridge regression. But if we can make the variables independent by a transformation, then they will be truly independent and linear regression can be applied. This is one of the advantages of PCA.
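A small sketch of the multicollinearity point, with simulated data (all names and values are illustrative): when two predictors are nearly collinear, $X^T X$ is nearly singular, whereas the PCA scores are uncorrelated and the low-variance component can simply be dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Two nearly collinear predictors (illustrative)
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                       # centred predictors

print(np.linalg.det(Xc.T @ Xc))               # ~0: X'X is nearly singular (multicollinearity)

# Principal-component scores are uncorrelated
S = np.cov(Xc, rowvar=False)
eigvals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
A = vecs[:, ::-1]                             # first column = direction of largest variance
Z = Xc @ A

print(np.round(np.cov(Z, rowvar=False), 12))  # (near-)diagonal: the scores are uncorrelated
# Dropping the tiny-variance second score leaves a single, well-conditioned regressor for OLS.
```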
What is Principal?
What is the method? We will go by the same two-variable data, with the scatter plot of the data in the shape of an ellipse. Let us consider a point at random from all the data points, say M with coordinates (x1, x2). The coordinates of the point M on the transformed axes will be (assuming that Z1 is at an angle θ counter-clockwise from X1):

$$z_1 = x_1 \cos\theta + x_2 \sin\theta \quad (2)$$
$$z_2 = -x_1 \sin\theta + x_2 \cos\theta \quad (3)$$

which in matrix form can be written as:

$$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad (4)$$
or

$$Z = A^T X \quad (5)$$

where

$$Z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \quad (6)$$

$$A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \quad (7)$$

and

$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad (8)$$
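A minimal numerical check of equations (2)–(5); the angle θ and the point (x1, x2) below are arbitrary illustrative values:

```python
import numpy as np

theta = np.deg2rad(25)                            # arbitrary illustrative angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # A = [a1 a2], as in equation (7)

x = np.array([1.5, 0.7])                          # arbitrary point (x1, x2)

z = A.T @ x                                       # Z = A^T X, equation (5)
print(z)

# The same result written out component-wise, equations (2) and (3)
z1 = x[0] * np.cos(theta) + x[1] * np.sin(theta)
z2 = -x[0] * np.sin(theta) + x[1] * np.cos(theta)
print(z1, z2)
```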
Now, if instead of p = 2 we have a general p, we get:

$$Z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_p \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1p} \\ a_{21} & a_{22} & \dots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \dots & a_{pp} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} \quad (9)$$
In terms of dimensions:

$$Z_{p \times 1} = [A_{p \times p}]^T \times X_{p \times 1} \quad (10)$$
We say that it is an orthogonal transformation. Let us now see how this orthogonality is maintained, using the two-dimensional case.
$$A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \quad (11)$$

where

$$a_1 = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \quad (12)$$

and

$$a_2 = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} \quad (13)$$
Now,

$$a_1^T a_1 = \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} = \cos^2\theta + \sin^2\theta = 1 \quad (14)$$

Similarly,

$$a_2^T a_2 = 1 \quad (15)$$
Now,

$$A^T A = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \quad (16)$$

$$\implies A^T A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = A A^T = A^{-1} A \quad (17)$$

$$\implies A^{-1} A = I \quad (18)$$
which means that A is an orthogonal matrix: $A^T A = I$, i.e., the off-diagonal elements are 0 and the diagonal elements are 1. This shows that the transformation we are applying is an orthogonal transformation; A is an orthogonal transformation matrix.
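A numerical check that A is orthogonal, as a small sketch with an arbitrary θ:

```python
import numpy as np

theta = np.deg2rad(40)                        # arbitrary angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

a1, a2 = A[:, 0], A[:, 1]
print(a1 @ a1, a2 @ a2)                       # both 1: unit-length columns, eqs. (14)-(15)
print(a1 @ a2)                                # 0: the columns are orthogonal
print(np.round(A.T @ A, 10))                  # identity matrix, eqs. (16)-(18)
print(np.allclose(np.linalg.inv(A), A.T))     # True: A^{-1} = A^T
```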
In the case of p variables:

$$z_1 = a_1^T x = a_{11} X_1 + a_{12} X_2 + \dots + a_{1p} X_p \quad (19)$$
$$z_2 = a_2^T x = a_{21} X_1 + a_{22} X_2 + \dots + a_{2p} X_p \quad (20)$$
$$\vdots$$
$$z_j = a_j^T x = a_{j1} X_1 + a_{j2} X_2 + \dots + a_{jp} X_p \quad (22)$$
$$\vdots$$
$$z_p = a_p^T x = a_{p1} X_1 + a_{p2} X_2 + \dots + a_{pp} X_p \quad (24)$$

where

$$a_j^T a_j = 1, \quad j = 1, 2, \dots, p \quad (25)$$

and

$$\mathrm{var}(z_1) \geq \mathrm{var}(z_2) \geq \dots \geq \mathrm{var}(z_p) \quad (26)$$
We want the first principal component to explain the maximum possible variance. The second principal component explains the next maximum variance (after taking into account the variance already explained by the first principal component), and so on and so forth.
Now, the next step is how to extract the principal components given the data matrix X.
The j-th principal component, $Z_j$, is equal to $a_j^T x$. So,

$$\mathrm{Var}(Z_j) = \mathrm{Var}(a_j^T x) = a_j^T \mathrm{Var}(x) \, a_j \quad (27)$$

as $a_j$ is a constant vector.
$$\mathrm{Var}(X) = \mathrm{Cov}(X) = \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \dots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp} \end{bmatrix} \quad (28)$$

as X is multivariate data.
$$\implies \mathrm{Var}(Z_j) = a_j^T \Sigma \, a_j \quad (29)$$
Now,

$$E[Z_j] = E[a_j^T x] = a_j^T E[x] = a_j^T \mu \quad (30)$$
$$a_j^T x \sim (a_j^T \mu,\; a_j^T \Sigma \, a_j) \quad (31)$$

If it is normally distributed, then

$$a_j^T x \sim N(a_j^T \mu,\; a_j^T \Sigma \, a_j) \quad (32)$$
Σ and µ are population parameters. If they are known, then it is population principal component analysis. But they are usually unknown. In such a case $\bar{X}$, the sample mean, is used as the best estimate of µ, and S, the sample covariance matrix, is used as the estimate of Σ. In that case the PCA is known as sample principal component analysis. Since population parameters are rarely known, we will be going ahead with sample PCA. In sample PCA:

$$E[z_j] = E[a_j^T X] = a_j^T E[X] = a_j^T \bar{X} \quad (33)$$

$$\mathrm{Var}(z_j) = \mathrm{Var}(a_j^T X) = a_j^T \mathrm{Cov}(X) \, a_j = a_j^T S \, a_j \quad (34)$$
If it is normally distributed, then

$$a_j^T X \sim N(a_j^T \bar{X},\; a_j^T S \, a_j) \quad (35)$$
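A minimal sketch of these sample estimates (simulated data; the projection vector a below is an arbitrary unit vector chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated n x p data, p = 3 (illustrative)
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.4],
                                          [0.0, 0.0, 1.0]])

xbar = X.mean(axis=0)                        # sample mean, the estimate of mu
S = np.cov(X, rowvar=False)                  # sample covariance matrix, the estimate of Sigma

a = np.array([0.6, 0.8, 0.0])                # arbitrary unit vector (a'a = 1)
print(a @ xbar)                              # estimated mean of a'X, as in eq. (33)
print(a @ S @ a)                             # estimated variance of a'X, as in eq. (34)
print(np.var(X @ a, ddof=1))                 # direct sample variance of the projection: same value
```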
Now, we will look at the principles followed to extract the PCs. Principles of Principals:

• Each PC is a linear combination of X, a p × 1 variable vector, i.e., $a_j^T X$.
• The first PC is $a_1^T X$, subject to $a_1^T a_1 = 1$, that maximizes $\mathrm{Var}(a_1^T X)$.
• The second PC is $a_2^T X$ that maximizes $\mathrm{Var}(a_2^T X)$, subject to $a_2^T a_2 = 1$ and $\mathrm{Cov}(a_1^T X, a_2^T X) = 0$.
• The j-th PC is $a_j^T X$ that maximizes $\mathrm{Var}(a_j^T X)$, subject to $a_j^T a_j = 1$ and $\mathrm{Cov}(a_j^T X, a_k^T X) = 0$ for k < j.
Var(Z1) is the maximum variance in the data, with $a_1^T a_1 = 1$. Var(Z2) is the maximum variance in the data after Var(Z1) has been removed, with $a_2^T a_2 = 1$ and $\mathrm{Cov}(a_1^T X, a_2^T X) = 0$; since the components are orthogonal, this covariance is zero. Var(Z3) is the maximum variance in the data after Var(Z1) and Var(Z2) have been removed, with $a_3^T a_3 = 1$ and $\mathrm{Cov}(a_1^T X, a_3^T X) = \mathrm{Cov}(a_2^T X, a_3^T X) = 0$. And so on and so forth.
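These defining properties can be verified numerically. This is a sketch assuming the coefficient vectors a_j are taken as the eigenvectors of the sample covariance matrix (which is where the derivation below leads); the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated three-variable data (illustrative)
B = np.array([[1.0, 0.8, 0.3],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ B

S = np.cov(X, rowvar=False)
eigvals, vecs = np.linalg.eigh(S)
A = vecs[:, ::-1]                             # columns a1, a2, a3 in order of decreasing eigenvalue

Z = (X - X.mean(axis=0)) @ A                  # principal component scores z_j = a_j' x

print(np.round(A.T @ A, 10))                  # identity: a_j'a_j = 1 and a_j'a_k = 0
print(np.round(np.cov(Z, rowvar=False), 6))   # diagonal: Cov(z_j, z_k) = 0 for j != k
print(Z.var(axis=0, ddof=1))                  # decreasing: Var(z1) >= Var(z2) >= Var(z3)
```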
Now, the optimization problem is: maximize $\mathrm{Var}(Z_j) = a_j^T S a_j$ subject to $a_j^T a_j = 1$, which can be written as $a_j^T a_j - 1 = 0$.
Using the Lagrangian:

$$\max\ L = a_j^T S a_j - \lambda (a_j^T a_j - 1) \quad (36)$$

where λ is the Lagrange multiplier. We need to find the value of $a_j$ such that L is maximized. Using the first-order condition,

$$\frac{\partial L}{\partial a_j} = 0 \quad (37)$$
$$\implies 2 S a_j - 2 \lambda a_j = 0 \quad (38)$$
$$\implies (S - \lambda I) a_j = 0 \quad (39)$$
S is a matrix and λ is a scalar, hence the identity matrix I: S is a p × p matrix as there are p variables, λ is a scalar, and I is the p × p identity matrix. $|S - \lambda I| = 0$ is the characteristic equation. Since S is a p × p matrix, the characteristic equation has p roots, $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$. The λs are the eigenvalues, and the corresponding eigenvectors give the coefficient vectors $a_j$ of the corresponding principal components. Each eigenvalue is the variance along the corresponding dimension.
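Putting the derivation together in a short from-scratch sketch (simulated data; in practice one would typically rely on a library routine such as sklearn.decomposition.PCA): the eigenvectors of S give the coefficient vectors a_j, and each eigenvalue is the variance along the corresponding principal component.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated n x p data with correlated columns (illustrative)
n, p = 300, 4
B = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ B

S = np.cov(X, rowvar=False)                   # p x p sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)          # solves (S - lambda I) a = 0
order = np.argsort(eigvals)[::-1]             # sort lambda_1 >= lambda_2 >= ... >= lambda_p
eigvals, A = eigvals[order], eigvecs[:, order]

Z = (X - X.mean(axis=0)) @ A                  # principal component scores

print(eigvals)                                # the eigenvalues
print(Z.var(axis=0, ddof=1))                  # match: each eigenvalue is the variance along its PC
print(eigvals.sum(), np.trace(S))             # total variance is preserved
```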