2. Concept
Principal component analysis (PCA) projects the features onto the principal
components.
The motivation is to reduce the dimensionality of the features while losing only
a small amount of information.
3. Procedure:
The first principal component is the normalized linear combination of the
variables that has the highest variance.
The second principal component has the largest variance subject to being
uncorrelated with the first (equivalently, its direction is orthogonal to the
first one).
And so on, as formalized below.
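In symbols (a standard formulation; here X is the data matrix and S its sample covariance matrix):
$$w_1 = \arg\max_{\|w\|=1} \operatorname{Var}(Xw) = \arg\max_{\|w\|=1} w^\top S w$$
The maximizer is the eigenvector of S with the largest eigenvalue, and each later component solves the same problem restricted to directions orthogonal to all earlier ones.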
4. Why we choose the direction with the most variation
• Reason 1: In signal analysis, the signal is assumed to have larger variance
and the noise smaller variance, so the high-variance directions are the ones
that carry the signal.
• Reason 2: When the data are projected onto the green line (var = 0.6524),
they remain well separated, whereas projecting onto the purple line
(var = 0.1678) mixes them together.
So choosing the principal-component direction with the most variation in the
data is our goal.
5. Example
DATA  p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
x     2.5  0.5  2.2  1.9  3.1  2.3  2.0  1.0  1.5  1.1
y     2.4  0.7  2.9  2.2  3.0  2.7  1.6  1.1  1.6  0.9
We have 10 points (p1~p10) in two dimensions, as listed in the table above.
We want to use PCA to reduce the dimensionality from 2 to 1.
6. First step: zero-centering
Reason: We want to move the center of the data to the origin, which makes the
later calculations cleaner because no bias (mean) term has to be carried along.
See the sketch after the table below.
DATA  p1    p2     p3    p4    p5    p6    p7     p8     p9     p10
x     0.69  -1.31  0.39  0.09  1.29  0.49  0.19   -0.81  -0.31  -0.71
y     0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01
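As a concrete check, here is a minimal NumPy sketch of this step (the array names are my own); the printed values should reproduce the table above:

import numpy as np

# The 10 example points (p1~p10), one row per point.
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Zero-centering: subtract the per-feature mean (x mean = 1.81, y mean = 1.91)
# so that the center of the data moves to the origin.
X_centered = X - X.mean(axis=0)
print(X_centered)  # first row: [0.69, 0.49], matching p1 in the table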
7. Second step: calculate the covariance matrix
Reason: In probability theory and statistics, covariance is a measure of the
joint variability of two random variables.
The sign of the covariance therefore shows the tendency in the linear
relationship between the variables. Variables whose covariance is zero are
called uncorrelated.
In our case:
$$\mathrm{cov} = \begin{pmatrix} 0.616556 & 0.615444 \\ 0.615444 & 0.716556 \end{pmatrix}$$
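A minimal sketch of this step in NumPy (np.cov with rowvar=False treats rows as observations and uses the n − 1 divisor, which is what reproduces the matrix above):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_centered = X - X.mean(axis=0)

# Sample covariance: (X_c^T X_c) / (n - 1); np.cov computes the same thing.
cov = np.cov(X_centered, rowvar=False)
print(cov)  # [[0.61655556 0.61544444]
            #  [0.61544444 0.71655556]]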
8. Covariance matrix
The covariance matrix defines the shape of the data: diagonal spread is
captured by the covariance (the off-diagonal entries), while axis-aligned
spread is captured by the variance (the diagonal entries). The sketch below
illustrates this.
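To see this with synthetic data (a hedged sketch; the numbers here are made up purely for the demonstration): an axis-aligned cloud has a diagonal covariance matrix, and rotating it moves the spread into the off-diagonal entries.

import numpy as np

rng = np.random.default_rng(0)

# Axis-aligned cloud: large spread along x, small along y.
axis_aligned = rng.normal(size=(10_000, 2)) * np.array([2.0, 0.5])

# Rotate the cloud by 45 degrees: the spread becomes diagonal in the plane,
# and the covariance matrix picks up large off-diagonal entries.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = axis_aligned @ R.T

print(np.cov(axis_aligned, rowvar=False))  # ~[[4, 0], [0, 0.25]]: variance only
print(np.cov(rotated, rowvar=False))       # ~[[2.1, 1.9], [1.9, 2.1]]: covariance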
9. Third step: calculate the eigenvalues and eigenvectors of the covariance
matrix
Reason: We want to find the directions of the principal components.
$$\text{eigenvalues} = (0.049,\ 1.284), \qquad \text{eigenvectors} = \begin{pmatrix} -0.735 & 0.678 \\ 0.678 & 0.735 \end{pmatrix}$$
(the columns are the eigenvectors, listed in the same order as the eigenvalues)
Sort the eigenvalues from largest to smallest, and rearrange the eigenvectors
(the columns) in the same order:
$$\text{eigenvalues} = (1.284,\ 0.049), \qquad \text{eigenvectors} = \begin{pmatrix} 0.678 & -0.735 \\ 0.735 & 0.678 \end{pmatrix}$$
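A minimal NumPy sketch of this step (np.linalg.eigh is the eigendecomposition for symmetric matrices and returns eigenvalues in ascending order, so we re-sort them descending as above; eigenvector signs may come out flipped, which does not change the directions):

import numpy as np

cov = np.array([[0.616556, 0.615444],
                [0.615444, 0.716556]])

# Eigendecomposition of the symmetric covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort eigenvalues from largest to smallest and reorder the
# eigenvectors (columns) to match.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)   # ~[1.284, 0.049]
print(eigenvectors)  # first column ~[0.678, 0.735] (up to sign)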
The direction in which the data vary the most falls along the green line. This is the direction with the most variation in the data, which is why it is the first principal component (direction).
Equivalently, among all lines through the data center, this line makes the sum of squared perpendicular distances from the points the smallest possible.
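Both claims can be checked numerically (a sketch under the same setup as above; w1 is the first eigenvector, rounded): the variance of the 1-D projection is the top eigenvalue, and the residual sum of squares equals (n − 1) times the smaller eigenvalue.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
Xc = X - X.mean(axis=0)

w1 = np.array([0.678, 0.735])      # first principal direction (unit length)
scores = Xc @ w1                   # 1-D coordinates along the first PC

print(scores.var(ddof=1))          # ~1.284: the largest eigenvalue

# Perpendicular residuals to the PC line; their sum of squares is minimal
# and equals (n - 1) * lambda_2 ~ 9 * 0.049 ~ 0.44.
residuals = Xc - np.outer(scores, w1)
print((residuals ** 2).sum())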
The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
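For the example covariance matrix above, this works out to
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{0.615444}{\sqrt{0.616556 \times 0.716556}} \approx 0.926,$$
i.e., x and y are strongly positively linearly related.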