3. Principal Component Analysis
Rotates a multivariate dataset into a new configuration that is easier to interpret
Reduces dimensionality
Purposes
- simplify data
- look at relationships between variables
- look at patterns of objects (samples)
5. Principal Components Analysis
Y = A'X (1)
where
Y is the matrix of new variables (principal components),
A is the matrix whose columns are the orthonormal eigenvectors of matrix C, and
X is the data matrix
Transformation (1) is possible only after solving the characteristic equation (2):
|C − λI| = 0 (2)
where
C is the variance-covariance matrix of order (k×k),
I is the identity matrix of order (k×k), and
λ is a characteristic root of equation (2), called an eigenvalue
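In practice, equation (2) is not expanded by hand; a numerical eigensolver finds the eigenvalues and orthonormal eigenvectors directly. A minimal NumPy sketch (the toy data matrix is invented for illustration, with rows as samples rather than the column convention of equation (1)):

```python
import numpy as np

# Toy data matrix: rows = samples, columns = k variables (invented values)
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 1.0],
              [4.0, 3.0, 2.0]])

C = np.cov(X, rowvar=False)          # variance-covariance matrix C (k x k)

# Solve |C - lambda*I| = 0: eigh handles symmetric matrices and returns
# the eigenvalues with the orthonormal eigenvectors as columns of A
eigenvalues, A = np.linalg.eigh(C)

# Order by decreasing eigenvalue so the first component carries most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, A = eigenvalues[order], A[:, order]

# Transformation (1), Y = A'X, rewritten for row-wise samples: Y = Xc A
Xc = X - X.mean(axis=0)              # center the variables
Y = Xc @ A                           # new variables (principal components)
```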
7. Principal Components Analysis
From k original variables: x1, x2, ..., xk:
Produce k new variables: y1, y2, ..., yk:
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
such that:
the y's are uncorrelated (orthogonal)
y1 explains as much of the original variance in the data as possible
y2 explains as much of the remaining variance as possible, etc.
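Continuing the NumPy sketch above, both defining properties can be checked directly on the covariance matrix of the new variables:

```python
# Covariance matrix of the new variables Y from the sketch above
S = np.cov(Y, rowvar=False)

# The y's are uncorrelated: all off-diagonal covariances are (numerically) zero
assert np.allclose(S - np.diag(np.diag(S)), 0.0, atol=1e-10)

# Each y's variance equals its eigenvalue, sorted in decreasing order,
# so y1 explains the most variance, y2 the most of what remains, etc.
assert np.allclose(np.diag(S), eigenvalues)
```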
14. Principal Component Analysis: Terminology
jth principal component is defined by the jth eigenvector of the correlation/covariance matrix
coefficients, ajk, are elements of the eigenvectors and relate the original variables (standardized if using the correlation matrix) to the components
scores are the values of the units (samples) on the components, computed from the coefficients
amount of variance accounted for by a component is given by its eigenvalue, λj
proportion of variance accounted for by a component is given by λj / Σλj
loading of the kth original variable on the jth component is ajk√λj, the correlation between the variable and the component
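A short continuation of the same sketch, mapping this terminology to code (note the loading equals the variable-component correlation only when the correlation matrix is used):

```python
# Proportion of variance accounted for: lambda_j / sum(lambda)
proportion = eigenvalues / eigenvalues.sum()

# Loadings: a_jk * sqrt(lambda_j); column j of A scaled by sqrt(lambda_j)
loadings = A * np.sqrt(eigenvalues)

# Scores: values of the units (rows of X) on the components
scores = Xc @ A
```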
15. How Many Components to Use?
If λj < 1, the component explains less variance than a single original variable (Kaiser criterion; applies when using the correlation matrix)
Use 2 components (or 3) for visual ease
Scree diagram: eigenvalue (y axis, 0 to 2.5) plotted against component number (x axis, 1 to 5)
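A matplotlib sketch of such a scree diagram, reusing the eigenvalues from the NumPy sketch above; the dashed line marks the λ = 1 cut-off, which is meaningful for a correlation matrix:

```python
import matplotlib.pyplot as plt

components = np.arange(1, len(eigenvalues) + 1)
plt.plot(components, eigenvalues, "o-")
plt.axhline(1.0, linestyle="--")   # lambda = 1: a component below this line
                                   # explains less than one original variable
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.show()
```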
16. Principal Component Analysis on:
Covariance Matrix:
Variables must be in same units
Emphasizes variables with most variance
Mean eigenvalue ≠ 1.0
Useful in morphometrics, a few other cases
Correlation Matrix:
Variables are standardized (mean 0.0, SD 1.0)
Variables can be in different units
All variables have same impact on analysis
Mean eigenvalue = 1.0
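The two choices differ only in which matrix is handed to the eigensolver, and the mean-eigenvalue facts can be verified directly (continuing the NumPy sketch):

```python
C_cov  = np.cov(X, rowvar=False)        # covariance: variables keep their units
C_corr = np.corrcoef(X, rowvar=False)   # correlation: variables standardized

# The trace of a correlation matrix is k (every diagonal entry is 1.0),
# so its mean eigenvalue is exactly 1.0; the covariance matrix has no such bound
print(np.linalg.eigvalsh(C_corr).mean())   # -> 1.0
print(np.linalg.eigvalsh(C_cov).mean())    # generally != 1.0
```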
17. PCA: Potential Problems
Lack of independence: no problem
Lack of normality: normality desirable but not essential
Lack of precision: precision desirable but not essential
Many zeroes in the data matrix: a problem (use Correspondence Analysis)
18. Procedure for Principal Component Analysis
1. Decide whether to use the correlation or the covariance matrix
2. Find the eigenvectors (components) and eigenvalues (variance accounted for)
3. Decide how many components to use by examining the eigenvalues (perhaps using a scree diagram)
4. Examine the loadings (perhaps with a vector loading plot)
5. Plot the scores
6. Try rotation, then return to step 4
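A minimal end-to-end sketch of steps 1-5 with scikit-learn (assumed available), reusing the toy matrix X from above; step 6 (rotation) is not part of sklearn's PCA and is omitted:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: choosing the correlation matrix == PCA on standardized variables
Xs = StandardScaler().fit_transform(X)

# Step 2: eigenvectors (components_) and eigenvalues (explained_variance_)
pca = PCA().fit(Xs)

# Step 3: keep components with eigenvalue > 1 (a scree diagram also works here)
n = int((pca.explained_variance_ > 1.0).sum())

# Step 4: loadings = eigenvector elements times sqrt(eigenvalue)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Step 5: scores of the samples on the retained components
scores = pca.transform(Xs)[:, :n]
```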
19. Chemical elements and their properties
Columns: Symbol; Group (1 = alkali metals, 2 = alkaline earth metals, 3 = halogens, 4 = noble gases, 5 = transition metals, 6 = heavy metals); Tt = melting point (K); Tf = boiling point (K); d = density (kg/m³); NO = oxidation number; E = electronegativity
Symbol Group Tt Tf d NO E
Li 1 453.69 1615 534 1 0.98
Na 1 371 1156 970 1 0.93
K 1 336.5 1032 860 1 0.82
Rb 1 312.5 961 1530 1 0.82
Cs 1 301.6 944 1870 1 0.79
Be 2 1550 3243 1800 2 1.57
Mg 2 924 1380 1741 2 1.31
Ca 2 1120 1760 1540 2 1
Sr 2 1042 1657 2600 2 0.95
F 3 53.5 85 1.7 -1 3.98
Cl 3 172.1 238.5 3.2 -1 3.16
Br 3 265.9 331.9 3100 -1 2.96
I 3 386.6 457.4 4940 -1 2.66
He 4 0.9 4.2 0.2 0 0
Ne 4 24.5 27.2 0.8 0 0
Ar 4 83.7 87.4 1.7 0 0
Kr 4 116.5 120.8 3.5 0 0
Xe 4 161.2 166 5.5 0 0
Zn 5 692.6 1180 7140 2 1.6
Co 5 1765 3170 8900 3 1.8
Cu 5 1356 2868 8930 2 1.9
Fe 5 1808 3300 7870 2 1.8
Mn 5 1517 2370 7440 2 1.5
Ni 5 1726 3005 8900 2 1.8
Bi 6 544.4 1837 9780 3 2.02
Pb 6 600.61 2022 11340 2 1.8
Tl 6 577 1746 11850 3 1.62
20. Correlation matrix
Correlations (Elemente.sta); marked correlations are significant at p < .05; N = 27 (casewise deletion of missing data)
Variable  Mean    Std.Dev.   Tt      Tf      d       NO      E
Tt        676     593.6      1.000   0.938   0.573   0.705   0.188
Tf        1361.6  1095.1             1.000   0.671   0.811   0.182
d         3838.9  4068.7                     1.000   0.684   0.339
NO        1.1     1.3                                1.000  -0.107
E         1.4     1.0                                        1.000
21. Eigenvalues of correlation matrix
Eigenvalues of correlation matrix and related statistics (Elemente.sta); active variables only
    Eigenvalue  % Total variance  Cumulative eigenvalue  Cumulative %
1   3.241       64.82             3.241                  64.8
2   1.095       21.91             4.336                  86.7
3   0.476       9.52              4.813                  96.3
4   0.145       2.90              4.958                  99.2
5   0.042       0.85              5.000                  100.0
22. Eigenvectors of correlation matrix
Eigenvectors of correlation matrix (Elemente.sta); active variables only
Variable Factor 1 Factor 2 Factor 3 Factor 4 Factor 5
Tt 0.504 -0.037 0.552 -0.335 0.572
Tf 0.534 -0.058 0.313 -0.013 -0.783
d 0.457 0.203 -0.716 -0.487 0.018
NO 0.485 -0.338 -0.272 0.722 0.235
E 0.132 0.916 0.101 0.360 0.056
23. PC1-PC2 loading scatterplot
[Figure: projection of the variables (Tt, Tf, d, NO, E) on the factor plane (1 × 2); x axis: Factor 1, 64.82%; y axis: Factor 2, 21.91%; both axes run from -1.0 to 1.0; Tt and Tf plot nearly on top of each other]
24. PC1-PC2 score scatterplot
[Figure: projection of the cases (the 27 elements, Li through Tl) on the factor plane (1 × 2); x axis: Factor 1, 64.82%, from -4 to 5; y axis: Factor 2, 21.91%, from -2.5 to 4.0; cases with sum of cosine squares >= 0.00]
26. What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
27. K-means clustering
1. Select the number of clusters, K
2. Randomly select K distinct data points as the initial cluster centers
3. Measure the distance from each point to each center and assign the point to the nearest one
4. Recompute each center as the mean of the points assigned to it
5. Repeat steps 3-4 until the assignments no longer change
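One way these steps can be written down; a minimal NumPy sketch with a simple convergence test (the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k distinct data points as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: distance from every point to every center; assign to nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers (and hence the assignments) stabilize
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```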