Principal Component Analysis and Cluster Analysis

PRINCIPAL COMPONENT ANALYSIS
Mohammed Sameer
2021-19-002
Department of Agricultural Statistics
Kerala Agricultural University

Data reduction technique developed by H. Hotelling
• Main Aim
• Lower the dimensions
• Orthogonality of new (transformed) dimensions
(principal components)

If correlated
Why only ellipse?

[Figure: scatter plot of the original data on the original axes X1 and X2]
Shift the origin of the axes to the centre of the data (the mean)

Rotate the original axes
• Rotate X1 (axis 1) by an angle such that the variability of the data along that axis is maximum
• Rotate X2 (axis 2) so that it is perpendicular to the first axis; the variability of the data along that axis is then the second largest

[Figure: scatter plot of the transformed data on the transformed axes Z1 and Z2]

Original axes
[Figure: variability of the data along x1 and along x2]
• The variances of x1 and x2 are both large
• x1 and x2 are correlated

Transformed axes
[Figure: variability of the data along z1 and along z2]
• The variance of z2 is much smaller than the variance of z1
• z1 and z2 are uncorrelated

[Figure: a line rotating through the data; the red dots are the projections of the original data points onto the rotating line]
The spread of the red dots is greatest when the rotating line aligns with the pink marks

Projection of points onto a line: the line is chosen so that
the projected points have the greatest variability.
Projection of points onto a plane: the plane is chosen so that
the spread of the projected points on that plane is the greatest.

Principal Components
• The first principal component is the direction of greatest
variability (variance) in the data
• The second is the next orthogonal (uncorrelated) direction
of greatest variability
• So first remove all the variability along the first
component, then find the next direction of
greatest variability, and so on

Principal Components Analysis (PCA)
Principle
• Linear projection method to reduce the number of variables
• Transform a set of correlated variables into a new set of uncorrelated
variables
• Map the data into a space of lower dimensionality
• Form of unsupervised learning
Properties
• It can be viewed as a rotation of the existing axes to new positions in the
space defined by the original variables
• The new axes are orthogonal and represent the directions of maximum
variability

Computing the components
• First centre the data points
• Project the data points (vectors) onto an axis such that the variability
of the projected points along that axis is greatest
• It turns out that the variance of the data along each transformed axis is an
eigenvalue of cov(x), and the direction of that axis is the corresponding
eigenvector of cov(x)
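
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_eig` and the simulated two-variable data are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def pca_eig(X):
    """PCA by eigen-decomposition of the sample covariance matrix.

    X is an (n_samples, n_variables) array. Returns the eigenvalues
    (largest first), the matching eigenvectors (as columns) and the
    data expressed on the new axes (the scores).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # centre the data points
    cov = np.cov(Xc, rowvar=False)           # sample covariance, cov(x)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # coordinates on Z1, Z2, ...
    return eigvals, eigvecs, scores

# Simulated correlated data (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])

eigvals, eigvecs, scores = pca_eig(X)
print("variances along Z1, Z2:", eigvals)
print("covariance of the scores:\n", np.cov(scores, rowvar=False))
```

The variances of the score columns equal the eigenvalues, and the off-diagonal covariance of the scores is zero up to rounding, which is the "uncorrelated new axes" property described earlier.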

Dimensionality reduction
• Choose only the first p eigenvectors, based on their eigenvalues
• The final data set has only p dimensions
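
One common way to pick p is to keep the smallest number of components whose eigenvalues account for a chosen share of the total variance. The sketch below assumes a 90% threshold; the threshold, the function name `reduce_dimensions` and the simulated data are illustrative assumptions, not something the slides specify.

```python
import numpy as np

def reduce_dimensions(X, threshold=0.90):
    """Keep the first p principal components that together explain at
    least `threshold` of the total variance (threshold is an assumption)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative share
    p = min(int(np.searchsorted(explained, threshold)) + 1, len(eigvals))
    return Xc @ eigvecs[:, :p], explained

# Three simulated variables, two of them almost identical (illustrative)
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=100),
                     0.1 * rng.normal(size=100)])

reduced, explained = reduce_dimensions(X, threshold=0.90)
print("cumulative explained variance:", explained)
print("shape of the reduced data:", reduced.shape)
```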

Bartlett's test of sphericity
• H0: R = I (the correlation matrix R is the identity matrix)
• H1: R ≠ I
In other words
H0: the scatter plot is roughly a sphere centred at the origin (the variables are uncorrelated)
H1: the scatter plot is not a sphere
• If the scatter plot is a sphere, there is no use in applying PCA
• If the scatter plot is not a sphere (it is an ellipse/ellipsoid), then go
for PCA
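
As a rough sketch, the test statistic usually quoted for Bartlett's test is χ² = −[(n − 1) − (2p + 5)/6] · ln|R| with p(p − 1)/2 degrees of freedom, where n is the number of observations and p the number of variables. The function name and the simulated data below are assumptions for illustration; the slides do not give the formula.

```python
import numpy as np

def bartlett_sphericity(X):
    """Return (chi-square statistic, degrees of freedom) for H0: R = I."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)          # sample correlation matrix
    _, logdet = np.linalg.slogdet(R)          # ln|R|, numerically stable
    statistic = -((n - 1) - (2 * p + 5) / 6) * logdet
    df = p * (p - 1) // 2
    return statistic, df

rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
correlated = common + 0.5 * rng.normal(size=(200, 3))   # ellipsoid-like cloud
independent = rng.normal(size=(200, 3))                  # sphere-like cloud

stat_corr, df = bartlett_sphericity(correlated)
stat_indep, _ = bartlett_sphericity(independent)
print("correlated data:  chi2 =", stat_corr, "df =", df)
print("independent data: chi2 =", stat_indep, "df =", df)
# If SciPy is available, a p-value can be obtained with
# scipy.stats.chi2.sf(statistic, df).
```

A large statistic (small p-value) rejects H0, which is the situation in which PCA is worth applying.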

Results
• A principal component analysis of milk production in the state of Tamil Nadu
revealed that milk production was positively related to the indigenous cattle
population, she-buffalo population, number of veterinary institutions, gross
cropped area, area under paddy, area under groundnut, native purebred cattle
population, graded and indigenous buffalo population, agricultural
labour population, crossbred cattle population, number of financial
institutions and graded buffalo population.
• This suggests that shifting the herd structure in favour of crossbred
cows and graded buffaloes can augment the milk production
potential.

CLUSTER ANALYSIS
Introduction
• A cluster is a number of things of the same kind growing or joined
together
• A group of homogeneous things
The principle:
• Objects in the same group are similar to each other
• Objects in different groups are as dissimilar as possible

Cluster Analysis Model
[Figure: objects to be clustered → obtain similarity or dissimilarity → partition → output: Cluster 1, Cluster 2, Cluster 3, ..., Cluster k]

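Distance measures
Euclidean distance: $\mathrm{Euclidean}(A,B) = \sqrt{\sum_{i=1}^{p} (a_i - b_i)^2}$
Manhattan distance: $\mathrm{Manhattan}(A,B) = \sum_{i=1}^{p} |a_i - b_i|$
where A = (a_1, ..., a_p) and B = (b_1, ..., b_p)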
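
A small sketch of the two distance measures in plain Python. The points A and B are made-up values chosen for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

A = (1.0, 2.0)   # illustrative points
B = (4.0, 6.0)

print(euclidean(A, B))   # 5.0
print(manhattan(A, B))   # 7.0
```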

Clustering algorithms
• Hierarchical clustering
• Centroid-based clustering
• Graph-based clustering
• Density-based clustering

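Single (nearest neighbour): distance between two clusters =
distance between the two closest members of the two clusters

Complete (farthest neighbour): distance between two clusters =
distance between the two farthest members of the two clusters
[Figure: nearest distance and farthest distance between two clusters]

Centroid: distance between the multivariate means
of the two clusters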

OTHER JOINING ALGORITHMS
• AVERAGE
• MEDIAN
• WARD
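
The joining rules can be compared on a toy example. The sketch below covers single, complete, average and centroid linkage with Euclidean distance; median and Ward linkage are left out for brevity. The function names and the two small clusters are assumptions made for illustration.

```python
import math
from itertools import product

def single_link(c1, c2):
    """Nearest neighbour: smallest distance between members."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_link(c1, c2):
    """Farthest neighbour: largest distance between members."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

def average_link(c1, c2):
    """Average of all between-cluster pairwise distances."""
    dists = [math.dist(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

def centroid_link(c1, c2):
    """Distance between the multivariate means of the two clusters."""
    mean = lambda c: tuple(sum(col) / len(c) for col in zip(*c))
    return math.dist(mean(c1), mean(c2))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]   # illustrative clusters
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_link(cluster_a, cluster_b))     # 2.0
print(complete_link(cluster_a, cluster_b))   # 5.0
print(average_link(cluster_a, cluster_b))    # 3.5
print(centroid_link(cluster_a, cluster_b))   # 3.5
```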

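objects    1    2    3    4    5
1          0
2          9    0
3          3    7    0
4          6    5    9    0
5         11   10    2    8    0
Smallest distance = 2 (between objects 3 and 5), so objects 3 and 5 are merged into one cluster

objects  5,3    1    2    4
5,3        0
1          3    0
2          7    9    0
4          8    6    5    0
Smallest distance = 3 (between object 1 and cluster 5,3), so objects 1, 3 and 5 are merged into one cluster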

Dendrogram
[Figure: dendrogram with leaves in the order object 5, object 3, object 1, object 2, object 4]
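
The worked example can be reproduced with a short agglomerative loop. The numbers in the second distance matrix are consistent with single (nearest neighbour) linkage, so the sketch assumes that rule; the function name and the dictionary layout are illustrative choices.

```python
def single_linkage(dist, labels):
    """Agglomerative clustering with single linkage.

    dist maps frozenset({a, b}) to the distance between objects a and b.
    Returns the merges in order as (cluster, cluster, merge distance).
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []

    def between(c1, c2):
        # single linkage: distance between the closest pair of members
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        height, i, j = min(
            (between(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), height))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Lower triangle of the 5 x 5 distance matrix from the example
lower = {2: [9], 3: [3, 7], 4: [6, 5, 9], 5: [11, 10, 2, 8]}
dist = {frozenset((i, j)): d
        for i, row in lower.items()
        for j, d in enumerate(row, start=1)}

merges = single_linkage(dist, labels=[1, 2, 3, 4, 5])
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
# [3] + [5] at distance 2
# [1] + [3, 5] at distance 3
# [2] + [4] at distance 5
# [1, 3, 5] + [2, 4] at distance 6
```

The first two merges match the two steps shown in the tables, and the remaining merges give the leaf order 5, 3, 1, 2, 4 of the dendrogram.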

Graph-based clustering
• HCS (Highly Connected Subgraphs) clustering algorithm
• Points which are highly connected are clustered
[Figure: similarity graph]

Density-based clustering
The algorithm works by sliding windows toward regions of high point density
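
The sliding-window description resembles the mean shift procedure, although the slides do not name a specific algorithm, so that reading is an assumption. The one-dimensional sketch below uses a flat window; the bandwidth, the function name and the data are illustrative.

```python
def mean_shift_1d(points, bandwidth, n_iter=50):
    """Slide a window from every point toward the local mean until it stops."""
    modes = []
    for p in points:
        centre = float(p)
        for _ in range(n_iter):
            window = [q for q in points if abs(q - centre) <= bandwidth]
            new_centre = sum(window) / len(window)   # mean of points in window
            if abs(new_centre - centre) < 1e-9:      # window stopped moving
                break
            centre = new_centre
        modes.append(centre)

    # Points whose windows ended up (almost) at the same place share a cluster
    labels, centres = [], []
    for m in modes:
        for k, c in enumerate(centres):
            if abs(m - c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres

points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]   # two dense groups (illustrative)
labels, centres = mean_shift_1d(points, bandwidth=1.0)
print(labels)    # [0, 0, 0, 1, 1, 1]
print(centres)   # approximately [1.0, 5.0]
```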

How many clusters to retain?
At what stage should the algorithm be stopped?
Scree plot
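
One simple way to read a scree plot of merge distances is to stop just before the largest jump. This is only one heuristic among several, and the sketch below is an assumption about how the plot might be used; the merge distances 2, 3, 5 and 6 are those obtained by applying single linkage to the five-object distance matrix of the worked example.

```python
def clusters_before_largest_jump(heights, n_objects):
    """Number of clusters left if merging stops before the largest
    increase in merge distance (an elbow-style heuristic)."""
    jumps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    merges_kept = jumps.index(max(jumps)) + 1   # merges before the big jump
    return n_objects - merges_kept

heights = [2, 3, 5, 6]   # single-linkage merge distances for five objects
print(clusters_before_largest_jump(heights, n_objects=5))   # 3
```

Here the largest jump is from 3 to 5, so the algorithm would stop after two merges, leaving the clusters {1, 3, 5}, {2} and {4}.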

• The cluster analysis was carried out on the area, production and
productivity of different agricultural and horticultural crops
predominantly grown in the districts of Rajasthan
• The clusters were calculated independently for two periods, 1980-1995 and
1996-2014

• Crop cluster based on area during 1980-1995
• Crop cluster based on area during 1996-2014
• Crop cluster based on production during 1980-1995
• Crop cluster based on production during 1996-2014
• Crop cluster based on productivity during 1980-1995
• Crop cluster based on productivity during 1996-2014

Conclusions
• Comparing the crop clusters based on area between the two periods
showed that gram and cotton had shifted clusters in the
second period of the study.
• Comparing the crop clusters based on production between the two
periods showed that gram, mustard & rapeseed, and cotton
had shifted over the period.
• That is, these crops formed a cluster in the first period but not in
the second, whereas wheat and bajra formed clusters (were similar
in production) across all the districts of Rajasthan in both periods.

• The study also concluded that horticultural crops were
similar in productivity across all the districts of Rajasthan during
both periods.
• Coriander, garlic and pea productivity were included in the
second period of the study. Only wheat and bajra were
similar in productivity across all the districts of
Rajasthan from the first period to the second.
