Principal Component Analysis (PCA) and LDA PPT Slides - AbhishekKumar4995
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are machine learning (ML) techniques used for dimension reduction, feature extraction, and analyzing huge amounts of data. They are explained easily and interactively with scatter plot graphs and 2D and 3D projections of the principal components (PCs) for better understanding.
Principal Component Analysis, or PCA, is a statistical method that allows you to summarize the information contained in large data tables by means of a smaller set of "summary indices" that can be more easily visualized and analyzed.
Naive Bayes is a kind of classifier that uses Bayes' theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class.
This presentation was prepared as part of the curriculum studies for CSCI-659 Topics in Artificial Intelligence Course - Machine Learning in Computational Linguistics.
It was prepared under guidance of Prof. Sandra Kubler.
This presentation gives an introduction to Bayesian networks and basic probability theory: a graphical explanation of Bayes' theorem, random variables, and conditional and joint probability, with applications such as spam classification, medical diagnosis, and fault prediction. The main software packages for Bayesian networks are also presented.
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio - Marina Santini
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine learning, pruning, rules, surprisal
Principal Component Analysis and Clustering - Usha Vijay
Identifying the borrower segments from the given bank data set, which has 27,000 rows and 77 variables, using PROC PRINCOMP. With that many variables, it is important to reduce the data set to a smaller set of variables to derive a feasible conclusion. Because of multicollinearity, two or more variables can share the same plane in these dimensions. Each row of the data can be envisioned as a point in a 77-dimensional space, and when we project the data onto orthonormal axes, certain characteristics of the data are expected to cluster together as principal components. To identify these principal components, PROC PRINCOMP is executed with all the variables except the constant variables (recoveries and collection fees), and we derive a plot of the eigenvalues of all the principal components.
Exploratory data analysis in R - Data Science Club - Martin Bago
How do you analyse a new dataset in R? Which libraries and commands should you use? How can you understand your dataset in a few minutes? Read my presentation for the Data Science Club by Exponea and find out!
Dimensionality Reduction and feature extraction.pptx - Sivam Chinna
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
The aim of this report is to use eigenvectors, eigenvalues, and orthogonality to understand the concept of Principal Component Analysis (PCA) and to show why PCA is useful.
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, typically operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Given a classification problem, how do we choose the right features?
3. PCA is a method for reducing the dimensionality of data.
It can be thought of as a projection method where data with m columns (features) is projected into a subspace with m or fewer columns, while retaining the essence of the original data.
[Diagram: the n × m data matrix X is mapped by PCA to an n × k matrix, k ≤ m]
Introduction to PCA
4. In this presentation, we will discover the PCA method for dimensionality reduction and how to implement it from scratch in Python.
Before going deep into PCA, let us understand some key points of PCA.
5. Variance
The variance of each variable is the average squared deviation of its n values around the mean of that variable. It can also be thought of as the spread of the data points.
Geometric Rationale of PCA
6. Covariance
The degree to which two variables are linearly correlated is represented by their covariance. The covariance of variables i and j is the sum, over all n objects, of the products of each variable's deviation from its mean:

Sij = Σm (Xim - X̄i)(Xjm - X̄j) / (n - 1)

where Xim is the value of variable i in object m and X̄i is the mean of variable i (and likewise for j).
Geometric Rationale of PCA
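As a concrete illustration of these two definitions (a minimal numpy sketch, not part of the original deck; the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)             # variable i
y = 0.5 * x + rng.normal(size=100)   # variable j, correlated with x

# Variance: average squared deviation around the mean (n - 1 in the
# denominator). Covariance: average product of paired deviations.
var_x = np.sum((x - x.mean()) ** 2) / (len(x) - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(np.isclose(var_x, np.var(x, ddof=1)))    # True
print(np.isclose(cov_xy, np.cov(x, y)[0, 1]))  # True
```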
7. Objective of PCA
The objective of PCA is to rigidly rotate the axes of this m-dimensional space to new positions (principal axes).
The principal axes are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis m has the lowest variance.
8. Implement PCA in Python (from scratch)
Load the dataset:
We can use the Boston Housing dataset for PCA. The Boston dataset has 13 features, so the question here is how to visualize the data. We can reduce the dimensionality of the data using PCA and then visualize it.
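The notebook's loading code is not reproduced on the slide; below is a minimal sketch, assuming a recent scikit-learn where the old load_boston helper has been removed, so the same data is fetched from OpenML instead:

```python
from sklearn.datasets import fetch_openml

# Fetch the Boston Housing data (506 rows, 13 features) from OpenML.
boston = fetch_openml(name="boston", version=1, as_frame=False)
X = boston.data.astype(float)
print(X.shape)  # (506, 13)
```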
9. Standardize the data:
PCA is strongly affected by scale, and different features might have different scales, so it is better to standardize the data before finding the PCA components. Sklearn's StandardScaler scales data to zero mean and unit variance.
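A sketch of this step, continuing from the X loaded above:

```python
from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance so that
# features on large scales do not dominate the principal components.
X_std = StandardScaler().fit_transform(X)
```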
10. The Algebra of PCA
Calculating PCA involves the following steps:
a. Calculating the covariance matrix.
b. Calculating the eigenvalues and eigenvectors.
c. Forming the principal components.
d. Projecting into the new feature space.
11. Calculating the covariance matrix (S):
The covariance matrix is a matrix of the variances and covariances (or correlations) among every pair of the m variables.
It is a square, symmetric matrix.
For standardized data, the covariance matrix is S = X.T @ X / (n - 1), which we can compute using numpy's matmul() function in Python.
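Continuing the sketch from above (X_std is the standardized data):

```python
import numpy as np

n = X_std.shape[0]
# Covariance matrix of the standardized data: m x m, symmetric.
S = np.matmul(X_std.T, X_std) / (n - 1)

# Sanity check against numpy's built-in estimator.
print(np.allclose(S, np.cov(X_std, rowvar=False)))  # True
```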
12. Calculating the eigenvalues and eigenvectors:
λ is an eigenvalue of the covariance matrix S if it is a solution of the characteristic equation:
det(λI - S) = 0
where I is the identity matrix of the same dimension as S.
The sum of all m eigenvalues equals the trace of S (the sum of the variances of the original variables).
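A quick numerical check of the trace property (a sketch using numpy's symmetric eigensolver; the deck itself computes the eigenpairs with SciPy on the next slide):

```python
import numpy as np

# Eigenvalues of the symmetric covariance matrix S from above.
eigenvalues = np.linalg.eigvalsh(S)

# Their sum equals the trace of S, i.e. the total variance of the
# standardized variables.
print(np.isclose(eigenvalues.sum(), np.trace(S)))  # True
```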
13. For each eigenvalue λ, a corresponding eigenvector v can be found by solving:
(λI - S)v = 0
The eigenvalues λ1, λ2, ..., λm are the variances of the coordinates on each principal component axis.
Calculating the eigenvalues and eigenvectors:
14. We use scipy.linalg, which has an eigh function for finding the top eigenvalues and eigenvectors; here we find the top 2 eigenvalues and eigenvectors as follows.
Code for finding the eigenvalues and eigenvectors:
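The original code screenshot is not reproduced here; a minimal equivalent sketch, assuming SciPy >= 1.5 for the subset_by_index parameter:

```python
from scipy.linalg import eigh

m = S.shape[0]
# eigh returns eigenvalues in ascending order, so the last two
# indices select the top-2 eigenpairs of the symmetric matrix S.
values, vectors = eigh(S, subset_by_index=[m - 2, m - 1])
values, vectors = values[::-1], vectors[:, ::-1]  # largest first
print(values)         # top-2 eigenvalues
print(vectors.shape)  # (13, 2)
```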
15. Forming the principal components:
Below is code for forming the principal components by multiplying the data matrix with the two principal eigenvectors.
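A sketch of that step, continuing with the vectors found above:

```python
# Project the standardized data onto the top-2 principal axes:
# (n x m) @ (m x 2) -> (n x 2) matrix of principal component scores.
pcs = X_std @ vectors
print(pcs.shape)  # (506, 2)
```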
16. Projection into the new feature space:
Creating a DataFrame containing the 1st and 2nd principal components.
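For instance (the column names here are illustrative, not necessarily those of the original notebook):

```python
import pandas as pd

# Collect the component scores in a DataFrame for inspection and plotting.
pca_df = pd.DataFrame(pcs, columns=["1st_principal", "2nd_principal"])
print(pca_df.head())
```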
18. Steps for PCA
Standardize the Data.
Calculate the covariance matrix.
Find the eigenvalues and eigenvectors of the covariance matrix.
Plot the eigenvectors / principal components over the scaled data.
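Putting the steps together, the final visualization might look like this (a matplotlib sketch, not the original notebook's code):

```python
import matplotlib.pyplot as plt

# Scatter plot of the data in the 2-D principal component space.
plt.scatter(pca_df["1st_principal"], pca_df["2nd_principal"], s=10)
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.title("Boston Housing data projected onto the top-2 PCs")
plt.show()
```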
19. 1) [True or False] PCA can be used for projecting and visualizing data in lower dimensions.
A. TRUE
B. FALSE
2) [True or False] We can apply PCA to an image dataset.
A. TRUE
B. FALSE
3) [True or False] PCA is based on variance maximization and distance minimization.
A. TRUE
B. FALSE
Exercise: implement PCA for number of components = 3 and then visualize the data; also load the iris dataset and perform the same task.
Assessment and Evaluation
Ans: 1-A, 2-A, 3-A
20. For the full code: https://github.com/Eshan2203/PCA-on-Boston-House-price-Data-Set/blob/master/PCA_BOston.ipynb