2. Applications of dimension reduction
• Computational advantage for other algorithms
• Face recognition: image data (pixels) projected along new axes works better for recognizing faces
• Image compression
4. Input → Output

Univ          SAT   Top10  Accept  SFRatio  Expenses  GradRate | PC1 PC2 PC3 PC4 PC5 PC6
Brown         1310   89     22      13      22,704     94      |
CalTech       1415  100     25       6      63,575     81      |
CMU           1260   62     59       9      25,026     72      |
Columbia      1310   76     24      12      31,510     88      |
Cornell       1280   83     33      13      21,864     90      |
Dartmouth     1340   89     23      10      32,162     95      |
Duke          1315   90     30      12      31,585     95      |
Georgetown    1255   74     24      12      20,126     92      |
Harvard       1400   91     14      11      39,525     97      |
JohnsHopkins  1305   75     44       7      58,691     87      |

(PC1–PC6 are the principal components, to be computed from the six input columns.)

The hope is that fewer columns may capture most of the information from the original dataset.
5. The Primitive Idea – Intuition First
How do we compress the data while losing the least amount of information?
6. Input → Output

Input:
• p measurements / original columns
• Correlated

Output:
• p principal components (= p weighted averages of the original measurements)
• Uncorrelated
• Ordered by variance
• Keep the top principal components; drop the rest
7. Mechanism
The ith principal component is a weighted average of the original measurements/columns:

$PC_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p$

The weights ($a_{ij}$) are chosen such that:
1. PCs are ordered by their variance (PC1 has the largest variance, followed by PC2, PC3, and so on)
2. Pairs of PCs have correlation = 0
3. For each PC, the sum of squared weights = 1
(A short R sketch checking these three properties follows.)
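As a quick check, here is a minimal R sketch verifying the three weight properties, assuming a data frame univ holding the six numeric columns from the table above (the name univ is illustrative, not from the slides):

pcaObj <- princomp(univ, cor = TRUE)  ## PCA on the correlation matrix
pcaObj$sdev^2                         ## PC variances, in decreasing order (property 1)
round(cor(pcaObj$scores), 3)          ## off-diagonals ~ 0: PCs are uncorrelated (property 2)
colSums(unclass(pcaObj$loadings)^2)   ## each weight column has sum of squares = 1 (property 3)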
8. Demystifying weight computation
• Main idea: high variance = lots of information
• Goal: find weights $a_{ij}$ that maximize the variance of $PC_i$, while keeping $PC_i$ uncorrelated with the other PCs.
• The covariance matrix of the X's is needed:

$PC_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p$

$\mathrm{Var}(PC_i) = a_{i1}^2\,\mathrm{Var}(X_1) + a_{i2}^2\,\mathrm{Var}(X_2) + \cdots + a_{ip}^2\,\mathrm{Var}(X_p) + 2a_{i1}a_{i2}\,\mathrm{Cov}(X_1, X_2) + \cdots + 2a_{i,p-1}a_{ip}\,\mathrm{Cov}(X_{p-1}, X_p)$

Also, we want $\mathrm{Cov}(PC_i, PC_j) = 0$ when $i \ne j$.
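The weights that solve this maximization are the eigenvectors of the covariance (here, correlation) matrix, and the resulting PC variances are its eigenvalues. A minimal sketch, reusing the illustrative univ data frame from above:

R <- cor(univ)  ## correlation matrix = covariance matrix of the standardized X's
e <- eigen(R)   ## eigen decomposition
e$values        ## Var(PC1), Var(PC2), ... in decreasing order
e$vectors       ## columns are the weight vectors (a_i1, ..., a_ip), up to sign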
10. Standardization shortcut for PCA
• Rather than standardizing the data manually, you can use the correlation matrix instead of the covariance matrix as input (see the sketch below)
• PCA with and without standardization gives different results!
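A minimal sketch of the equivalence, again assuming the illustrative univ data frame:

p1 <- princomp(univ, cor = TRUE)  ## shortcut: correlation matrix as input
p2 <- princomp(scale(univ))       ## manual standardization
p1$loadings                       ## loadings agree with p2's up to sign;
p2$loadings                       ## sdev differs only by an N vs N-1 divisor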
11. PCA
Transform > Principal Components (the correlation matrix has been used here)
• PCs are uncorrelated
• Var(PC1) > Var(PC2) > ...

$PC_i = a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p$

[Diagram: Scaled Data × Principal Components → PC Scores]
12. Computing principal scores
• For each record, we can compute its score on each PC.
• Multiply each weight ($a_{ij}$) by the appropriate $X_{ij}$
• Example for Brown University (using standardized numbers; see the sketch below):
PC1 score for Brown University = (–0.458)(0.40) + (–0.427)(0.64) + (0.424)(–0.87) + (0.391)(0.07) + (–0.363)(–0.32) + (–0.379)(0.80) = –0.989
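The same computation in R, as a hedged sketch (pcaObj and univ as in the earlier sketches; princomp standardizes internally when cor = TRUE, which we mimic here with scale()):

Z <- scale(univ)               ## standardized data
W <- unclass(pcaObj$loadings)  ## p x p weight matrix
Z[1, ] %*% W                   ## scores for the first record
pcaObj$scores[1, ]             ## matches, up to sign and a small N vs N-1 scaling factor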
13. R Code for PCA (Assignment)
OPTIONAL R Code
install.packages("gdata")  ## for reading xls files
install.packages("xlsx")   ## for reading xlsx files
library(xlsx)              ## load the package before calling read.xlsx
mydata <- read.xlsx("University Ranking.xlsx", 1)  ## use read.csv for csv files
mydata        ## make sure the data is loaded correctly
help(princomp)  ## to understand the API for princomp
pcaObj <- princomp(mydata[1:25, 2:7], cor = TRUE, scores = TRUE, covmat = NULL)
## the first column in mydata has university names
## princomp(mydata, cor = TRUE) is not the same as prcomp(mydata, scale = TRUE); similar, but different
summary(pcaObj)
loadings(pcaObj)
plot(pcaObj)
biplot(pcaObj)
pcaObj$loadings
pcaObj$scores
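For reference, a sketch of the prcomp alternative mentioned in the comment above (prcomp is SVD-based and generally preferred numerically; results match princomp up to sign):

prcompObj <- prcomp(mydata[1:25, 2:7], scale. = TRUE)
summary(prcompObj)  ## proportion of variance explained per PC
prcompObj$rotation  ## loadings (princomp calls these $loadings)
prcompObj$x         ## scores   (princomp calls these $scores)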
14. Goal #1: Reduce data dimension
• PCs are ordered by their variance (= information)
• Choose the top few PCs and drop the rest!
Example:
• PC1 captures most of the information (??%).
• The first 2 PCs capture ??%.
• Data reduction: use only two variables instead of 6 (see the sketch below).
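A sketch of where those percentages come from (pcaObj from slide 13):

summary(pcaObj)                    ## the "Cumulative Proportion" row supplies the ??% values
screeplot(pcaObj, type = "lines")  ## visual aid for choosing how many PCs to keep
reduced <- pcaObj$scores[, 1:2]    ## keep only the first two PCs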
15. Matrix Transpose
OPTIONAL: R code
help(matrix)
A <- matrix(c(1,2), nrow=1, ncol=2, byrow=TRUE)
A
t(A)
B <- matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
B
t(B)
C <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=TRUE)
C
t(C)
16. Matrix Multiplication
OPTIONAL R Code
A <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=TRUE)
A
B <- matrix(c(1,2,3,4,5,6,7,8), nrow=2, ncol=4, byrow=TRUE)
B
C <- A %*% B
D <- t(B) %*% t(A)  ## note, B %*% A is not possible; what does D look like?
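Answer to the question in the comment: D is exactly the transpose of C, since the transpose of A %*% B equals t(B) %*% t(A) in general:

all.equal(t(C), D)  ## TRUE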
18. Data Compression

$[\text{PC Scores}]_{N \times p} = [\text{Scaled Data}]_{N \times p} \times [\text{Principal Components}]_{p \times p}$

$[\text{Scaled Data}]_{N \times p} = [\text{PC Scores}]_{N \times p} \times [\text{Principal Components}]^{-1}_{p \times p} = [\text{PC Scores}]_{N \times p} \times [\text{Principal Components}]^{T}_{p \times p}$

(the inverse equals the transpose because the principal components matrix is orthogonal)

Approximation:

$[\text{Approximated Scaled Data}]_{N \times p} = [\text{PC Scores}]_{N \times c} \times [\text{Principal Components}]^{T}_{c \times p}$

where c = number of components kept; c ≤ p
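A sketch of this compression in R (pcaObj as before; c_keep plays the role of c):

c_keep <- 2
W      <- unclass(pcaObj$loadings)     ## p x p principal components matrix
scores <- pcaObj$scores[, 1:c_keep]    ## N x c compressed representation
approx <- scores %*% t(W[, 1:c_keep])  ## N x p approximate scaled data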
19. Goal #2: Learn relationships with PCA by interpreting the weights
• $a_{i1}, \dots, a_{ip}$ are the coefficients for $PC_i$.
• They describe the role of the original X variables in computing $PC_i$.
• Useful in providing a context-specific interpretation of each PC.
20. PC1 Scores (choose one or more)
1. are approximately a simple average of the 6 variables
2. measure the degree of high Accept & SFRatio, but low Expenses, GradRate, SAT, and Top10
21. Goal #3: Use PCA for visualization
• The first 2 (or 3) PCs provide a way to project the data from a p-dimensional space onto a 2D (or 3D) space (see the sketch below)
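A sketch of such a projection, reusing pcaObj and mydata from slide 13 (column 1 holds the university names):

plot(pcaObj$scores[, 1], pcaObj$scores[, 2], xlab = "PC1", ylab = "PC2")
text(pcaObj$scores[, 1], pcaObj$scores[, 2],
     labels = mydata[1:25, 1], pos = 3, cex = 0.7)  ## label points with university names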
23. Monitoring batch processes using PCA
• Multivariate data at different time points
• A historical database of successful batches is used
• Multivariate trajectory data is projected to a low-dimensional space
>>> Simple monitoring charts to spot outliers (illustrated below)
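A deliberately simplified illustration of the idea (real batch monitoring typically adds Hotelling's T² and residual charts; good_batches and new_batch are illustrative names, not from the slides):

pca <- princomp(good_batches, cor = TRUE)  ## PCs learned from successful batches
lim <- 3 * pca$sdev[1]                     ## 3-sigma limit for PC1
z   <- scale(new_batch, center = pca$center, scale = pca$scale)
s1  <- z %*% unclass(pca$loadings)[, 1]    ## PC1 score of the new batch
abs(s1) > lim                              ## TRUE flags a suspect batch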
24. Your Turn!
1. If we use a subset of the principal components, is this useful for prediction? For explanation?
2. What are the advantages and weaknesses of PCA compared to choosing a subset of the variables?
3. PCA vs. Clustering