Principal Component Analysis
and Clustering
Professor Daymond
27-Nov-2016
UNDERSTANDING BORROWER SEGMENTS
• Credit based accounts: The majority of accounts belong to credit-based borrowers, characterised by revolving utilization and by the largest numbers of revolving accounts and bankcards.
• Fixed instalment accounts: These accounts mostly carry fixed instalments such as car loans and student loans. The number of instalment accounts and instalment utilization are the major factors of this segment.
• Past due accounts: These are borrowers with past due records who carry most of the late fees on credit and loan amounts. Given their recent history of delinquency, this segment is medium risk.
• Highly inquired accounts: These are borrowers with the most loan inquiries. They exhibit the most credit card purchase behaviour and appear to try every possible loan in order to obtain one.
• Debt collections accounts: These accounts hold the largest number of public records, such as tax liens. Collections money owed and tax liens are the major factors of this segment.
• High risk delinquent accounts: With the highest delinquency, usage exceeding the credit limit and multiple accounts opened recently, this segment is high risk.
IDENTIFYING THE PRINCIPAL COMPONENTS
With the given dataset (N=27000) and 77 variables, it is important to reduce the data to a smaller set of variables in order to derive a feasible conclusion. Because of multicollinearity, two or more variables can lie close to the same plane in the n dimensions. Each row of the data can be envisioned as a point in a 77-dimensional space, and when the data is projected onto orthonormal axes, variables with shared characteristics are expected to group together as principal components. To identify these principal components, PROC PRINCOMP is executed with all the variables except the constant variables (recoveries and collection fees), and a plot of the eigenvalues of all the principal components is derived.
The eigenvalue of each principal component is the variance explained by that component: the greater the eigenvalue, the more variance the component explains. Hence the cut-off criteria for retaining components are that the eigenvalue must be greater than 1 and the cumulative variance should be at least 75%.
From the results (Appendix 1), there are 18 components with eigenvalues greater than 1, and together they contribute approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2), i.e. the weights of the linear combinations of the inputs. They give the direction of each principal component axis, while the eigenvalues give its length.
From the scree plot in Figure 1, the curve is almost flat after eigenvalue 1, implying that further components contribute very little to the variance. Hence a total of 18 principal components are retained, which together explain a significant share of the variance of the data.
Figure 1 Figure 2
INTERPRETING THE PRINCIPAL COMPONENTS
In order to interpret the principal components, the matrix of eigenvector coefficients is examined to find which original variables each component is most strongly associated with. PRINCOMP standardizes the data, so the values lie between -1 and 1. Values close to 1 in absolute value, whether positive or negative, indicate a strong association with the original variable.
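As a hedged illustration of this interpretation step, the sketch below computes the correlation between each standardized variable and each component score. For a PCA on the correlation matrix this equals the eigenvector coefficient multiplied by the square root of the eigenvalue. The function name `component_loadings`, the variable names and the synthetic data are hypothetical, not taken from the loan dataset.

```python
import numpy as np

def component_loadings(X):
    """Return (loadings, scores) for a PCA on the correlation matrix of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # loading[j, k] = correlation between variable j and component k
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    scores = Z @ eigvecs
    return loadings, scores

# Hypothetical variable names and synthetic data, for illustration only.
names = ["num_accounts", "num_bankcards", "months_since_recent", "inquiries"]
rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
X_demo = np.column_stack([
    a,
    a + 0.5 * rng.normal(size=300),
    b,
    b + 0.5 * rng.normal(size=300),
])
loadings, scores = component_loadings(X_demo)
strongest = names[int(np.argmax(np.abs(loadings[:, 0])))]
weakest = names[int(np.argmin(np.abs(loadings[:, 0])))]
print("PC1 most associated with:", strongest)
print("PC1 least associated with:", weakest)
```

Reading the largest and smallest absolute loadings per component is the same exercise the text describes for Principal Component 1.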
PRINCIPAL COMPONENT 1
From Figure 4, it is observed that the highest coefficients
belong to the variables counting the various numbers of accounts,
i.e. how valuable the customers are in terms of usage, and the
lowest to the duration since the most recent account, i.e.
how credible the customers are.
Similarly, each principal component is analysed for its
highest and lowest coefficients, and the results are tabulated
for reference.
Figure 3
Figure 4
IDENTIFYING THE CLUSTERS
Once the principal components are identified, the next step is to feed the principal component scores into a clustering step: after PROC STDIZE, the FASTCLUS procedure is run with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach used here to identify approximately equal-sized clusters with a decent spread. A set of observations is selected as the initial seeds, i.e. reference means. Each observation is then assigned to its nearest seed to form temporary clusters, the seeds are replaced with the means of the new clusters, and this is repeated until the clusters no longer change. The message ‘Complete convergence is satisfied’ implies that the final seeds are equal to the
cluster means.
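The iteration described above can be sketched as a plain k-means loop. This is a simplified illustration under stated assumptions, not the FASTCLUS algorithm: seeds are drawn at random here, whereas FASTCLUS has its own seed selection and replacement rules, and the function name `kmeans` and the synthetic blobs are hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means: returns (seeds, labels, iterations used)."""
    rng = np.random.default_rng(seed)
    # Assumption: initial seeds are k distinct observations picked at random.
    seeds = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(1, max_iter + 1):
        # Assign every observation to its nearest seed (temporary clusters).
        distances = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Replace each seed with the mean of its temporary cluster.
        new_seeds = seeds.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members):
                new_seeds[c] = members.mean(axis=0)
        # Converged: the seeds already equal the cluster means.
        if np.allclose(new_seeds, seeds):
            return new_seeds, labels, iteration
        seeds = new_seeds
    return seeds, labels, max_iter

# Hypothetical data: three separated blobs in two dimensions.
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.3, size=(50, 2)),
])
centres, labels, n_iter = kmeans(blobs, k=3)
print("converged after", n_iter, "iterations")
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Running the same function for each cluster count from 3 to 20 and comparing the results mirrors the batch comparison described in the text.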
Summary
The cluster summary statistics display the frequency of observations in each
cluster and the root mean square deviation. The next column displays the largest
distance from the seed to an observation, i.e. approximately the total spread of
the cluster. The last column displays the distance from the centre of the cluster
to the centre of the nearest cluster.
Six appropriately sized clusters are obtained from the 14-cluster solution, at the 35th iteration.
Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be
the nearest cluster for all the other clusters.
Goodness-of-fit metrics
Higher values of the Pseudo F statistic are preferred when choosing a good number of
clusters.
R-square measures the proportion of variance accounted for by the clusters.
Higher CCC values indicate good clustering; they are generally expected to be more
than 2 or 3.
A higher Pseudo F statistic and a higher CCC imply that the clustering solution is good.
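Two of these metrics are simple enough to compute by hand. The sketch below derives R-square from the within-cluster and total sums of squares and the pseudo F statistic from R-square, the number of clusters c and the number of observations n. The CCC is left out because its formula is considerably more involved. The function name `cluster_fit_stats` and the four toy points are assumptions for illustration.

```python
import numpy as np

def cluster_fit_stats(points, labels):
    """Return (r_square, pseudo_f) for a given cluster assignment."""
    n = len(points)
    clusters = np.unique(labels)
    c = len(clusters)
    total_ss = ((points - points.mean(axis=0)) ** 2).sum()
    within_ss = sum(
        ((points[labels == g] - points[labels == g].mean(axis=0)) ** 2).sum()
        for g in clusters
    )
    # Share of the total variance accounted for by the clusters.
    r_square = 1.0 - within_ss / total_ss
    # Pseudo F = [R^2 / (c - 1)] / [(1 - R^2) / (n - c)]
    pseudo_f = (r_square / (c - 1)) / ((1.0 - r_square) / (n - c))
    return r_square, pseudo_f

# Toy example: two tight pairs of points far apart on the x axis.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = cluster_fit_stats(toy, np.array([0, 0, 1, 1]))   # groups the nearby pairs
bad = cluster_fit_stats(toy, np.array([0, 1, 0, 1]))    # splits each pair apart
print("good split: R-square=%.4f, pseudo F=%.2f" % good)
print("bad split:  R-square=%.4f, pseudo F=%.2f" % bad)
```

The assignment that groups nearby points gets the higher R-square and the higher pseudo F, which is the pattern the text uses to compare clustering solutions.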
IDENTIFYING THE CLUSTERS
Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each
cluster is analysed for its higher and lower coefficients in order to understand the relation between the principal components and the cluster segments.
Figure 4
The clusters are analysed and interpreted with respect to the loan data
variables. Figure 4 displays the customer segments identified after
the analysis of the coefficient matrix. These are the major segments
of the loan data:
• Credit based – revolving accounts
• Fixed instalment based loan accounts
• accounts that are mostly past due on credit and late fees
• accounts that are highly inquired
• accounts that are high on the "percentage greater than 75%" measure and create many new accounts
Further, PROC UNIVARIATE is executed on the new cluster dataset,
and the output is approximately consistent with the box plots.
Hence the segments are taken to be approximately correct.
Figure 5 Boxplot of instalment accounts over all clusters
Figure 6 Boxplot of Percentage greater than 75 over all clusters
SCORING THE NEW DATA
The new data is then scored with the statistics from the old data and the segments are identified. Scoring the new dataset consists of the following steps:
• The output statistics from PRINCOMP are used to score the new dataset
• The output from STDIZE is used as input to standardize the newly scored dataset
• The output statistics from FASTCLUS are used as input statistics to assign clusters in the new dataset
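The three scoring steps can be sketched as a small pipeline in which every statistic is estimated on the old data and only applied to the new data. This is a hedged NumPy illustration, not the SAS scoring mechanism: the function names `fit_stats` and `score`, the dictionary keys and the synthetic two-group data are all hypothetical.

```python
import numpy as np

def fit_stats(old, n_components):
    """Estimate on the OLD data everything needed to score new data."""
    mean, std = old.mean(axis=0), old.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(old, rowvar=False))
    axes = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    scores = ((old - mean) / std) @ axes
    return {"mean": mean, "std": std, "axes": axes,
            "score_mean": scores.mean(axis=0),
            "score_std": scores.std(axis=0, ddof=1)}

def score(new, stats, seeds):
    """Apply stored statistics: PCA scores, standardization, nearest seed."""
    scores = ((new - stats["mean"]) / stats["std"]) @ stats["axes"]       # step 1
    scores = (scores - stats["score_mean"]) / stats["score_std"]          # step 2
    distances = np.linalg.norm(scores[:, None, :] - seeds[None, :, :], axis=2)
    return scores, distances.argmin(axis=1)                               # step 3

rng = np.random.default_rng(1)

def make_data(n):
    """Hypothetical data: two well separated groups, four correlated columns."""
    group = np.repeat([0, 1], n)
    latent = np.where(group[:, None] == 0, -6.0, 6.0) + rng.normal(size=(2 * n, 1))
    data = np.hstack([latent + 0.3 * rng.normal(size=(2 * n, 1)) for _ in range(4)])
    return data, group

old, old_group = make_data(100)
new, new_group = make_data(100)
stats = fit_stats(old, n_components=2)
old_scores, _ = score(old, stats, np.zeros((1, 2)))
# Stand-in for the final cluster seeds: group means of the old standardized scores.
seeds = np.array([old_scores[old_group == g].mean(axis=0) for g in (0, 1)])
new_scores, new_labels = score(new, stats, seeds)
accuracy = float(np.mean(new_labels == new_group))
print("share of new rows assigned to the expected segment:", accuracy)
```

Reusing the old means, standard deviations, eigenvectors and seeds is what makes the cluster numbers in the new data comparable with those in the old data.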
Figure 7 displays the frequency distribution of the means across the new and old datasets for comparison. The clusters are
approximately the same in both, which indicates that the segments have been identified consistently.
OLD DATA
NEW DATA
LEARNINGS
Identifying the principal components is complex, and clustering on them afterwards
gives a much clearer picture
With very little business knowledge, identifying the clusters and verifying the
segments was difficult
Learnt how to write a macro to run the clustering for 3 to 20 clusters and then identify the
best solution from the batch
Use of UNIVARIATE was a revelation when my segments matched the box
plots, even though I am not sure the segments are correct as such.
APPENDIX 1 – EIGENVALUES WHEN CURVE CHANGES
APPENDIX 2 – EIGENVECTORS OF FIRST 10 PRINCIPAL COMPONENTS

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 

Principal Component Analysis and Clustering

  • 1. Principal Component Analysis and Clustering Professor Daymond 27-Nov-2016
  • 2. UNDERSTANDING BORROWER SEGMENTS Credit based accounts: the majority of the accounts belong to credit-based borrowers, characterised by revolving utilization and by having the most revolving accounts and bankcards. Fixed instalment accounts: these accounts mostly carry fixed instalments such as car loans and student loans; the number of instalment accounts and instalment utilization are the major factors of this segment. Past due accounts: these are borrowers with past-due records who account for most of the late fees on credit and loan amounts; with a recent history of delinquency, this segment is medium risk. Highly inquired accounts: these are borrowers with many loan inquiries, who show the most credit card purchase behaviour and appear to try every possible loan. Debt collections accounts: these accounts hold the largest number of public records, such as tax liens; collections money owed and tax liens are the major factors of this segment. High risk delinquent accounts: the highest delinquency, usage beyond the credit limit and multiple recently opened accounts make this segment high risk.
  • 3. IDENTIFYING THE PRINCIPAL COMPONENTS The dataset has N=27000 observations and 77 variables, so it is important to reduce it to a smaller set of variables before drawing conclusions. Because of multicollinearity, two or more variables can lie along nearly the same direction in the n-dimensional variable space. Each row of the data can be envisioned as a point in a 77-dimensional space; when the data is projected onto orthonormal axes, correlated characteristics are expected to group together as principal components. To identify these components, PROC PRINCOMP is executed with all the variables except the constant variables (recoveries and collection fees), and a plot of the eigenvalues of all the principal components is produced. The variance of each principal component is given by its eigenvalue: the greater the eigenvalue, the more variance the component explains. The cut-off criteria are therefore that the eigenvalue must be greater than 1 and that the cumulative variance should be at least 75%. From the results (Appendix 1), 18 components have eigenvalues greater than 1, and together they account for approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2); each eigenvector is a linear combination of the inputs that gives the direction of a component, while the eigenvalue gives the length of that axis. In the scree plot (Figure 1), the curve is almost flat once the eigenvalues fall below 1, implying that the remaining components contribute very little to the variance. Hence a total of 18 principal components explain a significant share of the variance in the data. Figure 1 Figure 2
  • 4. INTERPRETING THE PRINCIPAL COMPONENTS To interpret the principal components, the matrix of eigenvector coefficients is examined to find which original variables each component is most strongly associated with. PRINCOMP standardizes the data (by default it analyses the correlation matrix), so the coefficients have absolute values no greater than 1. Coefficients close to an absolute value of 1, whether positive or negative, indicate a strong relationship with the original variable. PRINCIPAL COMPONENT 1 From Figure 4, the highest coefficients belong to the variables that count the various numbers of accounts (how valuable the customers are in terms of usage), and the lowest belong to the duration since the most recent account (how credible the customers are). Similarly, each principal component is analysed for its highest and lowest coefficients, and the results are tabulated for reference. Figure 3 Figure 4
  • 5. IDENTIFYING THE CLUSTERS Once the principal components are identified, the next step is to standardize them with PROC STDIZE and run the FASTCLUS procedure on them with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach that helps to identify approximately equal-sized clusters with a decent spread. A set of observations is selected as the initial seeds, each observation is assigned to its nearest seed to form temporary clusters, the seeds are replaced with the means of those clusters, and this is repeated until the clusters no longer change. ‘Complete convergence is satisfied’ implies that the final seeds are equal to the cluster means. Summary The cluster summary displays the frequency of observations in each cluster and the root mean square standard deviation. The next column displays the largest distance from the seed to an observation, which approximates the total spread of the cluster. The last column displays the distance from the centre of the cluster to the centre of the nearest cluster. With MAXCLUSTERS set to 14, and at the 35th iteration, six clusters of appropriate size are obtained. Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be the nearest cluster for all the other clusters. Goodness-of-fit metrics Higher values of the Pseudo F Statistic are preferred when choosing the number of clusters. R-square is the proportion of variance accounted for by the clusters. Higher CCC values indicate good clustering; values above 2 or 3 are generally expected. A high Pseudo F Statistic together with a high CCC implies that the clustering solution is good.
  • 6. IDENTIFYING THE CLUSTERS FASTCLUS also displays the cluster means and standard deviations of the variables. As with the principal components, each cluster is analysed for its highest and lowest values to understand the relation between the principal components and the cluster segments. The clusters are then analysed and described in terms of the loan data variables. Figure 4 displays the customer segments identified after the analysis of the coefficient matrix. These are the major segments of the loan data: • Credit based – revolving accounts • Fixed instalment based loan accounts • Accounts that are mostly past due on credit and late fees • Accounts that are highly inquired • Accounts with a high percentage greater than 75% that open many new accounts Further, PROC UNIVARIATE is executed on the new cluster dataset, and the box plots are approximately consistent with these segments, which suggests that the segments are almost correct. Figure 4 Figure 5 Boxplot of instalment accounts over all clusters Figure 6 Boxplot of Percentage greater than 75 over all clusters
  • 7. SCORING THE NEW DATA The new data is then scored with the statistics from the old data, and the segments are identified. Scoring the new dataset consists of the following steps: • The output statistics from PRINCOMP are used to score the new dataset • The output from STDIZE is used to standardize the newly scored dataset • The output statistics from FASTCLUS are used as the input statistics for clustering the new dataset Figure 7 displays the frequency distribution of the means across the new and old datasets for comparison. The clusters are approximately the same, which indicates that the segments have been identified correctly. Figure 7 (panels: OLD DATA, NEW DATA)
  • 8. LEARNINGS Identifying the principal components is complex, and clustering them afterwards gives a much clearer picture. With very little business knowledge, identifying the clusters and verifying the segments was difficult. Learnt how to write a macro that runs the clustering for 3 to 20 clusters and then identifies the best solution from the batch. The use of UNIVARIATE was a revelation when my segments matched the box plots, even though I am not sure whether the segments are correct as such.
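The component cut-off described in slide 3 (eigenvalue greater than 1, cumulative variance of at least 75%) can be sketched in a few lines of Python. This is only an illustration of the rule, not the PROC PRINCOMP output: the function name `select_components` and the eigenvalues in the example are hypothetical and are not the Appendix 1 values.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.75):
    """Apply the two cut-off rules from slide 3 to a list of eigenvalues.

    Returns (number of components kept, their cumulative share of the
    total variance, whether that share reaches min_cumulative).
    """
    total = sum(eigenvalues)
    # Keep only the components whose eigenvalue exceeds the threshold.
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > min_eigenvalue]
    cumulative = sum(kept) / total
    return len(kept), cumulative, cumulative >= min_cumulative


# Hypothetical eigenvalues for a six-variable problem (illustration only).
example = [3.0, 2.0, 1.5, 0.8, 0.4, 0.3]
n_kept, share, enough = select_components(example)
print(n_kept, round(share, 4), enough)
```

If the cumulative share falls short of the target, the two rules disagree and the analyst has to decide which one to relax; the sketch only reports that situation and does not resolve it.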
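The iterative procedure described in slide 5 (assign each observation to its nearest seed, replace the seeds with the cluster means, repeat until nothing changes) can be sketched as a plain k-means loop, together with the Pseudo F Statistic used to compare solutions. This is a minimal sketch under stated assumptions: FASTCLUS has its own seed selection and options that are not reproduced here, the seeds are passed in by hand, and the function names and toy points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, seeds, max_iter=100):
    """Plain k-means: stop when the cluster assignments no longer change."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers equal the cluster means
        labels = new_labels
        # Replace each center with the mean of its members (empty clusters keep their center).
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return centers, labels


def pseudo_f(points, centers, labels):
    """Pseudo F = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)).

    Assumes k > 1, n > k and a non-zero within-cluster sum of squares.
    """
    n, k = len(points), len(centers)
    grand = mean_point(points)
    within = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    between = sum(squared_distance(centers[lab], grand) for lab in labels)
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups, with one hand-picked seed from each.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(toy, seeds=[toy[0], toy[3]])
print(labels, pseudo_f(toy, centers, labels))
```

Running `kmeans` over a range of cluster counts and comparing `pseudo_f` across the runs mirrors, in spirit, the macro over MAXCLUSTERS 3 to 20 mentioned in the learnings; CCC is not computed here.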
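The three scoring steps in slide 7 chain stored statistics from the old data onto each new observation. The sketch below shows that chain for a single row, assuming the training run saved the variable means and standard deviations, the eigenvectors, the location and scale used to standardize the component scores, and the final cluster seeds. All parameter names and the toy numbers are hypothetical; the actual SAS output datasets are not reproduced.

```python
def score_observation(row, var_means, var_stds, eigenvectors,
                      score_means, score_stds, cluster_seeds):
    """Assign one new observation to a cluster using statistics from the old data.

    Returns the 1-based number of the nearest cluster seed.
    """
    # Step 1: standardize the raw variables and project onto the eigenvectors
    # (the role played by the PRINCOMP statistics).
    z = [(x - m) / s for x, m, s in zip(row, var_means, var_stds)]
    scores = [sum(w * v for w, v in zip(vec, z)) for vec in eigenvectors]
    # Step 2: standardize the component scores with the old location and scale
    # (the role played by the STDIZE statistics).
    std_scores = [(c - m) / s for c, m, s in zip(scores, score_means, score_stds)]
    # Step 3: pick the nearest final seed (the role played by the FASTCLUS statistics).
    distances = [
        sum((a - b) ** 2 for a, b in zip(std_scores, seed))
        for seed in cluster_seeds
    ]
    return distances.index(min(distances)) + 1


# Toy statistics for a two-variable, two-component, two-cluster setup.
stats = dict(
    var_means=[0.0, 0.0],
    var_stds=[1.0, 1.0],
    eigenvectors=[[0.6, 0.8], [-0.8, 0.6]],  # orthonormal directions
    score_means=[0.0, 0.0],
    score_stds=[2.0, 1.0],
    cluster_seeds=[[-1.0, 0.0], [1.0, 0.0]],
)
print(score_observation([3.0, 4.0], **stats))
```

The key design point is that nothing is re-estimated on the new data: means, scales, eigenvectors and seeds all come from the old run, so the resulting segments stay comparable between the two datasets.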