Cluster Analysis and the Cost of Simplicity
Supervised vs. Unsupervised Discretization
Britney Cook and Reuben Hilliard
Department of Statistics and Analytical Sciences
Advisor: Dr. Jennifer Priestley

Introduction
When it comes to defining variable transformations, there are two general approaches: supervised (user-defined) and unsupervised (SAS-defined). It would seem that the more mathematically optimal method, unsupervised, produces models that profit more than those based on variables built with a supervised approach, which leaves room for human interpretation and error. However, the most profitable transformations can often result in a cumbersome output of variables. This leaves one wondering: should the ability to explain variables be sacrificed for the most profitable form? And if the answer is no, where should the line be drawn? This concept, known as the cost of simplicity, is heavily debated and will be addressed in this presentation.
Another matter examined here is the difference in profit between the supervised and unsupervised methods. In a binary classification model using logistic regression, 50 of 441 variables were found to be significant in generating a credit risk score, labeled GOODBAD; this variable took a value of either 0 (good) or 1 (bad). The variables used in this analysis were drawn from this 50-variable pool. Profits for the supervised and unsupervised methods were compared across three transformations: ordinal, odds, and log of odds. The differences between these methods and transformations are described in the following procedure.
Procedure
After the data were cleaned, the next step was the transformation of the variables. For the user-defined transformations, an equal-widths logic was used to define the ranks. This was done by examining a histogram of each variable and evaluating the range and distribution of the data. Where the majority of the observations lay, the ranks were assigned equal widths; where the distribution tailed off, the ranks were assigned increasingly wider intervals to compensate.
Figure 1 is a plot of the user-defined ranks for the variable
BRAVGMOS, or the average number of months per bank revolving
account open. The exact values for each rank can be found in Table 1.
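A minimal sketch of this kind of user-defined binning, with the dataset name and cutpoints purely illustrative (the actual boundaries come from the histogram evaluation described above and are listed in Table 1):

/* User-defined equal-widths ranks for BRAVGMOS: equal-width bins  */
/* where observations are dense, wider bins in the tail.           */
/* Dataset name and cutpoint values here are illustrative only.    */
data ranked_user;
   set cleaned;
   if      BRAVGMOS <  20 then BRAVGMOS_rank = 1;
   else if BRAVGMOS <  40 then BRAVGMOS_rank = 2;
   else if BRAVGMOS <  80 then BRAVGMOS_rank = 3;  /* bins widen in the tail */
   else                        BRAVGMOS_rank = 4;
run;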
For the SAS-defined transformations, an equal-frequencies logic was used to define the ranks. This was automated with PROC RANK after specifying the desired number of ranks. The number of ranks needed to be high enough that meaningful information was not masked, but not so high that the result became cumbersome; ten ranks seemed appropriate. Figure 2 is a plot of the SAS-defined ranks for the variable BRAVGMOS. The exact values for each rank can be found in Table 2.
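The equal-frequencies version reduces to a single PROC RANK step; a minimal sketch, with input and output dataset names assumed:

/* Ten near-equal-frequency ranks for BRAVGMOS (group values 0-9). */
proc rank data=cleaned out=ranked_sas groups=10;
   var BRAVGMOS;
   ranks BRAVGMOS_rank;
run;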
Wherever the difference in spread between ranks was insignificant, as determined by a flat line (user-defined) or a t-test (SAS-defined), the two ranks were collapsed into one, which is why Figure 2 shows only nine ranks. After the ranks were finalized, they were given ordinal values, yielding the ordinal transformation for the user- and SAS-defined procedures. From the ordinal transformations, the odds and log of odds transformations were created.
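A sketch of these two follow-up steps, under the same assumed names: a t-test of GOODBAD between a pair of adjacent SAS-defined ranks (an insignificant difference justifies collapsing them), then the odds and log-of-odds transformations derived from each finalized rank's event rate:

/* t-test between two adjacent ranks; collapse them if the      */
/* difference in the dependent variable is insignificant.       */
proc ttest data=ranked_sas;
   where BRAVGMOS_rank in (3, 4);   /* illustrative adjacent pair */
   class BRAVGMOS_rank;
   var GOODBAD;
run;

/* Event rate per finalized rank, then the odds and log-of-odds */
/* versions of the ordinal transformation.                      */
proc means data=ranked_sas noprint nway;
   class BRAVGMOS_rank;
   var GOODBAD;
   output out=rank_rates mean=p_bad;
run;

data rank_transforms;
   set rank_rates;
   odds     = p_bad / (1 - p_bad);
   log_odds = log(odds);
run;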
After all of the variable transformations were applied for both the supervised and unsupervised methods, the 441 variables were reduced to the 50 most significant in predicting the dependent variable GOODBAD. Then, for each of the six transformations, the top five variables for that transformation were set aside for cluster analysis.
PROC LOGISTIC was first used to determine the profit per 1,000 of the original prediction model, prior to clustering. Using these same variables, PROC CLUSTER was then used to generate three criteria: the Cubic Clustering Criterion (CCC), pseudo-F, and pseudo t-squared, which together help determine the optimal number of clusters. The optimal number is found where there is a peak in the CCC, a peak in pseudo-F, and a dip in pseudo t-squared; Figure 3 shows these three events occurring at five clusters. Next, PROC FASTCLUS was used to cluster the data into five groups, followed by PROC REPORT, which generated the profitability per 1,000 for each of the five clusters.
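A condensed sketch of that pipeline, with v1-v5 standing in for the top five transformed variables and dataset names assumed:

/* Baseline model, prior to clustering; scored probabilities feed  */
/* the profit-per-1,000 calculation.                               */
proc logistic data=train;
   model GOODBAD(event='1') = v1-v5;
   output out=scored p=p_bad;
run;

/* Clustering diagnostics: CCC, pseudo-F, and pseudo t-squared.    */
proc cluster data=train method=ward ccc pseudo outtree=tree;
   var v1-v5;
run;

/* Assign each observation to one of the five clusters.            */
proc fastclus data=train maxclusters=5 out=clustered;
   var v1-v5;
run;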
Using the clusters constructed, a transactional dataset was merged in with a left join on MATCHKEY, a customer identifier. The top 20 SIC codes from each cluster were then collected and used to characterize the spending behaviors of the customers in each cluster, as shown in Table 5. These trends were then compared across clusters so that customer characteristics could be categorized for both profitable and non-profitable clusters. Two additional procedures were implemented to see whether other variables, like the spending characteristics, varied across clusters: a two-way frequency table (Table 4) and a heat map (Figure 5). The results of these procedures can be found in the following section.
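A sketch of the merge and the follow-up cross-tabulation, assuming the transactional dataset trans carries MATCHKEY and an SIC code per transaction:

/* Left join the transactional data onto the clustered customers. */
proc sql;
   create table clustered_trans as
   select a.*, b.SIC_code
   from clustered as a
   left join trans as b
     on a.MATCHKEY = b.MATCHKEY;
quit;

/* Two-way frequency table of TRATE3 by cluster (Table 4). */
proc freq data=clustered;
   tables CLUSTER * TRATE3;
run;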
Results
Figure 4, along with Figures 6, 7, and 8, clearly indicates that the unsupervised, mathematically optimal approach results in models that profit more than those built with a supervised approach. The log of odds transformation under the unsupervised approach produced the most profitable top cluster, at $222,078.40 per 1,000 customers. This was significantly more than the unclustered model, which profited only $78,690.72 per 1,000, as shown in Table 3.
After evaluating customer characteristics for each cluster in the most profitable transformation, it was discovered that clusters containing customers who use their credit cards for automated cash disbursement and liquor store purchases profit $70,000 to $150,000 less (roughly half as much) than clusters whose customers do not. Also, customers in the least profitable cluster used their credit card inside service stations 40% more often than customers in the most profitable cluster. Customers in the most profitable cluster were also found to use their credit card more than 50% more often at dine-in than at fast-food restaurants, whereas customers in the least profitable cluster used their card about equally at both. In Table 4 it can be seen that the least profitable cluster, cluster 5, had the highest percentage for TRATE3, the number of accounts 60 days past due; that is, cluster 5 had the highest number of people with one, two, three, or four accounts past due. In Figure 5 it was found that for the two most profitable clusters, a higher BNKINQ2, the number of bank inquiries in the past six months, resulted in a predicted probability of GOODBAD as low as 0.2. For the two least profitable clusters, a higher BNKINQ2 resulted in a predicted probability of GOODBAD only as low as 0.5. In other words, for customers in the most profitable clusters, the number of inquiries had comparatively little effect on the predicted probability of default.
Conclusion
Upon discovering that the most complex transformation, using the unsupervised approach, was the most profitable, the next question was whether the ability to explain the variables was more important than achieving the highest possible profit. To answer this question, the next highest profit and the transformation from which it was derived were evaluated to see whether the tradeoff was worthwhile. In Figure 4, the second most profitable transformation was the ordinal, using the unsupervised approach. This transformation was the simplest and easiest to explain of all the transformations, but earned about $19,000 less in its top cluster and about $122,000 less in overall profit. This is the cost of simplicity, and it is sometimes not an easy call to make. Is having variables that can be explained in simple terms really worth $122,000? When a decision carries such a high cost, it is best to consult the client and ask what matters most to their organization. In our case, however, there was no client, and so the decision was made to keep the complex model in favor of the most profitable outcome: $759,363.34 per 5,000 customers.
Sample Code
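As one example of the kind of code used, a hedged sketch of a PROC REPORT step that could produce the per-cluster profitability per 1,000 of Table 6; the profit variable name is assumed:

/* Per-cluster profit per 1,000 customers; the profit variable  */
/* name is hypothetical, not taken from the original poster.    */
proc report data=clustered nowd;
   column CLUSTER n profit profit_per_1000;
   define CLUSTER         / group 'Cluster';
   define n               / 'Customers';
   define profit          / analysis sum format=dollar14.2 'Total Profit';
   define profit_per_1000 / computed format=dollar14.2 'Profit per 1,000';
   compute profit_per_1000;
      profit_per_1000 = profit.sum / (n / 1000);
   endcomp;
run;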
Figure 1: Plot of User-Defined Ranks for BRAVGMOS
Figure 2: Plot of SAS-Defined Ranks for BRAVGMOS
Figure 3: Cluster Analysis for the Log of Odds (Unsupervised)
Figure 4: Stacked Bar Plot of Profit by Transformation Grouped by Cluster
Figure 5: Heat Map of BNKINQ2 by Cluster for Log of Odds (Unsupervised)
Figure 6: Bubble Plot of Cluster Size by Cluster Profit (Ordinal; teal = supervised, olive = unsupervised)
Figure 7: Bubble Plot of Cluster Size by Cluster Profit (Odds; yellow = supervised, red = unsupervised)
Figure 8: Bubble Plot of Cluster Size by Cluster Profit (Log of Odds; green = supervised, purple = unsupervised)
Table 5: SIC Code Comparison of Clusters for Log of Odds (Unsupervised)
Table 6: Cluster Profitability for Log of Odds (Unsupervised)
Table 3: Profitability per 1,000 for Model and Top Cluster

Transformation     Model          Top Cluster
Ordinal (S)        $106,873.47    $183,952.57
Ordinal (U)        $108,180.97    $203,738.01
Odds (S)           $69,964.89     $173,323.94
Odds (U)           $65,514.19     $166,386.46
Log of Odds (S)    $72,017.65     $204,782.40
Log of Odds (U)    $78,690.72     $222,078.40

S = Supervised, U = Unsupervised
Table 4: Two-Way Frequency Table of TRATE3 by Cluster for Log of Odds (Unsupervised)
Each cell: Frequency / Percent / Row Percent / Column Percent

Cluster   TRATE3 = 0                       TRATE3 = 1                      TRATE3 = 2                     TRATE3 = 3                    TRATE3 = 4                   Total
1         31856 / 4.23 / 76.27 / 5.78      6761 / 0.90 / 16.19 / 5.18      2133 / 0.28 / 5.11 / 4.63      744 / 0.10 / 1.78 / 4.09      271 / 0.04 / 0.65 / 3.62     41765 / 5.55
2         173373 / 23.02 / 76.50 / 31.48   34990 / 4.65 / 15.44 / 26.81    11548 / 1.53 / 5.10 / 25.05    4670 / 0.62 / 2.06 / 25.69    2060 / 0.27 / 0.91 / 27.53   226641 / 30.10
3         65042 / 8.64 / 83.07 / 11.81     9736 / 1.29 / 12.43 / 7.46      2491 / 0.33 / 3.18 / 5.40      768 / 0.10 / 0.98 / 4.23      264 / 0.04 / 0.34 / 3.53     78301 / 10.40
4         59655 / 7.92 / 72.69 / 10.83     14824 / 1.97 / 18.06 / 11.36    5033 / 0.67 / 6.13 / 10.92     1855 / 0.25 / 2.26 / 10.21    696 / 0.09 / 0.85 / 9.30     82063 / 10.90
5         220809 / 29.32 / 68.10 / 40.09   64207 / 8.53 / 19.80 / 49.19    24896 / 3.31 / 7.68 / 54.00    10140 / 1.35 / 3.13 / 55.78   4194 / 0.56 / 1.29 / 56.01   324243 / 43.06
Total     550735 / 73.14                   130518 / 17.33                  46101 / 6.12                   18177 / 2.41                  7482 / 0.99                  753013 / 100.00
  
Table 1: User-Defined Ranks for BRAVGMOS

Rank   Frequency   Average Independent   Average Dependent
1      176173      13                    0.24165
2      167186      24                    0.18893
3      829725      58                    0.16499
4      82345       120                   0.11585

Table 2: SAS-Defined Ranks for BRAVGMOS

Rank   Frequency   Average Independent   Average Dependent
1      114440      10                    0.25848
2      129842      20                    0.20406
3      254353      31                    0.181142
4      123412      44                    0.17609
5      129910      51                    0.16705
6      114318      59                    0.16102
7      133391      69                    0.15858
8      125048      81                    0.15309
9      130715      111                   0.12426
  
Figures 4, 5, 6, 7, and 8 were all generated using SAS® Visual Analytics.
