
Binary Classification Modeling
Using Logistic Regression to Build Credit Scores
Britney Cook and Reuben Hilliard
Supervised by Jennifer Lewis Priestley, Ph.D.
Kennesaw State University
Submitted April 24, 2015
to fulfill the requirements for STAT4330


Executive Summary
The initial objective of this project was to build a binary classification model to predict whether a
potential customer would default on a credit line. As the analysis progressed, decision points arose
where the choice was either to optimize the model in the most mathematically appropriate manner
or to justify a simpler alternative. In most cases it was decided to take the most optimal route in
order to increase profit. Ultimately the analysis ended with a model composed of variables in their
raw, ordinal, odds and log of the odds forms. This model proved to be very complex and difficult
to explain. However, the final model, which contained only 10 variables, was found to profit
$113,956.31 per 1,000 customers. Furthermore, when a cluster analysis was performed using 5
significant transformed variables, the profitability per 1,000 customers almost doubled to
$222,078.40. Interestingly, the variables that were used to create these highly profitable clusters
all came from the log of odds unsupervised transformation, the most complex and difficult form
to explain. Though this made interpreting the variables very difficult, the objective of
maximizing profitability had been met.
After building the final model, the next objective was to further optimize profitability by
setting a cut-off point for the probability of defaulting. A classification table and profitability
function were analyzed to identify the cut-off value that would yield the highest profitability.
The profitability table output made it clear that there was room for improvement in both the
Type 1 Error, which resulted in a loss of $42,400,468.50, and the Type 2 Error, which resulted in
an opportunity cost of $26,909,750. Overall the model profited $85,810,581.50.


Introduction
This research paper describes the process and results of developing a binary classification model,
using Logistic Regression, to generate Credit Risk Scores. These scores are then used to
maximize a profitability function.
The data for this project came from a Sub-Prime lender. Three datasets were provided:
• CPR. 1,462,955 observations and 338 variables. Each observation represents a unique
customer. This file contains all of the potential predictors of credit performance. The
variables have differing levels of completeness.

• PERF. 17,244,104 observations and 18 variables. This file contains the post-hoc
performance data for each customer, including the response variable for modeling –
DELQID.
• TRAN. 8,536,608 observations and 5 variables. This file contains information on the
transaction patterns of each customer.
Each file contains a consistent “MATCHKEY” variable which was used to merge the datasets.
The process for the project included:
[Process diagram. Stages: Data Discovery; Variable Preparation; Modeling. Steps: Assignment of
Dependent Variable; Data Cleansing and Imputation; Multicollinearity assessment using Regression
and VIF; Odds, Correlation and Plots; Discretization and transformations; Sampling; Model
Development; Model Evaluation.]
Each of these processes will be discussed in turn.


Data Discovery
Before any analysis could take place, the two datasets, CPR and PERF, needed to be merged. For
the merge, an identifier of the individual customer, labeled MATCHKEY, was used. Some of the
options for merging the data included left join, right join, outer join, and inner join.
Supposing that CPR was on the left and PERF was on the right, as shown in the diagram below, an
inner join would be most appropriate for the following reasons:
• A left join would result in some MATCHKEYs having no post-hoc performance data,
including our dependent variable, DELQID, or delinquent ID number, along with
CRELIM, or credit limit after getting approved for a credit line.
• A right join would result in some MATCHKEYs having no potential predictor data,
which is what will help determine credit performance.
• An outer join would result in some MATCHKEYs lacking either post-hoc performance data or
potential predictor data.
Using an inner join would result in MATCHKEYs that contained both post-hoc performance and
potential predictor data, two essential pieces of information necessary in developing a binary
classification model that optimizes profitability.
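The effect of the join choice can be sketched on toy data. The snippet below is a minimal illustration in plain Python, not the SAS code used for the project. The rows, the TRADES values, and the `inner_join` helper are hypothetical, and the sketch assumes MATCHKEY is unique in the CPR-like table.

```python
# Toy stand-ins for CPR (one row per customer) and PERF (many rows per customer).
cpr = [
    {"MATCHKEY": 1, "TRADES": 12},
    {"MATCHKEY": 2, "TRADES": 7},
    {"MATCHKEY": 3, "TRADES": 20},  # has predictors but no performance rows
]
perf = [
    {"MATCHKEY": 1, "CRELIM": 800, "DELQID": 0},
    {"MATCHKEY": 1, "CRELIM": 800, "DELQID": 2},
    {"MATCHKEY": 2, "CRELIM": 1500, "DELQID": 1},
    {"MATCHKEY": 4, "CRELIM": 3000, "DELQID": 3},  # has performance but no predictors
]


def inner_join(left, right, key):
    """Keep only rows whose key appears in both tables (key assumed unique in `left`)."""
    left_by_key = {row[key]: row for row in left}
    return [{**left_by_key[r[key]], **r} for r in right if r[key] in left_by_key]


merged = inner_join(cpr, perf, "MATCHKEY")
# MATCHKEY 3 (no performance data) and MATCHKEY 4 (no predictor data) are dropped,
# so every remaining row carries both predictors and the response.
```

Every row of `merged` has both a predictor (TRADES) and the performance fields (CRELIM, DELQID), which is the property the inner join was chosen for.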
After merging the data, it was soon discovered that the same MATCHKEY sometimes had
multiple DELQIDs as shown in Table 1 below.
Table 1: MATCHKEY/DELQID Problem Discovered (first 10 observations)
Obs   MATCHKEY   CRELIM   DELQID
1     1333324    800      0
2     1333324    800      0
3     1333324    800      0
4     1333324    800      1
5     1333324    800      1
6     1333324    800      1
7     1333324    800      2
8     1333324    800      3
9     1333324    800      4
10    1333324    800      5
[Diagram: CPR on the left, PERF on the right]


Because DELQID is the dependent variable, to continue the analysis, a single DELQID needed
to be assigned per MATCHKEY. Descriptions for the DELQID values follow.

• A DELQID of 0 indicated that either the individual had a new credit file or it was too
soon to tell if they would be a good customer or not.
• A DELQID of 1 indicated that the individual was in good standing.
• A DELQID of 2 indicated that the individual was one cycle late.
• A DELQID of 3 indicated that the individual was two cycles late. The variable continued
to follow this trend.
Options for deciding which DELQID to use included taking the best DELQID, worst DELQID,
median DELQID, mean DELQID or most recent DELQID. The most conservative approach was
to go with the worst DELQID. While this did increase the risk of making a Type II Error, or not
lending to a customer that would have paid back the money, it did decrease the risk of making a
Type I Error, or lending to a customer that would not have paid back the money.
For the procedure, the data was sorted by MATCHKEY and then by ascending DELQID. The
last DELQID, or worst DELQID, for each MATCHKEY was then kept and all others were
discarded. Each MATCHKEY now had a single DELQID value assigned as shown in Table 2
below.
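The sort-and-keep-last step can be sketched as follows. This is a minimal plain-Python illustration that assumes each record is a dictionary; the `worst_delqid` helper and the sample rows are hypothetical and are not the project's SAS code.

```python
def worst_delqid(rows):
    """Return one row per MATCHKEY, keeping the row with the highest DELQID."""
    # Sort by MATCHKEY, then by ascending DELQID, as described in the text.
    ordered = sorted(rows, key=lambda r: (r["MATCHKEY"], r["DELQID"]))
    last_seen = {}
    for row in ordered:
        # Later rows for the same MATCHKEY have a higher (worse) DELQID,
        # so the final assignment keeps the worst one.
        last_seen[row["MATCHKEY"]] = row
    return list(last_seen.values())


# Hypothetical performance history for two customers.
history = [
    {"MATCHKEY": 101, "DELQID": 1},
    {"MATCHKEY": 101, "DELQID": 3},
    {"MATCHKEY": 101, "DELQID": 0},
    {"MATCHKEY": 102, "DELQID": 1},
    {"MATCHKEY": 102, "DELQID": 1},
]
resolved = worst_delqid(history)
```

Taking the maximum DELQID per MATCHKEY directly would give the same result; the sort-based form is shown because it mirrors the procedure described above.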
Up until this point, the two datasets, CPR and PERF, had only been merged, and each MATCHKEY
had been assigned a single DELQID. Nothing had been done with regard to missing values, which
is why observation 6 in Table 2 contained missing values. The resulting dataset, after
merging CPR and PERF along with assigning a single DELQID to each MATCHKEY, had
1,255,429 observations and 357 variables.
Table 2: MATCHKEY/DELQID Problem Resolved (first 10 observations)
Obs   MATCHKEY   CRELIM   DELQID
1     1333324    800      6
2     1333329    1500     1
3     1333334    2000     0
4     1333410    3000     6
5     1333414    4400     0
6     1333433    -        -
7     1333437    1390     1
8     1333443    2250     1
9     1333463    10000    0
10    1333538    3000     2


Given that this would be a binary classification model using logistic regression, DELQID needed
to be reconfigured into a binary variable labeled “GOODBAD”. This new binary dependent
variable had a value of either “0” or “1”, where “0” was defined as a DELQID value of 0-2,
which was considered good, and “1” was defined as a DELQID value of 3 or greater, which was
considered bad. In other words, if a customer had a new credit file, was in good standing or was
only one cycle late, their identifier, MATCHKEY, received a GOODBAD value of “0”. If the
customer was two or more cycles late, their MATCHKEY received a GOODBAD value of
“1”. The result of this reconfiguration is shown in Table 3 below.
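The recoding rule is simple enough to state as a one-line function. The sketch below is in plain Python; the `goodbad` helper name is hypothetical, and the DELQID values are the ten shown in Table 3.

```python
def goodbad(delqid):
    """Return 0 (good) for DELQID 0-2 and 1 (bad) for DELQID 3 or greater."""
    return 1 if delqid >= 3 else 0


# DELQID values for the first 10 observations in Table 3.
delqid_values = [6, 1, 0, 6, 0, 1, 1, 0, 2, 0]
goodbad_values = [goodbad(d) for d in delqid_values]
```

The resulting list reproduces the GOODBAD column of Table 3.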
Table 3: DELQID After Being Reconfigured into a Binary Variable (first 10 observations)
Obs   MATCHKEY   CRELIM   DELQID   GOODBAD
1     1333324    800      6        1
2     1333329    1500     1        0
3     1333334    2000     0        0
4     1333410    3000     6        1
5     1333414    4400     0        0
6     1333437    1390     1        0
7     1333443    2250     1        0
8     1333463    10000    0        0
9     1333538    3000     2        0
10    133572     5500     0        0

Prior to the creation of this table, observations where DELQID was missing were deleted. This
explains why there are no missing observations in Table 3. In Table 4 below, the
descriptive statistics for the new response variable, GOODBAD, are listed. From the table it can
be seen that the majority of the response, or 82.43%, had a GOODBAD value of 0, while the
other 17.57% had a GOODBAD value of 1.

Table 4: Descriptive Statistics for GOODBAD
GOODBAD   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0         1034829     82.43     1034829                82.43
1         220600      17.57     1255429                100

Next on the list was the matter of coded values. For example, RMS variables 2 digits long
contained values that ranged from 0-99. However, only values 0-92 were valid numerical values,
where 92 represented all numerical values 92 or greater. The values 93-99 were coded, meaning
that they stood for a particular status, or were defined as something non-numerical. This was a
problem because the software, SAS, would read all values, including the coded ones, as
quantitative, which in many cases led to very misleading statistics (e.g. mean, median). In
Figure 1 below, when the variable AGE was left untouched, SAS gave what looked to be an
approximately normal distribution, aside from the stack of outliers to the right of the graph.
It seemed odd that there were so few customers in their 80s and 90s but so many that were said
to be 100 years old. This observation was common among variables and in many cases was more
extreme (e.g. AFR39, or the number of auto finance trades 60+ days past due, where well over
half of the values were coded). Deleting any observation that contained a coded value was not
an option because then there would be no observations left for analysis. For this reason, the
coded values needed to be imputed, or replaced with an actual numerical value that made sense.
Possibilities for imputation included the following.
• Stratified Imputation
• Regressed Imputation
• Mean-Based Imputation
• Median-Based Imputation
The best options would have been either stratified or regressed imputation, but with 300+
variables to impute this would have been nearly impossible. The two remaining possibilities were
mean-based and median-based imputation. Because most of the variables, excluding
coded values, had a skewed distribution, a median-based imputation was the most appropriate
approach. In Figure 2, AGE is shown after imputing the coded values with the median.
For this particular case, the imputation normalized the data to a certain extent by bringing the
mean AGE closer to the median AGE while getting rid of coded values without having to delete
any data. This can be seen in Table 5 below. One thing to note, aside from the mean getting
closer to the median and the standard deviation getting smaller post imputation, is the maximum,
which went from 99 to 91. This was another result of imputing the coded values with the median.
Figure 1: Histogram of AGE prior to Imputation
Figure 2: Histogram of AGE after Imputation
To avoid having to do this manually for each of the 300+ variables, a macro was used to
run through and impute values as needed for each variable, outputting a final dataset containing
only actual numerical values. The variables DELQID, MATCHKEY, GOODBAD, and CRELIM
were excluded from the macro because, unlike potential predictors, they served as identifiers,
and had they been imputed they would have lost all meaning. The two options that were
adjusted with each run of the macro were PCTREM, or percent removed, and MSTD, or max
standard deviation. PCTREM specified the threshold at which a variable was either kept or
discarded, based on its share of coded values. For example, a PCTREM value of 50% would
indicate that if more than 50% of the variable was coded, it needed to be removed from the
dataset. MSTD specified the number of standard deviations a value could be from the mean
before being imputed. Table 6 below shows the results from the 13 macro executions, along with
the specifications for the two options PCTREM and MSTD.
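The logic of the two options can be sketched for a single variable. The function below is a minimal plain-Python illustration, not the SAS macro itself: the `impute_variable` name and its signature are hypothetical, and it assumes that the mean, standard deviation, and median are computed on the non-coded values only and that the population standard deviation is used.

```python
from statistics import mean, median, pstdev


def impute_variable(values, coded, pctrem=0.38, mstd=4):
    """Return the imputed values, or None if the variable should be dropped.

    values : raw numeric values for one variable
    coded  : set of coded (non-numerical) values, e.g. 93-99 for a 2-digit field
    pctrem : drop the variable if more than this share of values is coded
    mstd   : impute values farther than this many standard deviations from the mean
    """
    n_coded = sum(v in coded for v in values)
    if n_coded / len(values) > pctrem:
        return None  # too much of the variable is coded: remove it

    # Assumption: summary statistics are taken over the valid values only.
    valid = [v for v in values if v not in coded]
    med = median(valid)
    mu, sd = mean(valid), pstdev(valid)

    imputed = []
    for v in values:
        is_extreme = sd > 0 and abs(v - mu) > mstd * sd
        imputed.append(med if (v in coded or is_extreme) else v)
    return imputed


coded_2_digit = set(range(93, 100))
kept = impute_variable([20, 30, 40, 50, 60, 99, 99], coded_2_digit)
dropped = impute_variable([1, 99, 99], coded_2_digit)
```

In the first call two of seven values are coded (about 29%), so the variable is kept and the coded values become the median of the valid ones. In the second call two of three values are coded, which exceeds the 38% threshold, so the variable is dropped.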
Table 5: Descriptive Statistics for AGE Before and After Imputation
         N         N Miss   Mean         Standard Deviation   Minimum   Median   Maximum
Before   1255429   0        48.0520237   15.6329305           17        47       99
After    1255429   0        47.6310528   14.9511702           17        47       91
Table 6: Macro Execution Results
Percent Removed   Max Standard Deviation   Variables Remaining
50%               4                        254
45%               4                        225
40%               4                        180
40%               3                        180
40%               2                        180
40%               1                        180
39%               4                        175
38%               4                        157
37%               4                        144
36%               4                        144
35%               4                        144
30%               4                        144
25%               4                        144


There did not appear to be any difference in the variables remaining whether the macro was
executed at a MSTD value of 1 or a MSTD value of 4, probably because there was not a
significant difference in the extreme values being imputed between a MSTD value of 1 and a MSTD
value of 4. However, it can be seen in Table 7 below that the descriptive statistics did differ
between MSTD values.
Although a MSTD value of 3 would have resulted in less spread or variation within variables, a
MSTD value of 4 was used so that the potential impact of outliers was not masked.
For the PCTREM, there was a clear break between 38% and 37%, below which decreasing the
percentage no longer made a difference in the variables remaining. This break can be viewed in
Table 6 above. Between these two percentages, there was a difference of 14
variables, listed in Table 8 below.
Table 8: Variables Lost Using 37% vs. 38% PCTREM
Variable   Description
DCCR39     NUMBER OF DEPT STORE TRADES/CURENTLY 60+ DAYS
DCCR49     NUMBER OF DEPT STORE TRADES/CURENTLY 90+ DAYS
DCCRATE7   NUMBER OF DEPT STORE ACCTS/CURRENTLY BAD DEBT
DCLAAGE    AGE OF DEPT STORE/LAST ACTIVITY
DCN90P24   NUMBER OF 90+,BAD DEBT/DEPT STORE IN 24 MONTHS
DCR29      NUMBER OF DEPT STOR TRADES/EVER 30 DAYS OR WORSE
DCR39      NUMBER OF DEPT STORE TRADES/EVER 60 DAYS OR WORSE
DCR49      NUMBER OF DEPT STORE TRADES/EVER 90 DAYS OR WORSE
DCR7924    NUMBER OF DEPT STORE ACCTS/BAD DEBT PAST 24 MO
DCR29P24   NUMBER OF DEPT STORE TRADES/RATD 2-9 RPTD IN 24 MO
DCR39P24   NUMBER OF DEPT STORE TRADES/RATD 3-9 RPTD IN 24 MO
DCRATE79   NUMBER OF DEPT STORE ACCTS/EVER BAD DEBT
DCTRADES   NUMBER OF DEPT STORE ACCTS
Table 7: Descriptive Statistics at 40% Imputed with 3 & 4 Standard Deviations
           3 Standard Deviations               4 Standard Deviations
Variable   Mean       Std Dev    Min   Max     Mean       Std Dev    Min   Max
BRBAL      5257.05    5774.08    0     3064    5545.46    6411.36    0     38782
CRDPTH     143.37     88.95      0     442     147.26     95.7       0     540
LOCINQS    1.62       1.85       0     9       1.72       2.08       0     12
TRADES     18.46      9.77       1     49      18.67      10.11      1     59
TSBAL      16258.08   15427.16   0     74632   16978.59   16807.17   0     93603


Following the review of this table, noting that the 14 variables were all related to department
store cards, it was decided that these variables might be useful in assessing credit risk. Therefore,
the PCTREM value was set at 38% to avoid the loss of these potential predictors.
For the final imputation, if a variable had values more than 4 standard deviations from the
mean, those values were imputed with the median. If more than 38% of the values for a
variable were coded, the variable was dropped. The main reason for this was that
imputing much more than 38% of total values with the median would reduce the variance
significantly, leaving little to work with and making it hard to draw meaningful
conclusions from the data.
Figures 3 and 4 below show the distribution of the variable DCTRADES, or the number of
department store accounts, before and after the macro. The problem with DCTRADES is similar
to that of the variable AGE, but on a much grander scale where about 38% of the data was
coded.
After imputing all of the coded values for each of the potential predictors, the 15 PERF variables
(including MATCHKEY and CRELIM), AGE and BEACON were removed from the dataset so
that a Variable Cluster Analysis could be performed. The reasons for the removal of these
variables were as follows:
• PERF variables were post-hoc performance data, meaning that the variables had no value
unless somebody had already established a line of credit.
• AGE is considered discriminatory and could not be used in the model.
• BEACON, the description of which was unknown because it was not listed in the RMS
Variables Spreadsheet, contained all missing values in the dataset.
Figure 3: Distribution of DCTRADES Before Macro
Figure 4: Distribution of DCTRADES After Macro


The analysis was initially performed using the SAS VARCLUS procedure at the
maximum of 140 clusters on the remaining 140 variables. As part of the diagnostic process, both
the dendrogram and the output in Figure 5 below indicated that 80 clusters would be a reasonable
number with which to perform the analysis. This value marked a change in rate between 60 and
100 clusters, as the slope flattened. It also explained ~90% of the variation. By default, if the
option “Maxclusters=” is omitted from the VARCLUS procedure, SAS will optimize the
number of clusters created. For this particular case, 31 was the optimal number of clusters,
determined by the second eigenvalue being less than one for each group of variables (Liau, Tan
and Khoo, 2011)¹. For the purposes of this course, however, a suitable minimum number of
variables was needed to continue into the next phase of the modeling process.
Following the output of the 80 clusters, the variable in each cluster with the lowest 1-R² Ratio
was selected. The lower the 1-R² Ratio, the better that variable represents the information in
its cluster. Table 9 below is an output of the first 8 clusters, with the variable having the
lowest 1-R² Ratio in each cluster highlighted.

¹ Liau, A., Tan, T., & Khoo, A. (2011). Scale Measurement: Comparing Factor Analysis and Variable Clustering.
Figure 5: Distribution Curve of Variable Cluster Analysis


Also in this table are the R-squared with Own Cluster and the R-squared with Next Closest.
R-squared with Own Cluster was the amount of variance within that cluster explained by the
variable. R-squared with Next Closest was the amount of variance in the next closest cluster
explained by the variable. The 1-R² Ratio is (1 − R-squared with Own Cluster) divided by
(1 − R-squared with Next Closest), so for the ratio to be low, a variable
needed to have a high R-squared with Own Cluster value and a low R-squared with Next Closest
value. In other words, the best representative variable for a cluster needs to be able to explain the
majority of variance for that cluster and little to no variance for the next closest cluster.
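The ratio can be checked numerically against the values reported for cluster 3 in Table 9. The sketch below uses the usual definition, (1 − R² own) / (1 − R² next closest); the `one_minus_r2_ratio` helper name is hypothetical, and small differences from the table are expected because the inputs are already rounded.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)


# Cluster 3 from Table 9: (variable, R^2 with own cluster, R^2 with next closest).
cluster_3 = [
    ("CRATE79", 0.9522, 0.6085),
    ("TRATE79", 0.9523, 0.6097),
    ("TRCR39", 0.9552, 0.7232),
    ("TRCR49", 0.9747, 0.7040),
    ("TRR49", 0.9167, 0.8119),
]
ratios = {name: one_minus_r2_ratio(own, nxt) for name, own, nxt in cluster_3}

# The cluster representative is the variable with the lowest ratio.
representative = min(ratios, key=ratios.get)
```

For this cluster the lowest ratio belongs to TRCR49, which agrees with the 0.0855 reported in the table.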
Table 9: Variable Cluster Analysis at 80 Cluster Max (first 8 clusters)
                       R-Squared with
Cluster   Variable     Own Cluster   Next Closest   1-R² Ratio   Variable Label
1         BRN90P24*    0.9285        0.6214         0.1889       brn90p24
          BRR39P24     0.9285        0.6236         0.19         brr39p24
2         BRCRATE1*    0.9533        0.7265         0.1707       brcrate1
          BRRATE1      0.8853        0.7536         0.4654       brrate1
3         CRATE79      0.9522        0.6085         0.1222       crate79
          TRATE79      0.9523        0.6097         0.1221       trate79
          TRCR39       0.9552        0.7232         0.1619       trcr39
          TRCR49*      0.9747        0.704          0.0855       trcr49
          TRR49        0.9167        0.8119         0.4427       trr49
4         TOPEN12      0.8658        0.4535         0.2456       topen12
          TOPEN24*     0.8658        0.3365         0.2023       topen24
5         DCCR49       0.9383        0.6677         0.1857       dccr49
          DCCRATE7     0.9841        0.6427         0.0445       dccrate7
          DCRATE79*    0.9843        0.6436         0.0442       dcrate79
6         TOPENB50     0.9516        0.6277         0.1301       topenb50
          TOPENB75*    0.9516        0.6268         0.1298       topenb75
7         BRR324       0.8515        0.5044         0.2996       brr324
          BRRATE3*     0.8515        0.4324         0.2616       brrate3
8         BRR4524      0.8852        0.5106         0.2345       brr4524
          BRRATE45*    0.8852        0.4901         0.2251       brrate45
* Variable with the Lowest 1-R² Ratio

Next was the matter of multicollinearity. Multicollinearity was a problem because if more than
one variable explaining the same information was represented in the model, the signs of the beta
coefficients may have been reversed. This could have been very dangerous, causing the predicted
response to be wrong and leading towards a decision that might be very costly in the long
run. To try to reduce as much error as possible, Variance Inflation Factors (VIF) were analyzed
to help identify redundant variables. A VIF of 10 has been a common threshold in practice and is
suggested by some, Chatterjee & Price (1991)² for example, to be large enough to indicate a
potential problem. For this reason it was decided to keep variables with a VIF of 10 or less. To
find these variables, PROC REG was run with the /VIF option. The output was
then exported into Excel, the variables were sorted by VIF in ascending order, and any found
with a VIF greater than 10 were discarded, as seen in Table 8. These variables were then
matched to the 80 clustered variables with the lowest 1-R² Ratio. During this step 15 variables
did not match for either or both of these reasons.
• The variable had a VIF larger than 10.
• The variable did not have the lowest 1-R² Ratio value in its respective cluster.
Table 10 below displays a sample output of the variables, including their VIF and cluster in
which they belonged.
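For intuition, the VIF of predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the others. The sketch below covers only the two-predictor special case, where R²ⱼ equals the squared Pearson correlation between the two predictors. It is a plain-Python illustration with made-up data and hypothetical helper names, not the PROC REG output used in the project.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]            # moderately correlated with x1 (r = 0.8)
x3 = [1.0, 2.1, 2.9, 4.0, 5.1]  # nearly a copy of x1

vif_moderate = vif_two_predictors(x1, x2)
vif_redundant = vif_two_predictors(x1, x3)
```

With r = 0.8 the VIF is 1 / (1 − 0.64), well under the threshold of 10, while the near-duplicate pair far exceeds it and would be flagged as redundant.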
Something interesting to note about the 65 variables that remained in the dataset was that they
included DCTRADES, as shown in Table 8 above. Earlier, in the data cleansing and imputation
stage, there was the option of allowing the macro to keep variables that were either 38% or 37%
imputed. At the time, the decision was made to continue with 38%, which retained the 14
variables listed in Table 8 that were thought to possibly be useful in the model. DCTRADES was
one of those variables and a perfect example of the cascading effect that certain decisions have
throughout the process.

² Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Table 10: Sample Variance Inflation Factor Output
Variable   VIF       Cluster Found
BRMINB     1.54646   47
DCWCRATE   1.5761    31
DCTRADES   1.59662   39
BRMINH     1.6268    23
COLLS      1.66598   25

Variable Preparation
After all the variables had been cleaned and the matter of multicollinearity was addressed, the
distribution for each variable needed to be reviewed. Most of the variables did not have normal
distributions and needed to be transformed in order to find the optimal mathematical form in
which the variable could predict the response, GOODBAD. There were two different ways this
was done.
• Discretization 1, which was user-defined and exercised equal widths logic.
• Discretization 2, which was SAS-defined and exercised equal frequencies logic.
For each of these approaches, three different monotonic transformations took place.
• Ordinal
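• Odds $= \dfrac{p}{1-p}$, where $p$ is the proportion of GOODBAD = 1 within a rank
• Logodds $= \log\left(\dfrac{p}{1-p}\right)$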
The following is the process that was used to transform each of the 65 potential predictors that
currently remained in the dataset, with a few exceptions that will be mentioned towards the end.
For this demonstration, the variable, AVGMOS, or the number of months the account has been
open, will be used. Note the process did start with Discretization 2 as it was much more
involved.
Before any transformation took place, the descriptive statistics for the variable were looked over
to make sure that the variable was in fact clean. Below in Table 11, are the descriptive statistics
for the variable, AVGMOS. Some things to look for were missing values, coded values and the
difference between the mean and median. For the variable, AVGMOS, there were no missing
values, no coded values (values 193-199), and the mean and median were relatively close to one
another. After checking all of these attributes, the variable, AVGMOS, could undergo
transformation.
Table 11: Descriptive Statistics for AVGMOS
N         N Miss   Mean     Standard Deviation   Minimum   Median   Maximum
1255429   0        58.263   29.883               0         56       182

Discretization 2: SAS-Defined
For the ordinal transformation, the data first needed to be sorted in ascending order. It was
then divided into a specified number of groups (for this analysis the default of 10 was used), each
with an equal frequency count of observations. In other words, each of the 10 groups, or ranks,
should have 10% of the total observations within it. SAS, however, was not always able to
capture exactly 10% of the data, as shown in Table 12 below.
While each rank for the variable AVGMOS was close to 10%, the ranks did vary slightly. For some
variables, the percentage of data in each rank varied a lot. The main reason for this was
ties, for example when more than 10% of a variable was imputed, giving more than 10% of the
data the same value. SAS, not being able to break those values up, had to compensate and
distribute the remaining data as best it could over the remaining ranks.
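The effect of ties on equal-frequency ranks can be illustrated with a small sketch. The function below assigns tied values their average rank and then maps ranks to groups with floor(rank × groups / (n + 1)). That mapping is one common rule for equal-frequency grouping; whether it matches the exact SAS settings used here is an assumption, and the `equal_frequency_ranks` helper and the data are hypothetical.

```python
def equal_frequency_ranks(values, groups=10):
    """Assign each value a group 0..groups-1; tied values always share a group."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # int() floors here because every rank is positive.
    return [int(r * groups / (n + 1)) for r in ranks]


# 20 distinct values split evenly: two observations per group.
distinct_groups = equal_frequency_ranks(list(range(1, 21)))

# Half of the observations share one value (as after heavy median imputation).
tied_values = [5] * 10 + list(range(6, 16))
tied_groups = equal_frequency_ranks(tied_values)
```

In the second example the ten tied observations all land in a single group, so some groups hold far more than 10% of the data and others are empty, which is the behavior described above.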
Table 12: SAS-Defined Ranks for AVGMOS
RANK   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0      118172      9.41      118172                 9.41
1      129966      10.35     248138                 19.77
2      122021      9.72      370159                 29.48
3      131796      10.5      501955                 39.98
4      111748      8.9       613703                 48.88
5      135386      10.78     749089                 59.67
6      123489      9.84      872578                 69.5
7      129508      10.32     1002086                79.82
8      126079      10.04     1128165                89.86
9      127264      10.14     1255429                100

Table 13: Summary of SAS-Defined Ranks for AVGMOS
RANK   avg_indp   avg_dep   std_indp   std_dep   min_indp   min_dep   max_indp   max_dep
0      14         0.21029   4          0.40751   0          0         20         1
1      26         0.1892    3          0.39167   21         0         30         1
2      35         0.19507   3          0.39626   31         0         39         1
3      44         0.19612   3          0.39706   40         0         48         1
4      52         0.18726   2          0.39012   49         0         55         1
5      59         0.17673   2          0.38144   56         0         63         1
6      67         0.17325   2          0.37847   64         0         71         1
7      76         0.16318   3          0.36953   72         0         81         1
8      88         0.14845   4          0.35555   82         0         96         1
9      117        0.12109   18         0.32624   97         0         182        1

Table 13 summarizes the ranks defined in Table 12. This table was important with
regard to the interpretation of the variable, in this case AVGMOS, and its relationship to the
dependent variable, GOODBAD. The following were particular statistics of note.

16

• avg_indp, in this case, was the average number of months an account had been open for
that rank.
• avg_dep was the average probability of default for that rank.
From rank 0 it can be seen that customers whose accounts had been open between 0 and 20
months had, on average, a 21.03% chance of defaulting. Following avg_indp and avg_dep
down the ranks, a trend appeared.
• As the average number of months an account had been open increased, the probability
of default decreased.
This was a good example of the kind of relationship a potential predictor needed to have with
the response variable in order to successfully predict the probability of default.
After the initial ranks had been defined, again using equal frequency logic, a SAS macro was
implemented to remove any non-meaningful differences in the variable. Because the only
differences of interest were sequential differences, consecutive t-tests were performed between
each rank and the rank following it. If there was a significant difference between the two ranks,
they would remain two separate groups. If, however, there was not a significant difference, the
two ranks would be combined into one, which then became the rank to be tested against the
following rank. Table 14 below is the summary of the newly defined ranks after all the t-tests
had been performed and each meaningful sequential difference had been noted. It can be seen
that because there was not a statistically significant difference between ranks 2 and 3, rank 2
was collapsed onto rank 3.
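The collapsing logic described above can be sketched in Python rather than SAS. This is only an illustration under stated assumptions, not the macro used in the analysis: the significance level of 0.05 is assumed (the report does not state one), the t-test is approximated with a normal distribution (reasonable with more than 100,000 observations per rank), and the function names are hypothetical. The inputs are the rank frequencies from Table 12 and the avg_dep and std_dep columns from Table 13.

```python
import math

# (rank, n, mean of GOODBAD, std of GOODBAD), taken from Tables 12 and 13.
RANKS = [
    (0, 118172, 0.21029, 0.40751),
    (1, 129966, 0.18920, 0.39167),
    (2, 122021, 0.19507, 0.39626),
    (3, 131796, 0.19612, 0.39706),
    (4, 111748, 0.18726, 0.39012),
    (5, 135386, 0.17673, 0.38144),
    (6, 123489, 0.17325, 0.37847),
    (7, 129508, 0.16318, 0.36953),
    (8, 126079, 0.14845, 0.35555),
    (9, 127264, 0.12109, 0.32624),
]

ALPHA = 0.05  # assumed significance level; the report does not state one


def p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(a["std"] ** 2 / a["n"] + b["std"] ** 2 / b["n"])
    z = abs(a["mean"] - b["mean"]) / se
    return math.erfc(z / math.sqrt(2))


def merge(a, b):
    """Pool two groups into one, combining their summary statistics."""
    n = a["n"] + b["n"]
    mean = (a["n"] * a["mean"] + b["n"] * b["mean"]) / n
    ss = ((a["n"] - 1) * a["std"] ** 2 + a["n"] * (a["mean"] - mean) ** 2
          + (b["n"] - 1) * b["std"] ** 2 + b["n"] * (b["mean"] - mean) ** 2)
    return {"ranks": a["ranks"] + b["ranks"], "n": n, "mean": mean,
            "std": math.sqrt(ss / (n - 1))}


def collapse(ranks, alpha=ALPHA):
    """Walk the ranks in order, merging neighbours that do not differ."""
    groups = []
    for rank, n, mean, std in ranks:
        current = {"ranks": [rank], "n": n, "mean": mean, "std": std}
        if groups and p_value(groups[-1], current) >= alpha:
            groups[-1] = merge(groups[-1], current)
        else:
            groups.append(current)
    return groups


groups = collapse(RANKS)
for g in groups:
    print(g["ranks"], g["n"], round(g["mean"], 5))
```

Under these assumptions only ranks 2 and 3 are merged, and the pooled default rate of the merged group comes out close to the 0.19562 reported for rank 3 in Table 14.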
In Figure 6 below, a small dip occurred from rank 0 to rank 3. From rank 3 to rank 9, however,
there was a consistent slope with a consistent direction, the first of two objectives to be achieved
through the ordinal transformation. The other was spread, or the difference in probability of
default from the lowest rank to the highest. Looking at the same figure, a spread of approximately
9% (21.03% versus 12.11%) can be found between rank 0 and rank 9.
Table 14: Summary of SAS-Defined Ranks for AVGMOS After Macro
RANK  avg_indp  avg_dep  std_indp  std_dep  min_indp  min_dep  max_indp  max_dep  pvalue
0     14        0.21029  3         0.40751  0         0        20        1        0
1     26        0.1892   3         0.39167  21        0        30        1        0.000186
3     40        0.19562  2         0.39668  31        0        48        1        0
4     52        0.18726  2         0.39012  49        0        55        1        0
5     59        0.17673  2         0.38144  56        0        63        1        0.020006
6     67        0.17325  3         0.37847  64        0        71        1        0
7     76        0.16318  4         0.36953  72        0        81        1        0
8     88        0.14845  18        0.35555  82        0        96        1        0
9     117       0.12109  18        0.32624  97        0        182       1        0
Overall, the variable AVGMOS looked as if it might be a reasonable predictor for the response,
GOODBAD. One additional process was performed before finalizing the ranks and assigning the
ordinal codes. Looking again at Figure 6, it can be seen that because of rank 1, there
was not a consistent trend across the graph, which would have been ideal. Because the
spread between ranks 1 and 3 was not more than 1%, they were collapsed into one, as shown in
Figure 7 below.
Figure 6: Plot of SAS-Defined Ranks for AVGMOS After Macro
Figure 7: Plot of SAS-Defined Ranks for AVGMOS After Collapse
After the ranks had been finalized, they were assigned ordinal codes. This needed to be
done for two reasons.
1. If an ordinal variable with a meaningful beta coefficient had a first rank equal to 0, the
variable would drop out. Hence, the first rank could not be coded as 0.
2. Because this was an ordinal variable, the ranks had to be consecutive or the
transformation would not be valid.
The assignment of ordinal codes for the variable ORDEQAVGMOS is shown below, satisfying
these two requirements.
• If rank = 0 then ORDEQAVGMOS = 1
It was by these ordinal codes that the ordinal version of the variable, in this case
ORDEQAVGMOS, was defined. The results of the ordinal transformation can be found in Table
15 below.
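A short Python sketch of the recoding follows. Only the first rule (rank 0 becomes 1) is stated explicitly in the text; the remaining assignments are an inference from the fact that the rank frequencies in Table 12 reproduce the frequencies in Table 15 when ranks 1, 2 and 3 share one code. The dictionary and function names are hypothetical.

```python
# Assumed mapping from the original SAS-defined rank to the ordinal code.
# Ranks 1, 2 and 3 share one code after the two collapses described above.
RANK_TO_ORDINAL = {0: 1, 1: 2, 2: 2, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8}

# Frequency of each original rank, from Table 12.
RANK_FREQUENCY = {
    0: 118172, 1: 129966, 2: 122021, 3: 131796, 4: 111748,
    5: 135386, 6: 123489, 7: 129508, 8: 126079, 9: 127264,
}


def ordinal_frequencies(rank_to_ordinal, rank_frequency):
    """Total the rank frequencies under each ordinal code."""
    totals = {}
    for rank, code in rank_to_ordinal.items():
        totals[code] = totals.get(code, 0) + rank_frequency[rank]
    return totals


frequencies = ordinal_frequencies(RANK_TO_ORDINAL, RANK_FREQUENCY)
codes = sorted(frequencies)

# Both requirements: codes start at 1 (not 0) and are consecutive.
print(codes == list(range(1, len(codes) + 1)))
print(frequencies)
```

The totals produced this way can be compared line by line with the Frequency column of Table 15.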
The next transformation of interest was an odds transformation. This odds transformation was
performed on the newly created ordinal variable, in this case ORDEQAVGMOS, and was
defined as follows.
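Consistent with the values reported in Table 16, the odds for each ordinal level appear to be the average default rate of that level divided by its complement. In the sketch below, \(p_k\) is an assumed symbol for the mean of GOODBAD within level \(k\) of ORDEQAVGMOS.

```latex
\[
\text{odds}_k = \frac{p_k}{1 - p_k}
\]
```

For the first level, for example, \(p_1 = 0.21029\) gives \(0.21029 / 0.78971 \approx 0.2663\), in line with the largest value listed in Table 16.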
Table 15: Ordinal Transformation for AVGMOS (SAS-Defined)
ORDEQAVGMOS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1            118172     9.41     118172                9.41
2            383783     30.57    501955                39.98
3            111748     8.9      613703                48.88
4            135386     10.78    749089                59.67
5            123489     9.84     872578                69.5
6            129508     10.32    1002086               79.82
7            126079     10.04    1128165               89.86
8            127264     10.14    1255429               100
The odds for the variable, ORDEQAVGMOS, can be found in Table 16 below.
The last transformation of interest was a log odds transformation. This was an attempt to
linearize the odds relationship with the response, and can be defined as follows.
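Judging from the values in Tables 16 and 17, the log of the odds is the natural logarithm of the odds defined for each ordinal level. As before, \(p_k\) is an assumed symbol for the mean of GOODBAD within level \(k\).

```latex
\[
\text{log odds}_k = \ln\!\left(\frac{p_k}{1 - p_k}\right)
\]
```

As a check, \(\ln(0.266282) \approx -1.3232\), which agrees with the corresponding entries of Tables 16 and 17.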

The log of the odds for the variable, ORDEQAVGMOS, can be found in Table 17 below.
These same three transformations: ordinal, odds and log odds, were then performed using
Discretization 1, described in the following pages.
Table 16: Odds Transformation for ORDEQAVGMOS (SAS-Defined)
ODSEQAVGMOS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0.137779049  127264     10.14    127264                10.14
0.174335426  126079     10.04    253343                20.18
0.194998847  129508     10.32    382851                30.5
0.209561776  123489     9.84     506340                40.33
0.214670866  135386     10.78    641726                51.12
0.23040673   111748     8.9      753474                60.02
0.243189366  383783     30.57    1137257               90.59
0.266282334  118172     9.41     1255429               100
Table 17: Log Odds Transformation for ORDEQAVGMOS (SAS-Defined)
LODSEQAVGMOS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
-1.982103969  127264     10.14    127264                10.14
-1.7467741    126079     10.04    253343                20.18
-1.634761635  129508     10.32    382851                30.5
-1.562736708  123489     9.84     506340                40.33
-1.538649282  135386     10.78    641726                51.12
-1.467909142  111748     8.9      753474                60.02
-1.413914857  383783     30.57    1137257               90.59
-1.323198126  118172     9.41     1255429               100
Discretization 1: User-Defined
For the ordinal transformation, a histogram of the variable was observed to determine two
aspects of interest.
1. The distribution and how the values of that variable were spread over the x-axis
2. The range of the x-axis
The above information was then used to determine the number of ranks that would be defined, as
well as the width of each rank. Figure 8 below is a histogram of the variable AVGMOS. It
can be seen that the variable had a fairly strong right-skewed distribution and a range of 0 to 182
months. The number of ranks needed to be small enough that the information was not too
cumbersome, but large enough that no important information would be missed. For the variable
AVGMOS, 10 ranks seemed to meet these criteria.
Figure 8: Histogram of AVGMOS
Next was the matter of deciding upon the width of each rank. While equal widths would have
been ideal, they are not appropriate for variables where the majority of the data falls on one
side. If all the widths were set equal for a variable such as AVGMOS, some ranks could contain
around 25% of the data while others contained less than 5%. For this reason, in the case of
AVGMOS, the lower ranks, on the left side where the majority of the data was, were given smaller
widths, and the higher ranks, on the right side, were given larger widths. The defined rank widths
for AVGMOS resulted in the frequencies found below in Table 18, where the descriptive statistics
for each rank are also displayed.
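As a small illustration of the user-defined ranks, the Python sketch below assigns a rank from an AVGMOS value using the rank boundaries implied by the Minimum and Maximum columns of Table 18. The function and constant names are hypothetical, and the code is not the SAS code used in the analysis.

```python
from bisect import bisect_right

# Lower bound (in months) of each user-defined rank, read from the Minimum
# column of Table 18. The last rank runs through the observed maximum of 182.
LOWER_BOUNDS = [0, 15, 30, 45, 60, 75, 90, 105, 120, 140]
MAX_MONTHS = 182


def user_defined_rank(avgmos):
    """Return the user-defined rank (1 to 10) for an AVGMOS value in months."""
    if avgmos < 0:
        raise ValueError("AVGMOS cannot be negative")
    # Count how many lower bounds are less than or equal to the value.
    return bisect_right(LOWER_BOUNDS, avgmos)


# Width of each rank in months: equal widths on the left, wider on the right.
upper_bounds = [b - 1 for b in LOWER_BOUNDS[1:]] + [MAX_MONTHS]
widths = [hi - lo + 1 for lo, hi in zip(LOWER_BOUNDS, upper_bounds)]
print(widths)
print(user_defined_rank(96))
```

Only the two right-most ranks are wider than the rest, which is where the tail of the right-skewed distribution is thinnest.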
Table 18: Descriptive Statistics of the User-Defined Ranks for AVGMOS
ORDAVGMOS  N Obs   N       Mean        Std Dev     Minimum  Maximum
1          54331   54331   10.2065303  2.8664601   0        14
2          180627  180627  22.4557071  4.2157593   15       29
3          206636  206636  37.127456   4.3294262   30       44
4          241613  241613  52.1963802  4.2963834   45       59
5          232273  232273  66.7771717  4.3066704   60       74
6          163718  163718  81.368982   4.2693602   75       89
7          89023   89023   96.1588241  4.2635659   90       104
8          44035   44035   111.107664  4.2797616   105      119
9          26794   26794   128.073412  5.6539986   120      139
10         16379   16379   154.744063  11.4205595  140      182
In Table 19 below, the descriptive statistics for the response variable, GOODBAD, by rank, can
be found.
Table 19: Descriptive Statistics of GOODBAD by User-Defined Ranks for AVGMOS
ORDAVGMOS  N Obs   N       Mean       Std Dev    Minimum  Maximum
1          54331   54331   0.2234268  0.4165458  0        1
2          180627  180627  0.1922968  0.3941061  0        1
3          206636  206636  0.1966501  0.3974665  0        1
4          241613  241613  0.1854991  0.3887027  0        1
5          232273  232273  0.1729474  0.3782026  0        1
6          163718  163718  0.1571055  0.3639013  0        1
7          89023   89023   0.1387057  0.3456411  0        1
8          44035   44035   0.1211763  0.3263358  0        1
9          26794   26794   0.1118907  0.3152378  0        1
10         16379   16379   0.1037304  0.3049198  0        1
Using the information from Table 18 and Table 19, some inferences concerning the relationship
between the variable, AVGMOS, and the response, GOODBAD, can be made.
Looking at Tables 18 and 19, it can be seen that customers whose accounts had been open
between 90 and 104 months had, on average, a 13.87% chance of defaulting. A trend similar to
the one observed in the SAS-defined ranks with relation to the response variable, GOODBAD,
can be stated as follows.
• As the average number of months an account had been open increased, the probability
of default decreased.
The above trend can be better visualized in Figure 9 below.
Similar to what was seen in the plot of SAS-defined ranks after the macro, there was a dip in the
lower ranks prior to the consistent trend seen afterward. Again, because the spread between the
ranks, in this case ranks 2 and 3, was no more than 1%, rank 2 was collapsed onto rank 3. This
resulted in the output of Figure 10 below, where there was now a consistent trend across all of
the ranks.
Figure 9: Plot of User-Defined Ranks for AVGMOS
The ranks defined above then became the ordinal codes for the variable AVGMOS; it was by
these ordinal codes that the ordinal version of the variable, ORDAVGMOS, was defined. The
results of the ordinal transformation can be found in Table 20 below.
Next was the odds transformation. Again, this was performed on the newly created ordinal
variable, ORDAVGMOS, and was defined as follows.

Lastly was the log odds transformation. Again, this was an attempt to linearize the odds
relationship with the response, and can be defined as follows.
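Both transformations can be checked numerically against the values reported in Table 21. The Python sketch below assumes that the odds are the average default rate divided by its complement and that the log odds are the natural logarithm of that ratio. The function names are hypothetical, and small differences are expected because avg_dep is rounded in the table.

```python
import math

# (ORDAVGMOS, avg_dep, ODSAVGMOS, LODSAVGMOS) as reported in Table 21.
TABLE_21 = [
    (1, 0.22343, 0.28771, -1.24581),
    (2, 0.19462, 0.24165, -1.42027),
    (3, 0.18550, 0.22775, -1.47953),
    (4, 0.17295, 0.20911, -1.56488),
    (5, 0.15711, 0.18639, -1.67992),
    (6, 0.13871, 0.16104, -1.82608),
    (7, 0.12118, 0.13788, -1.98134),
    (8, 0.11189, 0.12599, -2.07157),
    (9, 0.10373, 0.11574, -2.15645),
]


def odds(p):
    """Odds of default for an average default rate p."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds of default."""
    return math.log(odds(p))


# Largest absolute gap between the recomputed and the reported values.
max_odds_gap = max(abs(odds(p) - reported) for _, p, reported, _ in TABLE_21)
max_log_gap = max(abs(log_odds(p) - reported) for _, p, _, reported in TABLE_21)

for level, p, _, _ in TABLE_21:
    print(level, round(odds(p), 5), round(log_odds(p), 5))
```

Because the log odds decrease steadily with the ordinal level, the transformed variable enters a logistic model on the same scale as the linear predictor.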
Table 20: Ordinal Transformation for AVGMOS (User-Defined)
ORDAVGMOS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1          54331      4.33     54331                 4.33
2          387263     30.85    441594                35.17
3          241613     19.25    683207                54.42
4          232273     18.5     915480                72.92
5          163718     13.04    1079198               85.96
6          89023      7.09     1168221               93.05
7          44035      3.51     1212256               96.56
8          26794      2.13     1239050               98.7
9          16379      1.3      1255429               100
Figure 10: Plot of User-Defined Ranks for AVGMOS After Collapse
The odds and log odds for the variable, ORDAVGMOS, can be found in Table 21 below.
While this was the process applied to the majority of the 65 potential predictors, there were a few
exceptions.
Variables such as ORRATE3, shown in Figure 11 below, were inherently binary in nature.
Because of this, it did not make much sense to perform an ordinal transformation. Even an
attempt to rank a binary variable in an equal-frequency fashion with PROC RANK would fail,
which ruled out the Discretization 2 transformations. Therefore, the variable could only
undergo an odds and log odds transformation on the raw variable in Discretization 1.
Variables such as OT3PTOT, shown in Figure 12 below, were converted into binary variables
because, again, the ordinal transformation did not make much sense. In the case of OT3PTOT,
70% of the data shared the same value, 0. Because of this, SAS could only break the
data into 3 ranks: the first, which captured 70% of the data, and the following two, which
captured the remaining 30%. This could not follow the equal frequency logic intended for the
Discretization 2 transformations. The Discretization 1 ordinal transformation had the same issue
in that, no matter what width the ranks were given, the first rank would always capture at least
70% of the data. The decision was then made to convert the variable into binary form, where an
observation could take on a value of either “0” or “>0”. The new binary variable would then
undergo an odds and log odds transformation on the raw variable in Discretization 1.
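The effect of heavy ties on equal-frequency ranking can be illustrated with a toy example in Python. The grouping rule used here, floor(mean rank × groups / (n + 1)) with tied values sharing their mean rank, is an assumption chosen for illustration and may not match the exact PROC RANK settings used in the analysis. The ten toy values are made up and are not drawn from OT3PTOT.

```python
import math
from collections import Counter


def equal_frequency_groups(values, groups):
    """Assign each value to a group; tied values share their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    assigned = [0] * n
    start = 0
    while start < n:
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end + 2) / 2  # mean of 1-based ranks in the run
        group = math.floor(mean_rank * groups / (n + 1))
        for pos in range(start, end + 1):
            assigned[order[pos]] = group
        start = end + 1
    return assigned


# Toy data: 70% of the observations share the value 0.
toy = [0] * 7 + [1, 2, 3]
counts = Counter(equal_frequency_groups(toy, groups=10))
print(sorted(counts.items()))

# The binary recode described above: 0 stays 0, anything positive becomes 1.
binary = [1 if v > 0 else 0 for v in toy]
print(Counter(binary))
```

Even though ten groups are requested, all tied zeros land in a single group, so the groups cannot be of equal size. This is the same obstacle described for OT3PTOT.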
Table 21: Odds and Log odds Transformations for ORDAVGMOS (User-Defined)
ORDAVGMOS  _FREQ_  avg_ind  avg_dep  ODSAVGMOS  LODSAVGMOS
1          54331   10       0.22343  0.28771    -1.24581
2          387263  30       0.19462  0.24165    -1.42027
3          241613  52       0.1855   0.22775    -1.47953
4          232273  67       0.17295  0.20911    -1.56488
5          163718  81       0.15711  0.18639    -1.67992
6          89023   96       0.13871  0.16104    -1.82608
7          44035   111      0.12118  0.13788    -1.98134
8          26794   128      0.11189  0.12599    -2.07157
9          16379   155      0.10373  0.11574    -2.15645
Figure 11: Histogram of ORRATE3
Figure 12: Histogram of OT3PTOT
There were a few cases in which a variable displayed two distinctly different relationships, as
with BRMINB, whose distribution is shown in Figure 13 below. After defining the widths of the
ranks and plotting the newly defined variable, as shown in Figure 14, the two relationships were
clear. From rank 1 to rank 3 there was a positive relationship with a spread of approximately
17%, and from rank 3 to rank 5 there was a negative relationship with a spread of approximately
9%. So not only were the two relationships distinctly different, they were also strong. Variables
that displayed this kind of pattern were noted because if, after further analysis, they showed
significance in predicting the response, GOODBAD, they were likely to be split into two
different variables.
Finally, certain variables were simply inconsistent in nature, as with BRNEW, shown below in
Figure 15. Figure 16 then displays a plot of the SAS-defined ranks, where it can be seen that
there was nothing consistent about the trend or spread. This was an issue and a fairly strong
indicator that the variable would perform poorly in predicting the response, GOODBAD. It was
likely that this variable would not be used in the final model.
After all variables had been transformed, the dataset had a new total of 451 variables.
Figure 13: Histogram of BRMINB
Figure 14: Plot of User-Defined Ranks for BRMINB
Figure 15: Histogram of BRNEW
Figure 16: Plot of SAS-Defined Ranks for BRNEW
Modeling
Before the modeling process began, the data was split in two, resulting in a training dataset,
which was used to build the model, and a validation dataset, which was used to score the data.
One reason for this was to help generalize the model by making sure that it was not over-fitted,
or contorted to accommodate the influential observations specific to the data from which the
model was built. This was done by scoring the trained model with the validation dataset and
seeing if the results were approximately the same, implying that the model was stable. The
training dataset was created by pulling a simple random sample from the master dataset, which
was now comprised of 451 variables and 1,255,429 observations. Because the proportion of
GOODBAD needed to be the same in both datasets, the data was first sorted on the variable
GOODBAD, and a seed was assigned so that the sample could be recreated if need be. As for how
much data would comprise the sample, given that neither dataset should be significantly larger
than the other, a 40/60 split was used, where 40% of the data became the training dataset and
the remaining 60% became the validation dataset. This resulted in a training dataset of 502,416
observations and a validation dataset of 753,013 observations.
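A split of this kind can be sketched in Python as follows. The sketch assumes that sorting on GOODBAD served to sample each class separately so that the default rate is preserved in both datasets. The function name, the seed value and the toy labels are hypothetical, and the SAS procedure actually used is not shown in this section.

```python
import random


def stratified_split(labels, train_fraction=0.40, seed=12345):
    """Return (train_idx, valid_idx), sampling each GOODBAD class separately."""
    rng = random.Random(seed)  # fixed seed so the sample can be recreated
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train = []
    for y in sorted(by_class):
        idx = by_class[y]
        k = round(len(idx) * train_fraction)
        train.extend(rng.sample(idx, k))  # simple random sample within class

    train_set = set(train)
    valid = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), valid


# Toy labels (not the report's data): 20% defaults among 1,000 customers.
labels = [1] * 200 + [0] * 800
train, valid = stratified_split(labels)

train_rate = sum(labels[i] for i in train) / len(train)
valid_rate = sum(labels[i] for i in valid) / len(valid)
print(len(train), len(valid), train_rate, valid_rate)
```

Calling the function again with the same seed returns the same training indices, which is the property the seed was meant to guarantee.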
It was from this training dataset that the logistic model was built using PROC LOGISTIC. A
backward selection was run on all 451 variables, each iteration deleting any variables that were
insignificant in predicting the probability of a 1, or the probability that the potential customer
would default, until all of the remaining variables were significant. This dropped the total number
of variables to 177 and resulted in the ROC curve shown in Figure 17 below, where a c-statistic
of 0.880 can be found. Note that c = % concordant + ½(% tied).
Figure 17: ROC Curve for the Full Model
The percent concordant shown in Table 22 is the percent of pairs of observations, one with
GOODBAD=1 and one with GOODBAD=0, in which the predicted probability of a 1 is lower for the
true 0 than for the true 1. Percent discordant is the percent of pairs in which the predicted
probability of a 1 is higher for the true 0 than for the true 1. The higher the percent
concordant the better; given a c-statistic of 0.880, roughly 88% of pairs were ordered correctly
by this model.
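The pair counting behind percent concordant, percent discordant and the c-statistic can be sketched in Python. This is a brute-force illustration on made-up data, not the PROC LOGISTIC implementation; the function name and toy values are hypothetical.

```python
from itertools import product

def concordance(y_true, p_hat):
    """Return (pct_concordant, pct_discordant, pct_tied, c) over all 1-vs-0 pairs."""
    ones = [p for y, p in zip(y_true, p_hat) if y == 1]
    zeros = [p for y, p in zip(y_true, p_hat) if y == 0]
    pairs = len(ones) * len(zeros)
    if pairs == 0:
        raise ValueError("need at least one observation of each class")
    # Concordant: the true 1 received the higher predicted probability of a 1.
    conc = sum(1 for p1, p0 in product(ones, zeros) if p1 > p0)
    disc = sum(1 for p1, p0 in product(ones, zeros) if p1 < p0)
    tied = pairs - conc - disc
    pct_c, pct_d, pct_t = (100.0 * n / pairs for n in (conc, disc, tied))
    c = (pct_c + 0.5 * pct_t) / 100.0
    return pct_c, pct_d, pct_t, c

# Toy data (made up): two defaults (1) and three non-defaults (0).
y = [1, 1, 0, 0, 0]
p = [0.9, 0.4, 0.4, 0.2, 0.1]
pct_c, pct_d, pct_t, c_stat = concordance(y, p)
print(round(pct_c, 1), round(pct_d, 1), round(pct_t, 1), round(c_stat, 3))
```

Counting every pair is quadratic in the number of observations, so this form is only practical for small samples.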
To address multicollinearity, where multiple transformations of the same variable proved to be
significant, only the transformation with the highest Chi-Square value was retained and the
others were dropped. For example, LODSEQBRHIC was retained, but BRHIX and ODSBRHIC were dropped.
PROC LOGISTIC was then run two more times: once after removing the duplicate transformations,
which left 60 significant variables that were sorted by Chi-Square value, and again on the 20
variables with the largest Chi-Square values from that run. From these 20 variables, the 10 with
the largest Chi-Square values, shown in Table 23, became the predictors of the final model.
Using the above table the model could be built as shown below.
Table 23: Maximum Likelihood Estimates for the Final Model
Table 22: Concordance for the Full Model
Interpretation of BRPCTSAT using Table 23:
For a 1 unit increase in BRPCTSAT, holding all other variables in the model constant,
the odds of defaulting decrease by (1 - exp(-2.2634))*100% = 89.6%.
Interpretation of BRPCTSAT using Table 24 below or the odds ratio estimates:
For a 1 unit increase in BRPCTSAT, holding all other variables in the model constant,
the odds of defaulting are multiplied by 0.104.
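The two interpretations above are the same calculation read two ways, which a short Python check makes explicit. Only the coefficient -2.2634 comes from the report; the function name is hypothetical.

```python
import math

def odds_ratio_summary(beta):
    """Convert a logistic coefficient into an odds ratio and a percent change in odds."""
    odds_ratio = math.exp(beta)
    pct_change = (odds_ratio - 1.0) * 100.0  # negative means the odds decrease
    return odds_ratio, pct_change

# Coefficient for BRPCTSAT as quoted in the text.
odds_ratio, pct_change = odds_ratio_summary(-2.2634)
print(f"odds ratio = {odds_ratio:.3f}, change in odds = {pct_change:.1f}%")
```

A negative percent change corresponds to the 89.6% decrease in the odds, and the odds ratio is the multiplier reported in the odds ratio estimates.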

Table 24: Odds Ratio Estimates and Wald Confidence Intervals for the Final Model

Model Evaluation
Figure 18 below displays the ROC curve for the final model, which has a c-statistic of 0.8401.
This indicates that the model is still strong, considering that the initial logistic procedure
run on 451 variables resulted in a c-statistic only 0.04 higher.
In Table 25 below, it can be seen that the percent concordant was 84, meaning that 84% of pairs
were correctly ordered and 16% were not. Considering that only 10 variables were used versus the
initial 451, simplifying the model by 441 variables for a 4 percentage point increase in error
seemed like a fair trade.
The next thing investigated was the profitability of the model and how it could be optimized.
One way to do this was to find the cut-off point for the probability of default that would
maximize profit, using the classification table in Table 26 below.
Figure 18: ROC Curve for the Final Model
Table 25: Concordance for the Final Model
Items to be noted in Table 26 when trying to find an optimal cut-off point were as follows:
• High specificity – These were the people that were predicted to be good customers and
were actually good. Each incident resulted in a profit increase of $250. This was the
best-case scenario.
• High sensitivity – These were the people that were predicted to be bad customers and
were actually bad. Each incident did not have a direct impact on the profit, but was
considered as a potential loss that was prevented.
• Low false negatives (Type I Error) – These were the people that were predicted to be
good customers and were actually bad. Each incident resulted in a loss of half the credit
line. This was the worst-case scenario.
• Low false positives (Type II Error) – These were the people that were predicted to be bad
customers and were actually good. Again, each incident did not have a direct impact on
the profit but was considered as a lost opportunity.
It appeared that the optimal cut-off would fall somewhere between 0.2 and 0.3 as these
probability levels both have high specificity, relatively high sensitivity and low false negatives.
The percent of false positives was neither low nor high. This however was not of much concern
as it did not directly affect the profitability for the model.
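The profit rules listed above (a $250 gain for each good customer predicted good, a loss of half the credit line for each bad customer predicted good, and no direct effect for anyone predicted bad) can be sketched as a search over cut-off points. This is an illustrative Python sketch on made-up data, not the SAS code used for the report; the function names, the `>=` convention for predicting bad and the toy credit lines are assumptions.

```python
def profit_per_1000(y_true, p_default, credit_line, cutoff, gain=250.0):
    """Profit per 1,000 scored customers at a given cut-off.

    Assumed convention: a customer is predicted bad when p_default >= cutoff.
    """
    total = 0.0
    for y, p, line in zip(y_true, p_default, credit_line):
        if p >= cutoff:
            continue              # predicted bad: no direct profit impact
        if y == 0:
            total += gain         # predicted good, actually good
        else:
            total -= 0.5 * line   # predicted good, actually bad
    return 1000.0 * total / len(y_true)

def best_cutoff(y_true, p_default, credit_line, cutoffs):
    """Return the cut-off with the highest profit per 1,000 customers."""
    return max(cutoffs, key=lambda c: profit_per_1000(y_true, p_default, credit_line, c))

# Toy data (made up): 1 = default, 0 = no default.
y = [0, 0, 0, 1, 1, 0]
p = [0.05, 0.10, 0.25, 0.30, 0.80, 0.15]
lines = [1000, 2000, 1500, 2000, 3000, 1000]
grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 through 0.9

best = best_cutoff(y, p, lines, grid)
print(best, round(profit_per_1000(y, p, lines, best), 2))
```

The same search can be rerun on a finer grid between the two best coarse cut-offs.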
This value was further investigated in Figure 19, where there is an apparent peak in
profitability per 1,000 customers. The dollar values for the profitability curve were calculated
in SAS, using the profitability criteria mentioned before, for the 0.1-0.9 cut-offs. As expected
from the analysis of the classification table in Table 26, this peak occurred between 0.2 and
0.3. The profitability between these two values was examined further at the 0.21-0.29 cut-off
points, as shown in Table 27, where a cut-off point of 0.22 proved to be optimal, maximizing
profitability at $113,956.31 per 1,000 customers.
Table 26: Classification Table for the Final Model
Following this investigation, a profitability table using the optimal cut-off for the
probability of default was generated, shown in Table 28. The results in that table are as
follows:
• ERROR1 – 5.64% of people were predicted to be good customers and were actually bad,
resulting in a loss of $998,708.48 per 1,000 customers or a total loss of $42,400,168.50.
• ERROR2 – 14.29% of people were predicted to be bad customers and were actually good
customers.
• VALID1 – 11.96% of people were predicted to be bad customers and were actually bad
customers.
• VALID2 – 68.11% of people were predicted to be good customers and were actually
good, resulting in a profit of $250,000 per 1,000 customers or a total profit of
$85,810,581.50.
Figure 19: Profitability Curve for the Final Model
Table 27: Profit per 1,000 for Percent Cut-Off
Table 28: Profitability Table for the Final Model
Another approach to optimizing profitability is to find a cut-off point using the Kolmogorov-
Smirnov (KS) Test. After sorting the data by probability of default in ascending order, the
observations were divided into deciles. Table 29 lists the KS values, each equal to the
difference between the cumulative percentage of goods and the cumulative percentage of bads.
The point where this difference, or spread, between the two cumulative percentages is largest
gives the largest KS value and is taken as the optimal cut-off point.
In Table 29 the largest KS value occurred at 40%. While ideally the cut-off points from the
profitability function and the KS test should be the same, the cut-off point for the KS test was
almost double the one found using the classification table and profitability function. The KS
values were plotted in Figure 20 below, where the KS value of 50.88 is shown to give the largest
spread.
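The decile-based KS calculation described above can be sketched in Python. This is an illustration on made-up data rather than the SAS code behind Table 29; the function name is hypothetical, and the toy example uses 5 bins only because it has 10 observations.

```python
def ks_by_bin(y_true, p_default, n_bins=10):
    """Return (depth_pct, cum_pct_goods, cum_pct_bads, ks) for each bin.

    Rows are sorted by predicted probability of default, ascending, and cut
    into n_bins groups of (nearly) equal size.
    """
    rows = sorted(zip(p_default, y_true))
    n = len(rows)
    total_bad = sum(y for _, y in rows)
    total_good = n - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one good and one bad observation")
    table, cum_good, cum_bad, start = [], 0, 0, 0
    for b in range(1, n_bins + 1):
        end = (b * n) // n_bins
        for _, y in rows[start:end]:
            cum_bad += y
            cum_good += 1 - y
        start = end
        pct_good = 100.0 * cum_good / total_good
        pct_bad = 100.0 * cum_bad / total_bad
        table.append((100.0 * b / n_bins, pct_good, pct_bad, abs(pct_good - pct_bad)))
    return table

# Toy data (made up): 1 = bad (default), 0 = good.
p = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.90]
y = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
ks_table = ks_by_bin(y, p, n_bins=5)
best_row = max(ks_table, key=lambda row: row[3])
print(best_row)
```

The row with the largest spread identifies the depth at which goods and bads are best separated.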
Table 29: Kolmogorov-Smirnov Test for the Final Model
Figure 20: Kolmogorov-Smirnov Curve for the Final Model
The lift ratio for each decile was then computed by dividing the cumulative percent of goods by
the cumulative decile percentage (e.g. the cumulative percent of goods for the first decile is
divided by 0.1, the second by 0.2, etc.). Figure 21 shows the lift values for each decile. The
lift chart shows how much more likely defaults are to be identified using the final model versus
no model. In this case, the model was 1.2 times more likely to identify a default than a random
model or no model.
Figure 22 indicates that the model achieved an 8.13% improvement over a random model with a 50%
probability of default.
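The lift calculation described above is a single division per decile, sketched here in Python. The cumulative shares are made-up values chosen so that the first decile gives a lift near 1.2; they are not the values behind Figure 21, and the function name is hypothetical.

```python
def cumulative_lift(cum_share, n_bins=10):
    """Divide each cumulative share (0-1) by the cumulative depth of its decile."""
    return [share / ((k + 1) / n_bins) for k, share in enumerate(cum_share)]

# Hypothetical cumulative shares captured through deciles 1..10.
cum_share = [0.12, 0.23, 0.34, 0.45, 0.56, 0.66, 0.76, 0.85, 0.93, 1.00]
lift = cumulative_lift(cum_share)
print([round(v, 2) for v in lift])
```

By construction the lift at the last decile is 1, since the whole population is included at that depth.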
Figure 21: Lift Chart for the Kolmogorov-Smirnov Test
Figure 22: Gains Chart for the Kolmogorov-Smirnov Test

Customer Segmentation Analysis
For the customer segmentation analysis, it was decided to cluster 6 different times, once for each of
the three transformations in both their supervised and unsupervised forms, to see if any resulted in a
significantly higher profit per 1,000 customers than the model. The clusters for the most profitable
transformation would also be investigated to see if there were any notable differences in spending
habits between high-profit and low-profit clusters.
In creating the groups of variables for cluster analysis, the 5 most significant variables for each
transformation were selected. PROC CLUSTER was run on each set of variables, generating three criteria
that helped determine the optimal number of clusters: the Cubic Clustering Criterion (CCC), Pseudo-F
and Pseudo T-Squared. The optimal number was found where there was a peak in the CCC, a peak in the
Pseudo-F and a dip in the Pseudo T-Squared. It can be seen in Figure 23 below that, for the log of odds
unsupervised transformation, these events occurred at 5 clusters.

PROC FASTCLUS was then used to cluster the data. The number of clusters specified was confirmed
to be appropriate by examining the cluster summary table that was output and the distances
between cluster centroids across all of the clusters. In Table 30 below, it can be seen that the
distances between centroids are all about the same. Had one been significantly smaller than the
rest, collapsing that cluster into its nearest neighbor would have been considered.
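The centroid-distance check described above can be sketched in Python. This is not PROC FASTCLUS output: the centroids are made up, the function names are hypothetical, and the rule for "significantly smaller" (less than half the median nearest-centroid distance) is an assumption chosen only for illustration.

```python
from math import dist

def nearest_centroid_distances(centroids):
    """For each centroid, return (index of the nearest other centroid, distance)."""
    result = []
    for i, c in enumerate(centroids):
        d_min, j_min = min((dist(c, d), j) for j, d in enumerate(centroids) if j != i)
        result.append((j_min, d_min))
    return result

def collapse_candidates(centroids, ratio=0.5):
    """Indices whose nearest-centroid distance is below ratio * the median such distance."""
    nearest = nearest_centroid_distances(centroids)
    ordered = sorted(d for _, d in nearest)
    median = ordered[len(ordered) // 2]
    return [i for i, (_, d) in enumerate(nearest) if d < ratio * median]

# Made-up centroids: the last two sit much closer together than the rest.
crowded = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (4.5, 4.2)]
even = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
print(collapse_candidates(crowded))
print(collapse_candidates(even))
```

An empty result corresponds to the situation reported here, where all centroid distances were about the same.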
Figure 23: Cluster Criteria Analysis for the Log of Odds (Unsupervised)
These clusters can be better visualized using the cluster plot in Figure 24 below.
After defining the clusters, a profitability table was generated so that the profits could be
compared to the final model for predicting GOODBAD. The results for the log of odds
unsupervised transformation are shown in Table 31 below.

Recall that the final model without clustering profited $113,956.31 per 1,000 customers. Table 31
shows that by clustering on the 5 most significant variables for the log of odds unsupervised
transformation, profit per 1,000 customers almost doubled, to $222,078.40.
Table 30: Cluster Summary for Log of Odds (Unsupervised)
Figure 24: Cluster Plot for Log of Odds (Unsupervised)
Table 31: Cluster Profitability for Log of Odds (Unsupervised)

Limitations and Weaknesses
One of the biggest limitations of this particular model was the difficulty of interpreting the
variables that comprised it. Because the most mathematically optimal approach was always taken,
without considering trade-offs that might have produced a slightly less significant but simpler
model, it comes as no surprise that the final model was comprised of raw, ordinal, odds and log
of odds transformations. Furthermore, all of the transformed variables that were retained were
unsupervised, making them even harder to explain. Again, the goal for this model was solely to
maximize profit, and it would suit a client with that sole goal in mind. However, it is likely
not suited for anyone who wants or needs to interpret the model in a relatively simple context.
An example of a weakness in this model was the percentage of Type I Errors, or instances of
predicting that a person would be a good customer when they were actually bad. Though it was
only 5.64% of the data, each one predicted incorrectly was still a loss of half of the credit
line. In this case the total loss was $42,400,168.50, or $998,708.48 per 1,000 such customers as
reported with Table 28.
Another big weakness was the percentage of Type II Errors, or missed opportunities. A total of
14.29%, or 107,639 people in this dataset, were predicted to be bad customers when they were
actually good customers. The missed opportunity cost to the company can be calculated as
107,639 * $250 = $26,909,750.
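The missed opportunity figure, and the combined total of roughly $70,000,000 discussed in the Conclusions, can be checked with a few lines of Python. All inputs are the figures quoted in the text; the variable names are hypothetical.

```python
from decimal import Decimal

type2_count = 107_639                        # predicted bad, actually good
profit_per_good = Decimal("250")             # profit per good customer
type1_total_loss = Decimal("42400168.50")    # reported total Type I loss

missed_opportunity = type2_count * profit_per_good
combined = type1_total_loss + missed_opportunity

print(f"Missed opportunity: ${missed_opportunity:,.2f}")
print(f"Combined loss and missed opportunity: ${combined:,.2f}")
```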

Conclusions
While building the most mathematically optimal model may seem like the best approach, as it always
maximizes profit, other approaches and trade-offs should always be considered. This depends heavily
on what the employer or client is looking for; since there was no employer or client involved, the
decisions made throughout this modeling process were based solely on what was thought to be best.
Looking at the outcome of the final model, which was built mainly with a mathematically optimizing
mindset, it can be concluded that this exact approach probably would not work for someone who desired
a model that only included variables that were easy to interpret.
In addition, the Type I and Type II Errors need to be recognized. The combined loss and potential cost
added up to about $70,000,000. This was a very large and unexpected total and was not noted until the
final analysis. For future analyses it seems appropriate not just to look at outputs such as the
classification table and profitability function for the most profitable cut-off point and conclude
there, but also to review the consequences of using that particular cut-off point. After calculating
the combined loss and potential cost and finding it to be excessively large, the best option might be
to go with a more conservative cut-off point. It is always important to discuss these considerations
with the client or employer.
