A Path to Profit via Logistic Regression
John Michael Croft
Advising Faculty: Dr. Priestley
Department of Statistics and Analytical Sciences
Step 2: Create Response Variable
The response variable is created from delinquency ID (DelqID) values ranging from 0 to 9. Where customers have multiple delinquency IDs, the highest value is used as a conservative measure of default risk.
• Good = 0 where DelqID < 3
• Bad = 1 where DelqID > 2
• The average ‘default’ rate is 17.57% within the sample data (Table 2).
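A minimal SAS sketch of this construction, assuming a shared customer key named MATCHKEY (an illustrative assumption; the poster does not show its actual code):

/* Hypothetical sketch: keep each customer's highest DelqID, then
   derive the binary response. MATCHKEY is an assumed key name. */
proc sql;
  create table work.maxdelq as
  select matchkey, max(delqid) as delqid
  from work.perf
  group by matchkey;
quit;

data work.response;
  set work.maxdelq;
  goodbad = (delqid > 2);   /* 1 = Bad (DelqID > 2), 0 = Good */
run;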
TABLE 23: MODEL VALIDATION COMPARISONS
Validation Dataset  % Valid 1  % Valid 2  % Type I Error  % Type II Error  Est. Profit per 1000
TRAINING            11.60%     67.00%     5.93%           15.47%           $102,650
1                   11.68%     66.85%     5.95%           15.53%           $110,432
2                   11.69%     66.83%     5.91%           15.61%           $110,895
3                   11.68%     66.84%     5.90%           15.58%           $110,757
4                   11.67%     66.82%     5.95%           15.57%           $110,473
5                   11.72%     66.79%     5.92%           15.58%           $110,792
TABLE 20: ANALYSIS OF MAXIMUM LIKELIHOOD ESTIMATES
Parameter       DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
ORDRADB6         1   0.4292   0.00349         15147.1948       <.0001
LODSEQTOPENB75   1   0.5552   0.00692          6440.2938       <.0001
CRATE2           1   0.4273   0.00632          4570.2359       <.0001
ORDBRR4524       1   0.8795   0.0104           7218.3086       <.0001
LODSEQBRCRATE1   1   0.7009   0.00851          6774.7374       <.0001
BRCRATE3         1   1.0478   0.0143           5366.704        <.0001
LODSEQBRMINB     1   0.3288   0.00752          1913.1348       <.0001
ODSEQBRAGE       1   3.8637   0.0618           3910.3534       <.0001
LOCINQS          1   0.0862   0.00145          3521.8592       <.0001
BRCRATE4         1   0.4266   0.00998          1829.0249       <.0001
TPCTSAT          1  -1.0071   0.0156           4182.2141       <.0001
ODDSBRRATE2      1   2.0349   0.0484           1764.9455       <.0001
Figure 13: BADPR1 Rank Default Rates (DISC 2): plot of average default rate (avg_dep, 0.09 to 0.35) by rank of variable BADPR1 (ranks 4 to 9).
Figure 12: Initial & Final ORDBADPR1 Bin Default Rates (DISC 1): plots of mean default rate (meangoodbad, 0.09 to 0.41) by badpr1 bin, for the initial bins (0 to 14) and the final combined bins (0 to 12).
TABLE 9: CLUSTER VARIABLE SELECTION
Cluster    Variable  R² Own Cluster  R² Next Closest  1 − R² Ratio
CLUSTER 1  DCCRATE7  .892            .510             .221
           DCRATE79  .892            .511             .220
           DCR7924   .875            .531             .267
           DCN90P24  .844            .620             .412
           DCR39P24  .755            .633             .669
           DCCR49    .881            .565             .272
CLUSTER 2  BRTRADES  .892            .463             .201
           BROPEN    .800            .584             .482
           BRCRATE1  .939            .514             .126
           BRRATE1   .903            .507             .197
           BRR124    .835            .585             .397
           BROPENEX  .892            .463             .201
CLUSTER 3  BRWCRATE  .619            .347             .584
           BRCRATE7  .856            .370             .228
           BRRATE79  .855            .370             .230
           BRR7924   .568            .365             .680
           BRR49     .775            .677             .700
           BRCR39    .810            .626             .507
           BRCR49    .873            .551             .283
TABLE 7: PROPORTION OF VARIANCE EXPLAINED
Number of Clusters  Proportion of Variance Explained
30  .725
31  .733
32  .739
33  .745
34  .753
35  .760
36  .766
37  .772
38  .778
39  .783
40  .788
41  .793
42  .797
43  .802
44  .807
45  .811
Figure 6: Max Standard Deviation of 4 for RBAL and TRADES: percent distributions of RBAL (0 to 45,000) and TRADES (0 to 60) after recoding.
TABLE 2: GOODBAD DISTRIBUTION
GOODBAD  FREQUENCY  PERCENT
0        1034829    82.43
1        220600     17.57
Process Overview
Step 1: Merge Datasets
Step 2: Create Response Variable
Step 3: Median Imputation: Coded Values
Step 4: Variable Reduction
Step 5: Variable Dispersion Reduction
Step 6: Variable Clustering
Step 7: Discretization Processes
Step 8: Final Model Predictors
Step 9: Maximize Est. Profit
Step 10: Model Validation

Step 1: Merge Datasets
Two datasets are merged via an inner join to associate potential credit performance predictors with delinquency data.
• CPR: 1.44M observations and 339 variables. Each observation represents a unique customer.
• PERF: 17.24M observations and 18 variables. This file contains the post hoc performance data.
• The post-merge dataset has 1.26M observations and 339 potential predictors.
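A minimal SAS sketch of the join, reusing the response table from the Step 2 sketch (MATCHKEY remains an assumed key name):

/* Hypothetical sketch: inner join of predictors to performance data. */
proc sql;
  create table work.merged as
  select c.*, r.goodbad
  from work.cpr as c
       inner join work.response as r
       on c.matchkey = r.matchkey;
quit;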
Step 3: Median Imputation: Coded Values
Many variables contained coded values (e.g., 999 and 999,999) that significantly skewed their distributions. The large number of variables makes more advanced methods, such as regression imputation, too inefficient for this process.
To minimize distributional impact, median imputation is implemented: many variables are right-skewed, so the mean exceeds the median and would pull imputed values into the tail.
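A minimal SAS sketch of this step (dataset names and the coded-value list are illustrative assumptions, not the author's actual macro):

/* Hypothetical sketch: set coded values to missing, then replace only
   the missing values with each variable's median via PROC STDIZE. */
data work.flagged;
  set work.merged;
  array nums {*} _numeric_;
  do _i = 1 to dim(nums);
    if nums{_i} in (999, 999999) then nums{_i} = .;
  end;
  drop _i;
run;

proc stdize data=work.flagged out=work.imputed method=median reponly;
  var &predictors;   /* exclude the response and any key variables */
run;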
Step 4: Variable Reduction
For efficiency, a macro analyzes all potential predictors to address coded and missing values. Implementing a cutoff of 40% retains 165 variables whose combined coded and missing values do not exceed the cutoff. Cutoffs of .2 and .3 were strongly considered; however, given the early stage of the process, more variables are retained.
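A hedged sketch of the screening idea, assuming coded values were already set to missing as in the Step 3 sketch (&predictors is an illustrative macro variable):

/* Hypothetical sketch: count combined coded+missing values per variable;
   variables exceeding 40% of observations are dropped (165 survive). */
proc means data=work.flagged nmiss noprint;
  var &predictors;
  output out=work.missrates nmiss= / autoname;
run;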
Step 5: Variable Dispersion Reduction
The same macro addresses variable dispersion through a parameter for maximum standard deviation. Any observation beyond this threshold is recoded to the median value, just as missing and coded values were in the steps above.
As the maximum allowed standard deviation decreases, more ‘outliers’ are recoded to the median, compressing the right tail toward the median. Because many variables have right-skewed distributions, a maximum standard deviation of four is chosen to preserve some distributional spread in the right tails (Figure 6).
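A minimal sketch of the rule for a single variable, under the four standard deviation setting (names are illustrative):

/* Hypothetical sketch: recode values beyond 4 standard deviations to
   the median, compressing the right tail (shown for RBAL only). */
proc means data=work.imputed noprint;
  var rbal;
  output out=work.stats mean=mu std=sigma median=med;
run;

data work.reduced;
  if _n_ = 1 then set work.stats(keep=mu sigma med);
  set work.imputed;
  if rbal > mu + 4*sigma then rbal = med;
run;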
Step 6: Variable Clustering
Further reduction of the remaining variables is achieved through variable cluster analysis. A scatter plot evaluates the proportion of variance explained at different numbers of clusters (Table 7). SAS recommends 32 clusters as optimal; however, 43 clusters are retained to reach the desired 80% proportion of variance explained.
Within each cluster, a single variable is selected as the optimal representative: the variable with the lowest 1 − R² ratio within its cluster (Table 9), where

$$1 - R^2\ \text{Ratio} = \frac{1 - R^2_{\text{own cluster}}}{1 - R^2_{\text{next closest cluster}}}$$
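A minimal sketch of the clustering run, with &predictors standing in for the 165 retained variables (an assumption):

/* Hypothetical sketch: oblique principal component variable clustering.
   PROC VARCLUS reports each variable's 1-R**2 ratio, used here to pick
   one representative per cluster. */
proc varclus data=work.reduced maxclusters=43 short;
  var &predictors;
run;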
Step 7: Discretization Processes
The remaining 44 potential predictor variables are evaluated for transformations under two separate processes that create ordinal versions of the variables.
Discretization 1 (DISC 1) is a supervised process that distributes transformed variables into equal-width bins. DISC 1 requires the user to decide when to combine consecutive bins that show little separation in default probability or have low bin frequencies (Figure 12).
Discretization 2 (DISC 2) is an unsupervised process that distributes transformed variables into equal-frequency bins. DISC 2 uses a t-test to evaluate consecutive ranks for significance, combining lower ranks into higher ranks where no significant difference is found (Figure 13).
Potential transformations per DISC technique:
• Ordinal transformations
• Odds ratio of default
• Log odds ratio of default
Up to seven potential versions of each variable are evaluated for modeling. Not all variables went through both discretization processes, for several reasons. Approximately 250 variables enter the modeling phase.
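A minimal sketch of DISC 2 style equal-frequency binning for one variable (GROUPS= and names are illustrative; the t-tests of adjacent ranks would follow, e.g., via PROC TTEST):

/* Hypothetical sketch: ordinal (rank) version of BADPR1 via equal-
   frequency bins. */
proc rank data=work.reduced groups=10 out=work.ranked;
  var badpr1;
  ranks ord_badpr1;
run;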
Step 8: Final Model Predictors
The remaining potential predictors are used to logistically model GOODBAD via a backward selection process (an iterative process that begins with all potential predictors in the model and removes the least significant predictor at each iteration), fit on 70% of the sample data.
The process repeats until all remaining predictors are significant at the .05 level (i.e., all remaining predictors’ p-values are below .05). The top twelve predictors from the backward selection logistic regression model are retained as the final model predictors (Table 20).
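A minimal sketch of the selection fit, with &candidates standing in for the roughly 250 candidate variable versions (an assumption):

/* Hypothetical sketch: backward selection at the .05 stay level, with
   predicted default probabilities saved for the profit step. */
proc logistic data=work.train;
  model goodbad(event='1') = &candidates
        / selection=backward slstay=0.05;
  output out=work.scored p=phat;
run;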
Step 9: Maximize Est. Profit
Estimated profit is optimized using a function provided by the ‘client’, varying the cutoff above which a predicted default probability flags a customer as high risk. Estimated profit is evaluated in two passes: 1) a high-level sweep of cutoffs from 0 to 1 by .1, and 2) a drill-down from .1 to .4 by .01. The optimal cutoff is .21.
Profit function: (250 × correctly predicted ‘GOODs’) − (856 × incorrectly predicted ‘GOODs’)
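A hedged sketch of the drill-down sweep over the scored data (dataset and variable names are illustrative, not the author's code):

/* Hypothetical sketch: evaluate the client profit function at each
   cutoff from .10 to .40 by .01. */
data work.profit;
  set work.scored end=last;
  array hits {31} _temporary_;   /* correctly predicted GOODs */
  array bads {31} _temporary_;   /* defaulters predicted GOOD */
  do i = 1 to 31;
    cutoff = 0.09 + i/100;
    if phat < cutoff then do;    /* predicted GOOD at this cutoff */
      if goodbad = 0 then hits{i} = sum(hits{i}, 1);
      else bads{i} = sum(bads{i}, 1);
    end;
  end;
  if last then do i = 1 to 31;
    cutoff = 0.09 + i/100;
    profit = 250*hits{i} - 856*bads{i};
    output;
  end;
  keep cutoff profit;
run;

The maximum of this curve over the drill-down grid corresponds to the reported optimal cutoff of .21.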
Step 10: Model Validation
The final model is validated five times by randomly sampling from the remaining 30% of the sample data. Type I and Type II errors appear consistent across all validation models (Table 23).
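A minimal sketch of drawing one of the five validation samples from the 30% holdout (SAMPRATE and SEED are illustrative assumptions):

/* Hypothetical sketch: simple random sample from the holdout. */
proc surveyselect data=work.holdout out=work.val1
                  method=srs samprate=0.5 seed=20161;
run;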
Potential Next Steps
• Compare to a simplified model using the original (untransformed) versions of the final predictors.
• Perform observational clustering and re-optimize the profit function with cluster-specific default probability cutoffs.
• Evaluate other imputation methods and selection criteria.