Stephanus Darmapuspita
Eric Esajian
DSO 528
Executive Summary
Objective: Trojan Financial Services has issued thousands of home equity lines of credit to its
clients, and many of those borrowers have defaulted on their loans. The objective here is to
build a model that predicts whether or not an applicant will default on their loan. The approver
currently receives 10,000 loan applications per month, so the model will aid the loan officer
and help Trojan Financial decide whether to approve or deny each loan based on a number of
given decision variables. Every loan that is offered and then defaulted on costs the company
around $20,000 on average. Therefore, a sound logistic regression model needs to be
constructed in order to identify which applicants are likely to default on their home equity
line of credit issued by Trojan Financial.
JMP Model: The JMP model will consider a number of variables. It will focus on those
variables that predict whether a loan WILL default, as opposed to whether the loan will be
repaid successfully.
Key Insights: It is imperative that only variables that genuinely improve this model be
chosen. Those variables should be significant at the 0.05 level (p-value below 0.05), and even
lower p-values are preferable. However, keep in mind that, in addition to variables with low
p-values, we should use business insight to consider other variables that can improve the
prediction even if their p-values are slightly higher. The general idea is to select variables
that predict whether a loan will default, since each default costs Trojan Financial $20,000 on
average. The more significant variables we add, the greater the chance of finding a better
model.
Your Best Model: Our best JMP model used “Bad” as the response (Y) variable, since we
want to predict which loans will be bad. As predictors (X variables), our best model included
the following: Reason[DebtCon], Derog, Delinq, Debtinc, Log Age, and Loan/Value.
These variables provided us with the strongest model. The R-square was reasonable, though
not ideal, at 0.22, and most of the variables we used had p-values below 0.05, four of
which were actually below 0.001. The rule we used to classify loans was the following if
statement: if the predicted probability of a bad loan is greater than 0.22, we assign it a 1
(deny); otherwise we assign a 0 (approve). With all this in place, we were able to derive over
$24 million in profit on the training data, and well over $25 million on the testing data.
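The classification rule above is a one-line threshold test. A minimal sketch (the function name is ours, not part of the JMP output):

```python
def classify_loan(p_bad, cutoff=0.22):
    """Return 1 (predict default / deny) when the model's predicted
    probability of a bad loan exceeds the cutoff, else 0 (approve)."""
    return 1 if p_bad > cutoff else 0
```

In JMP the same rule can be built as a formula column applying an If statement to the saved probability column.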
Key Changes Made: One of the key changes was raising the cutoff probability. The cutoff
was typically set at 0.15; however, we found that a cutoff of 0.22 gave the model a higher
total profit.
Key Insights: One key insight concerned the number of variables used in this model.
Standard thinking tells us that fewer variables yield a better, simpler model. That was not
the case here: the more variables we included in the model, the better the profit and the
higher the R-square.
Why is your model better? This model is better because it will provide Trojan Financial
Services between 24 and 25 million dollars. The overall model shows very low p-values, and
its R-square is around 0.22, so we are confident that it will capture most of the bad loans
while keeping most of the profitable lines of credit. One notable characteristic of this model
is that the false positive rate is extremely low, around 4%. This means the model rarely
misclassifies a good loan, and we have a very low chance of being wrong with the variables
chosen.
What is the lift (as a ratio) provided by your model compared to the Baseline Model for
both training and testing? What is the increase in net dollar amount compared to the
Baseline Model for both training and testing? In dollars, this model's lift with respect to
the baseline was 1.44 for the training data and 1.52 for the testing data. In propensity, the
lift with respect to the baseline is 9.71 for the training data and 9.79 for the testing data.
In net dollars, the best model earns $24.04 million on training and $25.48 million on testing,
versus an implied baseline of $16.72 million, an increase of roughly $7.3 million on training
and $8.8 million on testing.
Conclusion: We believe this is a good model because we used the variables that were given
and added two new variables, which proved significant in improving the total profit. JMP Go
has done a good job, but by adding more business perspective we were able to come up with a
better solution. We could probably beat JMP further by adding more variables based on tribal
knowledge, and by dropping more of the variables that appear to be insignificant.
Base Model
i) Statistical KPIs of JMP Model – From JMP Printout
Measure                  Training   Validation   Definition
Entropy RSquare          0.2180     0.2641       1-Loglike(model)/Loglike(0)
Generalized RSquare      0.2751     0.3253       (1-(L(0)/L(model))^(2/n))/(1-L(0)^(2/n))
Mean -Log p              0.2490     0.2226       ∑ -Log(ρ[j])/n
RMSE                     0.2576     0.2420       √(∑(y[j]-ρ[j])²/n)
Mean Abs Dev             0.1351     0.1202       ∑ |y[j]-ρ[j]|/n
Misclassification Rate   0.0800     0.0720       ∑ (ρ[j]≠ρMax)/n
N                        1000       1000         n
ii) Statistical KPIs of JMP Model – From Excel Printout
Other Metrics                      Training   Validation
Accuracy %                         86.80%     86.70%
True Positive Rate                 50.52%     56.67%
False Positive Rate                9.30%      10.33%
Sensitivity (True Positive Rate)   50.52%     56.67%
Specificity (True Negative Rate)   90.70%     89.67%
iii) a) Business KPIs of JMP Model – Training
Predicted number of Good Loans = 8670
Upper limit for Loans = 10000
Actual number of approved loans = 8670
Propensity of Good Loan = 94.464%
Propensity of Bad Loan = 5.536%
Total Profit = $23,160,000
b) Business KPIs of JMP Model – Testing
Predicted number of Good Loans = 8550
Upper limit for Loans = 10000
Actual number of approved loans = 8550
Propensity of Good Loan = 95.439%
Propensity of Bad Loan = 4.561%
Total Profit = $24,840,000
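The total-profit figures can be reproduced from the approval counts once the per-loan economics are fixed. The report states a bad loan costs about $20,000 on average; a profit of $4,000 per good loan is not stated anywhere in the report, but it is the value that reconciles every reported profit figure, so treat it as an inferred assumption. A minimal sketch:

```python
def total_profit(good_approved, bad_approved,
                 profit_per_good=4_000, loss_per_bad=20_000):
    """Net profit from a batch of approved loans.

    loss_per_bad is the report's ~$20,000 average cost of a default;
    profit_per_good ($4,000) is inferred, not stated in the report.
    """
    return good_approved * profit_per_good - bad_approved * loss_per_bad

# Base-model testing: 8,550 approvals, of which 95.439% were good loans.
profit_testing = total_profit(good_approved=8160, bad_approved=390)
```

With these two parameters the function returns $24,840,000 for the testing figures above and $23,160,000 for the training figures.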
iv) Interpret the Model (Base Logistic Regression) – From Business Point of view &
Statistical Point of view
From a business point of view, JMP has done a good job identifying variables that are indeed
significant for determining whether a loan candidate is likely to be a good or bad loan.
Variables such as DEROG, DELINQ, and DEBTINC do indicate whether a candidate will
lead to a good or bad loan. However, from a business perspective, it would be unwise to
judge a candidate from these few variables alone. There are many other variables that can be
considered, such as the reason for taking the loan and the amount of the loan relative to the
total value of the home. For example, people who seek a loan to pay off other loans are
obviously more likely to default on the new loan than people who seek a loan to do home
improvement.
From a statistical point of view, all the variables chosen are statistically significant, which is
a good thing. However, we note that the CLAGE variable shows a skewed distribution. We
might need to transform this variable in order to get a better regression result.
v) Confusion Matrix for Training (cutoff probability of 0.15)

                        Predicted
                  Good Loan   Bad Loan   Total
Actual Good Loan     819         84       903
Actual Bad Loan       48         49        97
Total                867        133      1000
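The Excel-printout rates in section ii) follow directly from this matrix when the bad loan is treated as the positive class (an assumption, but one consistent with all the reported numbers). A minimal recomputation:

```python
# Training confusion matrix at the 0.15 cutoff; bad loan = positive class.
tp, fn = 49, 48   # bad loans correctly flagged / missed
fp, tn = 84, 819  # good loans wrongly flagged / correctly approved

accuracy    = (tp + tn) / (tp + fn + fp + tn)  # 0.868
sensitivity = tp / (tp + fn)                   # true positive rate, ~50.52%
specificity = tn / (tn + fp)                   # true negative rate, ~90.70%
fpr         = fp / (fp + tn)                   # false positive rate, ~9.30%
```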
vi) Confusion Matrix for Testing

                        Predicted
                  Good Loan   Bad Loan   Total
Actual Good Loan     816         94       910
Actual Bad Loan       39         51        90
Total                855        145      1000
vii) Lift Table

Lift Table in Dollars                                   Training      Testing
Lift with respect to Baseline - JMP Model               1.385167464   1.485645933
Lift with respect to Baseline - My Best Model           1.437799043   1.523923445
Lift with respect to JMP Model - My Contribution        1.037996546   1.100172712
Overall Lift with respect to Baseline - My Best Model   1.437799043   1.523923445

Lift Table in Propensity                                Training      Testing
Lift with respect to Baseline - JMP Model               9.738522456   9.839030566
Lift with respect to Baseline - My Best Model           9.714728021   9.789862355
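The dollar lifts in this table are simply model profit divided by baseline profit. The baseline profit itself is never printed, but dividing any reported profit by its reported lift implies a common baseline of $16,720,000 (consistent with approving all 10,000 applications); treat that figure as inferred rather than stated. A quick check:

```python
# Inferred baseline: approve everything.  Not stated in the report, but
# every reported profit / lift pair resolves to this same number.
baseline_profit = 16_720_000

lift_jmp_train  = 23_160_000 / baseline_profit  # JMP model, training
lift_best_train = 24_040_000 / baseline_profit  # best model, training
lift_best_test  = 25_480_000 / baseline_profit  # best model, testing
```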
viii) Attach JMP Printout
Testing Confusion Matrix
Best Model
i) Statistical KPIs of My Best Model – From JMP Printout
Measure                  Training   Validation   Definition
Entropy RSquare          0.2214     0.1758       1-Loglike(model)/Loglike(0)
Generalized RSquare      0.2792     0.2222       (1-(L(0)/L(model))^(2/n))/(1-L(0)^(2/n))
Mean -Log p              0.2479     0.2494       ∑ -Log(ρ[j])/n
RMSE                     0.2570     0.2600       √(∑(y[j]-ρ[j])²/n)
Mean Abs Dev             0.1342     0.1353       ∑ |y[j]-ρ[j]|/n
Misclassification Rate   0.0790     0.0790       ∑ (ρ[j]≠ρMax)/n
N                        1000       1000         n
ii) Statistical KPIs of My Best Model – From Excel Printout
Other Metrics                      Training   Validation
Accuracy %                         91.00%     91.10%
True Positive Rate                 45.36%     48.89%
False Positive Rate                4.10%      4.73%
Sensitivity (True Positive Rate)   45.36%     48.89%
Specificity (True Negative Rate)   95.90%     95.27%
iii) a) Business KPIs of My Best Model – Training
Predicted number of Good Loans = 9190
Upper limit for Loans = 10000
Actual number of approved loans = 9190
Propensity of Good Loan = 94.233%
Propensity of Bad Loan = 5.767%
Total Profit = $24,040,000
b) Business KPIs of My Best Model – Testing
Predicted number of Good Loans = 9130
Upper limit for Loans = 10000
Actual number of approved loans = 9130
Propensity of Good Loan = 94.962%
Propensity of Bad Loan = 5.038%
Total Profit = $25,480,000
iv) Interpret the Model (My Best Model) – From Business Point of view & Statistical
Point of view
From a business perspective, we decided that we should definitely include the DEROG,
DEBTINC, and DELINQ variables, as they are significant from a business standpoint. In
addition, we decided to include REASON, as we think it is one of the variables that matters
from a business perspective. As explained, loan candidates are more likely to default if the
motive for taking the loan is refinancing debt rather than home improvement. The total
loan-to-value ratio also plays an important role: a very high loan-to-value ratio means people
are more likely to simply default and let the foreclosure happen if the value of their home
falls well below the loan value itself.
From a statistical point of view, we decided to apply a log transformation to CLAGE to
normalize it; this is done in the “Log Age” column. It turned out that this variable is
statistically significant as well. The combination of the statistical and business insight above
allowed us to produce a better model that beats the JMP Go stepwise regression with a lift of
1.04 on training and 1.10 on testing.
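The “Log Age” column can be reproduced with a one-line transformation. The report only says a log was taken, so the exact form (natural log vs. log1p, and the handling of zero-month accounts) is an assumption here:

```python
import math

def log_age(clage_months):
    # Hypothetical reconstruction of the "Log Age" column: natural log of
    # CLAGE (age of the oldest credit line, in months).  log1p is used so
    # a zero-month account maps to 0 instead of raising a domain error.
    return math.log1p(clage_months)
```

In JMP the same transform can be created as a formula column; log1p(x) equals log(1 + x).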
v) Confusion Matrix for Training (cutoff probability of 0.22)

                        Predicted
                  Good Loan   Bad Loan   Total
Actual Good Loan     866         37       903
Actual Bad Loan       53         44        97
Total                919         81      1000
vi) Confusion Matrix for Testing

                        Predicted
                  Good Loan   Bad Loan   Total
Actual Good Loan     867         43       910
Actual Bad Loan       46         44        90
Total                913         87      1000
vii) Lift Table

Lift Table in Dollars                                   Training      Testing
Lift with respect to Baseline - JMP Model               1.385167464   1.485645933
Lift with respect to Baseline - My Best Model           1.437799043   1.523923445
Lift with respect to JMP Model - My Contribution        1.037996546   1.100172712
Overall Lift with respect to Baseline - My Best Model   1.437799043   1.523923445

Lift Table in Propensity                                Training      Testing
Lift with respect to Baseline - JMP Model               9.738522456   9.839030566
Lift with respect to Baseline - My Best Model           9.714728021   9.789862355
viii) Attach JMP Printout (JMP printout with some explanation is below)
Nominal Logistic Fit for BAD
Converged in Gradient, 7 iterations
Whole Model Test
Model        -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference        70.49985     6    140.9997      <.0001*
Full             247.94100
Reduced          318.44085
With the best model that we chose, the whole-model test is highly significant
(Prob>ChiSq < 0.0001), meaning the chosen predictors jointly fit the data far better
than an intercept-only model.
RSquare (U)                    0.2214
AICc                           509.995
BIC                            544.236
Observations (or Sum Wgts)     1000
The RSquare here is acceptable, although not ideal. A higher RSquare is typically better;
however, given that over 10,000 approvals need to take place every month and there are fewer
than 10 variables to choose from, we believe a score of 0.22 is acceptable.
Measure                  Training   Definition
Entropy RSquare          0.2214     1-Loglike(model)/Loglike(0)
Generalized RSquare      0.2792     (1-(L(0)/L(model))^(2/n))/(1-L(0)^(2/n))
Mean -Log p              0.2479     ∑ -Log(ρ[j])/n
RMSE                     0.2570     √(∑(y[j]-ρ[j])²/n)
Mean Abs Dev             0.1342     ∑ |y[j]-ρ[j]|/n
Misclassification Rate   0.0790     ∑ (ρ[j]≠ρMax)/n
N                        1000       n
Lack Of Fit
Source        DF    -LogLikelihood   ChiSquare   Prob>ChiSq
Lack Of Fit   993       247.94100     495.882       1.0000
Saturated     999         0.00000
Fitted          6       247.94100
Parameter Estimates
Term              Estimate      Std Error   ChiSquare   Prob>ChiSq
Intercept          0.50243494   1.416252     0.13        0.7228
REASON[DebtCon]   -0.0903551    0.1352828    0.45        0.5042
DEROG             -0.8309048    0.1616376   26.43        <.0001*
DELINQ            -0.6897292    0.1260756   29.93        <.0001*
DEBTINC           -0.0859525    0.0165176   27.08        <.0001*
Log Age            2.20439479   0.5301889   17.29        <.0001*
Loan/Value         0.67044572   0.6135404    1.19        0.2745
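These estimates can be turned into a predicted probability with the standard logistic formula. One hedge: JMP's nominal logistic fit models the probability of the first response level, so whether the result below is P(good) or P(bad) depends on how BAD was coded; the sketch simply evaluates the fitted curve, and the dictionary keys mirror the term names above.

```python
import math

# Coefficients copied from the Parameter Estimates table above.
coef = {
    "Intercept":        0.50243494,
    "REASON[DebtCon]": -0.0903551,   # indicator: 1 if reason is DebtCon
    "DEROG":           -0.8309048,
    "DELINQ":          -0.6897292,
    "DEBTINC":         -0.0859525,
    "Log Age":          2.20439479,
    "Loan/Value":       0.67044572,
}

def prob_first_level(x):
    """x maps each non-intercept term to its value for one applicant;
    missing terms default to 0.  Returns the logistic of the linear
    predictor, i.e. the probability of JMP's first response level."""
    z = coef["Intercept"] + sum(coef[term] * value for term, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))
```

For example, with every predictor at 0 the fitted probability is the logistic of the intercept alone, about 0.62.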