The document analyzes models for predicting loan default using a German credit dataset. It fits generalized linear models, generalized additive models, linear discriminant analysis, and classification trees to the data. Based on out-of-sample testing with a 5:1 false-negative to false-positive cost, the generalized additive model achieved the lowest cost-weighted misclassification rate (0.400), while the logit GLM and the LDA model shared the highest area under the ROC curve (0.867). Overall, the performance of all the models was quite similar.
German Credit Data Models Compared
1. Executive Summary:
German Credit Data:
The objective of this report is to analyze the various models that can be fitted to the German Credit Score
dataset. The dataset records whether each of a number of loan applicants defaulted, so the response
variable is binary (0/1), which makes this a typical classification problem. Each observation is an
applicant, who either repays the loan (no default) or does not (a loss to the bank). The dataset contains
21 variables including the response variable.

Qualitative variables: status of checking account, credit history, purpose, savings account/bonds,
present employment since, personal status and sex, other debtors/guarantors, property, other
installment plans, housing, job, telephone, foreign worker.

Numerical variables: duration in months, credit amount, installment rate as a percentage of disposable
income, present residence since, age in years, number of existing credits at this bank, number of people
liable to provide maintenance for.
The original dataset contains 1000 observations with 21 variables (including the response variable). The
dataset is split into training and testing sets by stratified random sampling: 90% of the data is used for
training and the remaining 10% for testing (training dataset = 900 observations, testing dataset = 100
observations). Different types of models are then fitted in order to find the best model, judged by the
misclassification rate, the area under the ROC curve, and the mean residual deviance.
For the misclassification rate, a 5:1 cost has been specified for a false negative relative to a false positive.
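The stratified 90/10 split described above can be sketched in a few lines. This is a minimal Python version; the function and variable names are illustrative, not from the report, and the 700/300 class mix matches the standard German credit data.

```python
import random

def stratified_split(labels, test_frac=0.10, seed=42):
    """Split indices into train/test, preserving the class mix.

    `labels`, `test_frac`, and `seed` are illustrative names; the
    report specifies a 90/10 stratified random split.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        idxs = idxs[:]                       # copy before shuffling
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# 1000 applicants: 700 non-defaulters (0), 300 defaulters (1)
labels = [0] * 700 + [1] * 300
train, test = stratified_split(labels)
print(len(train), len(test))  # 900 100
```

Because the sampling is stratified per class, the 70/30 class proportion is preserved in both subsets.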
Models chosen:
Generalized Linear Model - fitted to the training data using the binomial family with a logit link.
Generalized Additive Model - fitted to the training data using splines on the continuous predictor
variables.
Linear Discriminant Analysis - a model generated by linear discriminant analysis for predicting the
response variable.
Classification Tree - a classification tree is fitted to the training data set, then pruned and tested.
Important Results:
Model Type                      In-Sample              Out of Sample
                                Misclass. Rate  AUC    Misclass. Rate  AUC
GLM (logit regression)          0.400           0.829  0.420           0.867
Generalized Additive Model      0.357           0.832  0.400           0.862
Linear Discriminant Analysis    0.620           0.827  0.870           0.867
Classification Tree             0.758           0.819  0.770           0.719
From the above results, the generalized additive model attains the lowest out-of-sample
misclassification rate (0.400), while the GLM and the LDA model share the highest out-of-sample AUC
(0.867). However, the scores of all models are close to each other.
2. GERMAN CREDIT DATASET
Model 1: Generalized Linear Model
A generalized linear model is fitted to the training dataset using a logit link. The binary response
variable (default/non-default) is modelled by treating the responses as binomial probabilities.
The effectiveness of the fit is assessed with the usual diagnostic plots: residuals vs fitted values,
scale-location, a normal Q-Q plot of the residuals, and residuals vs leverage.
The model is then validated with the testing data, and the confusion matrix, misclassification rate and
area under the ROC curve are obtained.
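Because the 5:1 cost is asymmetric, the cutoff used to turn the model's fitted probabilities into class predictions need not be 0.5. One standard choice, sketched below, is the expected-cost-minimising threshold: predict default when 5p > 1 - p, i.e. p > 1/6. The function names are illustrative; the report does not state which cutoff it used.

```python
def cost_optimal_cutoff(cost_fn=5.0, cost_fp=1.0):
    """Probability threshold that minimises expected cost.

    Predicting 'default' costs cost_fp when wrong; predicting
    'no default' costs cost_fn when wrong.  Predict default when
    p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def classify(probs, cost_fn=5.0, cost_fp=1.0):
    """Turn predicted default probabilities into 0/1 labels."""
    t = cost_optimal_cutoff(cost_fn, cost_fp)
    return [1 if p > t else 0 for p in probs]

print(cost_optimal_cutoff())         # ≈ 0.167
print(classify([0.05, 0.20, 0.90]))  # [0, 1, 1]
```

A low cutoff like this trades many extra false positives for fewer costly false negatives, which is consistent with the confusion matrices reported below.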
Confusion Matrix & ROC Curve (Out of Sample), cost 5:1:

          Predicted
Truth      0    1
  0       34   27
  1        3   36

Out of sample: AUC = 0.867; MR = 0.42. In-sample: AUC = 0.829; MR = 0.40.
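The AUC values quoted throughout this report can be computed directly from predicted scores via the rank (Mann-Whitney) formulation, without plotting the ROC curve. A minimal sketch, with illustrative names and toy data:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a randomly chosen defaulter (label 1)
    scores higher than a randomly chosen non-defaulter (label 0),
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy data: 3 of the 4 defaulter/non-defaulter pairs are ranked correctly
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values around 0.86 indicate a usefully discriminative model.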
3. Model 2: Generalized Additive Model
Unlike the linear model, the generalized additive model can be considered non-linear: a spline is fitted
to each continuous predictor variable, and these smooth terms are used to predict the responses. The
degrees of freedom of each spline depend on the combination of covariates within the variable.
The model is trained on the training data and the testing data is used for validation. For each of the
continuous variables, the spline plot is shown in figure b2; the transformations render the fitted effects
non-linear. The fitted model is first evaluated in-sample on the training dataset to obtain the ROC
curve, AUC and misclassification rate.
Figure b2: Spline plots of variables
The model is then validated with the testing data, and the confusion matrix, misclassification rate and
area under the ROC curve are obtained.
Confusion Matrix & ROC Curve (Out of Sample):

          Predicted
Truth      0    1
  0       36   25
  1        3   36
From the confusion matrix, the misclassification rate is calculated. Because a false negative costs five
times as much as a false positive, the reported rate is cost-weighted: (5 × number of false negatives +
number of false positives) / total number of observations. The asymmetric cost tunes the model toward
reducing false negatives.
Out of sample: misclassification rate = 0.400; area under ROC curve = 0.862 (cost 5:1)
In-sample: misclassification rate = 0.357; area under ROC curve = 0.832 (cost 5:1)
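With the 5:1 cost, the rates quoted in this report correspond to a cost-weighted count, (5·FN + FP)/N. A minimal sketch (the function name is illustrative) that reproduces the quoted out-of-sample figure from the GAM confusion matrix (FP = 25, FN = 3, N = 100):

```python
def weighted_misclassification_rate(fp, fn, n, cost_fn=5.0, cost_fp=1.0):
    """Cost-weighted misclassification rate: (cost_fn*FN + cost_fp*FP) / N.

    With the report's 5:1 false-negative : false-positive cost this
    reproduces the figures quoted for each model's confusion matrix.
    """
    return (cost_fn * fn + cost_fp * fp) / n

# GAM out-of-sample confusion matrix: FP = 25, FN = 3, N = 100
print(weighted_misclassification_rate(fp=25, fn=3, n=100))  # 0.4
```

The same formula recovers 0.42 for the GLM (FP = 27, FN = 3) and 0.87 for the LDA model (FP = 37, FN = 10), matching the results table.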
5. Model 3: Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a method to find a linear combination of features that characterizes
or separates two or more classes of objects or events. It is similar to GLMs and GAMs in the sense that
the binary response variables can be predicted by training the model with a training dataset and then
validating it with a testing set.
The discriminant coefficients are determined for each predictor variable, and the prior probabilities for
the two classes of the binary response are estimated.
As with the previous models, the model is fitted on the training dataset and then used to predict the
binary outcomes both in-sample and for the testing dataset. The area under the ROC curve and the
misclassification rate are calculated in order to determine the efficiency of the model.
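For a single predictor and two classes, the discriminant computation reduces to a few lines. This is a toy sketch assuming equal priors and a pooled within-class variance; the data and names are illustrative, not the report's fitted model.

```python
def lda_1d(x0, x1):
    """Fisher's linear discriminant for one feature, two classes.

    x0: feature values for class 0, x1: for class 1.  Assumes equal
    priors and a shared (pooled) variance, so the decision boundary
    falls at the midpoint of the class means.
    """
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    # pooled within-class variance
    ss = sum((v - m0) ** 2 for v in x0) + sum((v - m1) ** 2 for v in x1)
    var = ss / (len(x0) + len(x1) - 2)
    w = (m1 - m0) / var            # discriminant coefficient
    c = w * (m0 + m1) / 2          # threshold at the midpoint
    return lambda x: 1 if w * x > c else 0

# toy data: non-defaulters centred at 2, defaulters at 6
predict = lda_1d([1, 2, 3], [5, 6, 7])
print(predict(2), predict(6))  # 0 1
```

With many predictors the same idea uses class mean vectors and a pooled covariance matrix, and unequal priors shift the threshold away from the midpoint.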
Confusion Matrix & ROC Curve (Out of Sample):

          Predicted
Truth      0    1
  0       30   37
  1       10   23

Out of sample: misclassification rate = 0.870; area under ROC curve = 0.867 (cost 5:1)
In-sample: misclassification rate = 0.620; area under ROC curve = 0.827 (cost 5:1)
6. Model 4: Classification Tree:
Another way of modelling the binary output is a classification tree. The tree is grown with the predictor
variables (both continuous and categorical) as input, each internal node acting as a decision node; the
terminal nodes of the tree contain the predicted outputs.
Pruning the tree:
Pruning the tree is necessary to avoid overfitting and to minimize the cross-validated prediction error.
The initial tree is grown with a Cp value of 0.005 so as to obtain a large tree.
Figure a4: Plot of Cp values with Relative Error
The leftmost point of the graph below the horizontal line (drawn one standard error above the minimum
cross-validated error) is chosen as the optimal Cp value; in this case, that point corresponds to a Cp of
0.01.
The classification tree is regrown with this new Cp value on the training data set. After pruning, the
tree is used to predict outcomes for the testing dataset (out-of-sample testing), and the
misclassification rate and the area under the ROC curve are calculated.
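The one-standard-error selection described above can be sketched as follows, using a hypothetical rpart-style Cp table. The table values and the function name are illustrative; only the rule itself comes from the report.

```python
def one_se_cp(cp_table):
    """Pick Cp by the one-standard-error rule.

    cp_table: list of (cp, xerror, xstd) rows from cross-validation.
    Choose the simplest tree (largest cp) whose cross-validated
    error is within one standard error of the minimum error.
    """
    min_err, min_std = min((err, std) for _, err, std in cp_table)
    threshold = min_err + min_std
    # largest cp (fewest splits) still under the threshold
    return max(cp for cp, err, _ in cp_table if err <= threshold)

# hypothetical cp table: (cp, cross-validated error, its std. error)
table = [(0.050, 1.00, 0.05),
         (0.010, 0.82, 0.04),
         (0.005, 0.80, 0.04)]
print(one_se_cp(table))  # 0.01
```

On this toy table the minimum error is 0.80 ± 0.04, so any tree with error at or below 0.84 qualifies, and the simplest such tree (Cp = 0.01) is selected, mirroring the choice made in the report.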
7. Visual Representation of the tree
Confusion Matrix & ROC Curve (Out of Sample):

          Predicted
Truth      0    1
  0       34   27
  1       10   29

Out of sample: misclassification rate = 0.770; area under ROC curve = 0.719
In-sample: misclassification rate = 0.758; area under ROC curve = 0.819
Conclusion:
Comparing the various models, the generalized linear model with a logit link and the linear
discriminant analysis model share the best out-of-sample area under the ROC curve (0.867), while the
generalized additive model attains the lowest cost-weighted misclassification rate (0.400). There is no
large difference among the models, however: their misclassification rates and AUCs are close, and if
another stratified sample were chosen, a different model could end up as the best.