German Credit Data Models Compared
Executive Summary:
German Credit Data:
The objective of this report is to compare the various models that can be fitted to the German Credit
dataset. The dataset records whether each of a number of loan applicants defaulted or not. The response
variable is binary (0/1), which makes this a typical classification problem. Each observation is an
applicant, and each applicant either repays the loan (no default) or does not repay (loss to the bank).
The dataset contains 21 variables including the response variable: 13 qualitative variables, viz. Status
of checking account, Credit history, Purpose, Savings account/bonds, Present employment since, Personal
status and sex, Other debtors/guarantors, Property, Other installment plans, Housing, Job, Telephone,
Foreign worker; and 7 numerical variables, viz. Duration in months, Credit amount, Installment rate as a
percentage of disposable income, Present residence since, Age in years, Number of existing credits at this
bank, Number of people liable to provide maintenance for.
The original dataset contains 1000 observations with 21 variables (including the response variable). The
dataset is split into training and testing sets using stratified random sampling: 90% of the data is used
for training and the remaining 10% for testing (training dataset = 900 observations, testing dataset = 100
observations). Several types of models are then fitted and compared on misclassification rate, area under
the ROC curve (AUC) and mean residual deviance in order to identify the best one.
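As an illustration of the stratified 90/10 split described above, a minimal Python sketch follows. The
function name stratified_split, the seed and the 700/300 class counts are assumptions made for this
example; the report does not show the code that was actually used.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.9, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # random order within each class
        n_train = round(len(idx) * train_fraction)
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Illustrative class counts (assumed): 700 non-defaults (0) and 300 defaults (1).
labels = [0] * 700 + [1] * 300
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 900 100
```

Sampling within each class separately keeps the default rate of the test set equal to that of the
training set, which a simple random split does not guarantee.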
The misclassification rate is cost-weighted: a false negative is charged five times the cost of a false
positive (5:1).
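The 5:1 cost can be sketched in Python as below. The function names are hypothetical, class 1 is assumed
to be the positive class, and the cutoff of 1/6 is an assumption: it is the threshold at which the
expected cost of the two decisions is equal under a 5:1 cost, but the report does not state which cutoff
was used.

```python
def classify(probabilities, cutoff=1 / 6):
    """Turn predicted probabilities of class 1 into 0/1 labels (cutoff is an assumption)."""
    return [1 if p > cutoff else 0 for p in probabilities]


def weighted_misclassification(truth, predicted, fn_cost=5.0, fp_cost=1.0):
    """Average cost per observation: (fn_cost * FN + fp_cost * FP) / N."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    return (fn_cost * fn + fp_cost * fp) / len(truth)


# Toy values, invented for the example.
truth = [1, 1, 0, 0, 0]
probabilities = [0.9, 0.1, 0.3, 0.05, 0.5]
predicted = classify(probabilities)
cost = weighted_misclassification(truth, predicted)
print(predicted, cost)
```

Because a missed positive is penalised five times as heavily, a low cutoff that flags more applicants as
class 1 can give a lower cost than the usual 0.5 cutoff.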
Models chosen:
Generalized Linear Model - A generalized linear model is fitted to the training data using the binomial
family with the logit link.
Generalized Additive Model - A generalized additive model is fitted to the training data using splines.
(Splines are applied to the continuous predictor variables.)
Linear Discriminant Analysis - A model is generated using linear discriminant analysis for predicting the
response variable.
Classification Tree - A classification tree is fitted to the training dataset. The tree is grown, pruned
and tested.
Important Results:
                              In-Sample                       Out of Sample
Model Type                    Misclassification Rate  AUC     Misclassification Rate  AUC
GLM – logit regression        0.400                   0.829   0.420                   0.867
Generalized Additive Model    0.357                   0.832   0.400                   0.862
Linear Discriminant Analysis  0.620                   0.827   0.870                   0.867
Classification Tree           0.758                   0.819   0.770                   0.719
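From the above results, the generalized additive model has the lowest misclassification rate, both
in-sample (0.357) and out of sample (0.400), while the GLM and LDA share the highest out-of-sample AUC
(0.867). The AUCs are close to each other, except for the out-of-sample AUC of the classification tree
(0.719), but the misclassification rates differ considerably between models.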
GERMAN CREDIT DATASET
Model 1: Generalized Linear Model
A generalized linear model is fitted to the training dataset using a logit link. The binary response
variable (default/non-default) is treated as a binomial outcome, and the model predicts the probability of
each class.
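The idea of a binomial model with a logit link can be sketched in Python as follows. This is not the
report's code: the fit below uses plain gradient ascent on the log-likelihood (statistical packages
usually use iteratively reweighted least squares), and the single predictor and its values are invented
for the example.

```python
import math


def sigmoid(z):
    """Inverse of the logit link, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(beta, xi):
    """P(y = 1 | x) for coefficients beta = [intercept, b1, b2, ...]."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))


def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    """Fit logistic regression by gradient ascent on the mean Bernoulli log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)  # residual on the probability scale
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


# Invented toy data: one scaled predictor, classes overlap so the fit is finite.
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta = fit_logistic(X, y)
print(beta, predict_proba(beta, [3.5]))
```

The fitted probabilities, rather than hard 0/1 labels, are what the ROC curve and the cost-based cutoff
operate on.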
The following diagnostic plots show the quality of the fit: residuals versus fitted values,
scale-location, normality of the residuals, and residuals versus leverage.
The model is then validated with the testing data, and the confusion matrix, misclassification rate and
area under the ROC curve are obtained.
Confusion Matrix & ROC Curve (Out of Sample):
         Predicted
Truth    0     1
0        34    27
1        3     36
Cost (5:1)
Out of sample: AUC = 0.867; MR = 0.42
In-sample: AUC = 0.829; MR = 0.40
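For reference, the area under the ROC curve can be computed directly from predicted scores, as in the
Python sketch below. It uses the rank interpretation of the AUC (the probability that a randomly chosen
positive case scores higher than a randomly chosen negative case); the function name and the toy scores
are assumptions for the example and are not the report's data.

```python
def roc_auc(truth, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Unlike the misclassification rate, the AUC does not depend on the cutoff or on the 5:1 cost, which is why
the two measures can rank the models differently.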
Model 2: Generalized additive model
Unlike the linear model, the generalized additive model can capture non-linear relationships. Splines
are fitted to the predictor variables and the fitted smooth terms are then used to predict the response.
The splines are applied only to the continuous predictor variables. The degrees of freedom of each spline
depend on the combination of covariates within the variable.
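To make the role of a spline concrete, the Python sketch below expands one continuous predictor into a
cubic truncated-power basis. This is a simplified stand-in: GAM software typically fits penalized
smoothing splines, which this sketch omits, and the knot positions and ages used here are invented.

```python
def cubic_spline_basis(x, knots):
    """Basis columns for one value: [x, x^2, x^3, (x - k1)+^3, (x - k2)+^3, ...]."""
    return [x, x ** 2, x ** 3] + [max(0.0, x - k) ** 3 for k in knots]


# Hypothetical ages and knots, chosen only for illustration.
knots = [25, 40]
ages = [22, 30, 55]
design = [cubic_spline_basis(age, knots) for age in ages]
for age, row in zip(ages, design):
    print(age, row)
```

A linear model fitted on these columns is linear in its coefficients but traces a smooth curve in the
original variable; the number of basis columns plays the part of the degrees of freedom of the spline.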
The model is trained using the training data, and the testing data is used for validation of the additive
model. For each of the variables, the spline plot is obtained as shown in figure b2. The plots show that
the fitted relationships are non-linear. The additive model is first evaluated on the training dataset to
obtain the in-sample ROC curve, AUC and misclassification rate.
Figure b2: Spline plots of variables
The model is then validated with the testing data, and the confusion matrix, misclassification rate and
area under the ROC curve are obtained.
Confusion Matrix & ROC Curve (Out of Sample):
         Predicted
Truth    0     1
0        36    25
1        3     36
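From the confusion matrix, the misclassification rate is calculated. The unweighted rate is the ratio
(number of false positives + number of false negatives) / total number of observations. The rates reported
here apply the 5:1 cost, i.e. (5 x number of false negatives + number of false positives) / total number
of observations; for the matrix above this gives (5 x 3 + 25) / 100 = 0.400.
The cost of a false negative is higher than that of a false positive, and hence the model is tuned towards
reducing false negatives.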
Out of Sample : Misclassification rate = 0.400; Area under ROC Curve = 0.862 (Cost – 5:1)
In-Sample : Misclassification rate = 0.357; Area under ROC Curve = 0.832 (Cost – 5:1)
Model 3: Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a method to find a linear combination of features that characterizes
or separates two or more classes of objects or events. It is similar to GLMs and GAMs in the sense that
the binary response variables can be predicted by training the model with a training dataset and then
validating it with a testing set.
The discriminant coefficients are determined for each of the predictor variables, and the prior
probabilities of the two classes of the binary response variable are estimated.
As with the previous models, the model is trained using the training dataset and is first used to predict
the binary outcomes for the training dataset (in-sample). The area under the ROC curve and the
misclassification rate are also calculated in order to assess the performance of the model.
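The roles of the discriminant coefficients and the prior probabilities can be seen in the following Python
sketch of two-class LDA. It is restricted to two predictors so that the pooled covariance matrix can be
inverted by hand; the function names and the toy data are assumptions for the example, not the report's
implementation.

```python
import math


def fit_lda(X, y):
    """Two-class LDA with two features: returns (w, b) so that log-odds of class 1 = w.x + b."""
    groups = {c: [x for x, t in zip(X, y) if t == c] for c in (0, 1)}
    n = len(X)
    means = {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}
    priors = {c: len(rows) / n for c, rows in groups.items()}

    # Pooled within-class covariance matrix (divisor n - 2 for two classes).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for c, rows in groups.items():
        for x in rows:
            d = [x[0] - means[c][0], x[1] - means[c][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j] / (n - 2)

    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]

    diff = [means[1][i] - means[0][i] for i in range(2)]
    w = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    mid = [(means[1][i] + means[0][i]) / 2 for i in range(2)]
    b = -(w[0] * mid[0] + w[1] * mid[1]) + math.log(priors[1] / priors[0])
    return w, b


def lda_posterior(w, b, x):
    """Posterior probability of class 1 implied by the linear discriminant."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))


# Invented, well-separated toy data with equal class sizes.
X = [[1, 2], [2, 1], [2, 3], [3, 2], [4, 5], [5, 4], [5, 6], [6, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_lda(X, y)
print(w, b, lda_posterior(w, b, [3.5, 3.5]))
```

The priors only shift the intercept b, so changing them moves the decision boundary without changing its
direction.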
Confusion Matrix & ROC Curve (Out of Sample):
         Predicted
Truth    0     1
0        30    37
1        10    23
Out of sample : Misclassification rate = 0.870; Area under ROC Curve = 0.867 (Cost – 5:1)
In-sample : Misclassification rate = 0.620; Area under ROC Curve = 0.827 (Cost – 5:1)
Model 4: Classification Tree
Another way of modelling the binary response is to employ a classification tree.
A classification tree is generated with the predictor variables (both continuous and categorical) as
input, with each internal node of the tree acting as a decision node. The terminal nodes of the
classification tree contain the predicted outputs.
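How a single decision node might choose its split can be sketched in Python as below. The sketch assumes
Gini impurity as the splitting criterion (the report does not name the criterion used), handles one
numeric predictor only, and uses invented durations and labels.

```python
def gini(labels):
    """Gini impurity of a set of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value < threshold', or (None, gini) if none helps."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Invented example: loan durations in months and default flags.
durations = [6, 12, 18, 24, 36, 48]
defaults = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(durations, defaults)
print(threshold, impurity)  # 21.0 0.0
```

A full tree repeats this search over all predictors at every node and stops according to a complexity
parameter such as Cp.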
Pruning the tree:
Pruning of the tree is necessary to avoid overfitting and to obtain a minimum average sum squared error.
The initial tree is generated with a Cp value of 0.005 so as to obtain a large tree.
Figure a4: Plot of Cp values with Relative Error
The leftmost point of the graph below the horizontal line (drawn one standard error above the minimum
error) is chosen as the optimum Cp value. In this case, the leftmost such point is at a Cp of 0.01.
The classification tree is regenerated with this new Cp value, using the training dataset. After pruning,
the tree is used to predict outcomes for the testing dataset (out-of-sample testing). From these
predictions, the misclassification rate and the area under the ROC curve are calculated.
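The one-standard-error rule used to pick Cp can be written out as a short Python sketch. The table below
is entirely hypothetical (it is not the Cp table behind Figure a4), and the column layout (Cp, error,
standard error of that error) is an assumption about how such a table is organised.

```python
def one_se_cp(cp_table):
    """Pick the largest Cp (simplest tree) whose error is within one SE of the minimum error.

    cp_table: list of (cp, error, error_std) rows.
    """
    best = min(cp_table, key=lambda row: row[1])
    threshold = best[1] + best[2]  # the horizontal line in the Cp plot
    eligible = [row for row in cp_table if row[1] <= threshold]
    return max(eligible, key=lambda row: row[0])[0]


# Hypothetical values, invented for the example.
cp_table = [
    (0.050, 1.00, 0.040),
    (0.020, 0.85, 0.038),
    (0.010, 0.84, 0.037),
    (0.005, 0.82, 0.036),
]
chosen_cp = one_se_cp(cp_table)
print(chosen_cp)
```

Choosing the largest eligible Cp corresponds to the leftmost point under the line when Cp decreases from
left to right on the plot, and it favours the smallest tree whose error is statistically indistinguishable
from the best one.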
Visual Representation of the tree
Confusion Matrix & ROC Curve (Out of Sample):
         Predicted
Truth    0     1
0        34    27
1        10    29
Out of sample : Misclassification Rate = 0.770; Area under the ROC curve = 0.719
In-sample : Misclassification Rate = 0.758; Area under the ROC curve = 0.819
Conclusion:
• Comparing the various models, the generalized linear model with a logit link and the linear
discriminant analysis model have the best out-of-sample area under the ROC curve (0.867).
• The generalized additive model has the lowest out-of-sample misclassification rate, 0.40. The AUCs of
the GLM, GAM and LDA are close to each other, but the misclassification rates differ widely, from 0.40
(GAM) to 0.87 (LDA).
• If another stratified sample is chosen, a different model could end up as the best model.

More Related Content

What's hot

What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?Smarten Augmented Analytics
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use casesSridhar Ratakonda
 
Linear Regression in R
Linear Regression in RLinear Regression in R
Linear Regression in REdureka!
 
Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selectionAndrea Dal Pozzolo
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?Smarten Augmented Analytics
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...Smarten Augmented Analytics
 
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...Smarten Augmented Analytics
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...Smarten Augmented Analytics
 
Class imbalance problem1
Class imbalance problem1Class imbalance problem1
Class imbalance problem1chs71
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?Smarten Augmented Analytics
 
Handling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingHandling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingIRJET Journal
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values Smarten Augmented Analytics
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...Smarten Augmented Analytics
 
Predictive model based on Supervised ML
Predictive model based on Supervised MLPredictive model based on Supervised ML
Predictive model based on Supervised MLUmeshchandraYadav5
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
Biostatistics Workshop: Missing Data
Biostatistics Workshop: Missing DataBiostatistics Workshop: Missing Data
Biostatistics Workshop: Missing DataHopkinsCFAR
 
Multiclass classification of imbalanced data
Multiclass classification of imbalanced dataMulticlass classification of imbalanced data
Multiclass classification of imbalanced dataSaurabhWani6
 
Anomaly detection- Credit Card Fraud Detection
Anomaly detection- Credit Card Fraud DetectionAnomaly detection- Credit Card Fraud Detection
Anomaly detection- Credit Card Fraud DetectionLipsa Panda
 

What's hot (20)

What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?
 
Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use cases
 
Linear Regression in R
Linear Regression in RLinear Regression in R
Linear Regression in R
 
Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selection
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
 
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
 
Class imbalance problem1
Class imbalance problem1Class imbalance problem1
Class imbalance problem1
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
 
Handling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingHandling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random Undersampling
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
 
Predictive model based on Supervised ML
Predictive model based on Supervised MLPredictive model based on Supervised ML
Predictive model based on Supervised ML
 
Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Biostatistics Workshop: Missing Data
Biostatistics Workshop: Missing DataBiostatistics Workshop: Missing Data
Biostatistics Workshop: Missing Data
 
Multiclass classification of imbalanced data
Multiclass classification of imbalanced dataMulticlass classification of imbalanced data
Multiclass classification of imbalanced data
 
Machine learning session1
Machine learning   session1Machine learning   session1
Machine learning session1
 
Anomaly detection- Credit Card Fraud Detection
Anomaly detection- Credit Card Fraud DetectionAnomaly detection- Credit Card Fraud Detection
Anomaly detection- Credit Card Fraud Detection
 

Similar to German Credit Data Models Compared

Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom IndustrySatyam Barsaiyan
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationAsadJaved304231
 
Machine learning in credit risk modeling : a James white paper
Machine learning in credit risk modeling : a James white paperMachine learning in credit risk modeling : a James white paper
Machine learning in credit risk modeling : a James white paperJames by CrowdProcess
 
Regression Analysis and model comparison on the Boston Housing Data
Regression Analysis and model comparison on the Boston Housing DataRegression Analysis and model comparison on the Boston Housing Data
Regression Analysis and model comparison on the Boston Housing DataShivaram Prakash
 
Heart disease classification
Heart disease classificationHeart disease classification
Heart disease classificationSnehaDey21
 
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACROBOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACROAnthony Kilili
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Dr Athar Khan
 
Predicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesPredicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesAdrián Vallés
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 

Similar to German Credit Data Models Compared (20)

Dm
DmDm
Dm
 
Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
 
SEM
SEMSEM
SEM
 
report
reportreport
report
 
Machine learning project
Machine learning project Machine learning project
Machine learning project
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical information
 
Machine learning in credit risk modeling : a James white paper
Machine learning in credit risk modeling : a James white paperMachine learning in credit risk modeling : a James white paper
Machine learning in credit risk modeling : a James white paper
 
Regression Analysis and model comparison on the Boston Housing Data
Regression Analysis and model comparison on the Boston Housing DataRegression Analysis and model comparison on the Boston Housing Data
Regression Analysis and model comparison on the Boston Housing Data
 
German credit data analysis
German credit data analysisGerman credit data analysis
German credit data analysis
 
Heart disease classification
Heart disease classificationHeart disease classification
Heart disease classification
 
Multiple Regression
Multiple RegressionMultiple Regression
Multiple Regression
 
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACROBOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
BOOTSTRAPPING TO EVALUATE RESPONSE MODELS: A SAS® MACRO
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
MidTerm memo
MidTerm memoMidTerm memo
MidTerm memo
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression
 
Predicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesPredicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian Valles
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Sem with amos ii
Sem with amos iiSem with amos ii
Sem with amos ii
 

Recently uploaded

vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 

Recently uploaded (20)

vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 

German Credit Data Models Compared

  • 1. Executive Summary: German Credit Data: The objective of this report is to analyze the various models that can be fitted to the German Credit Score Dataset. The dataset contains information about the defaulting/non-defaulting criterion for several companies. The response variable is binary (0/1) which renders this as a typical classification problem. Each observation is considered as an applicant and each applicant has a chance of repaying the loan (no default) or not repaying (loss to the bank). The dataset contains 21 variables including the response variable; qualitative variables viz. Status of checking account, Credit history, Purpose, Savings account/bonds, Present employment since, Personal status and sex, Other debtors/guarantors, Property, Other installment plans, Housing, Job, Telephone, Foreign Worker; Numerical Variables viz. Duration in month, Credit Amount, Installment rate in terms of percentage of disposable income, Present resident since, Age in years, Number of existing credits at this bank, Number of people liable to provide maintenance for. The original dataset contains 1000 observations with 21 variables (including the response variable). The dataset is split into training and testing using stratified random sampling, where 90% of the entire data is fixed as training data and the rest 10% is used for testing. (Training dataset = 900 observations, Testing dataset = 100 observations). The data is then modelled with different types of models to analyze and obtain the best model by finding the least misclassification rate, area under the ROC curve and the mean residual deviance. For generating the misclassification rate, a cost of 5:1 has been specified for False negative : False positive. Models chosen: Generalized Linear Model- A generalized linear model is fitted to the training data using binomial family the logit link. Generalized Additive Model- A generalized additive model is fitted to the training data using splines. 
(Splines are applied to the continuous predictor variables only.) Linear Discriminant Analysis - A model is generated using linear discriminant analysis for predicting the response variable. Classification Tree - A classification tree is fitted to the training data set. The tree is grown, pruned and tested. Important Results:

      Model Type                      In-Sample                    Out-of-Sample
                                      Misclassification   AUC      Misclassification   AUC
                                      Rate                         Rate
      GLM (logit regression)          0.400               0.829    0.420               0.867
      Generalized Additive Model      0.357               0.832    0.400               0.862
      Linear Discriminant Analysis    0.620               0.827    0.870               0.867
      Classification Tree             0.758               0.819    0.770               0.719

    From the above results, the Generalized Additive Model has the lowest misclassification rate both in-sample (0.357) and out-of-sample (0.400), while the GLM and LDA models share the highest out-of-sample AUC (0.867). However, the GLM and GAM scores are close to each other.
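The stratified 90/10 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the report's own code; the function name `stratified_split`, the seed and the synthetic label vector (300 defaulters out of 1000) are assumptions made for the example.

```python
import random

def stratified_split(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of each class for testing."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # random order within the class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])         # first n_test go to the test set
        train_idx.extend(idx[n_test:])        # the rest go to the training set
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels for illustration only (assumed class balance, not read from the data file).
labels = [1] * 300 + [0] * 700
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the sampling is done within each class, the proportion of defaulters is the same in the training and testing sets, which is the point of stratifying.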
  • 2. GERMAN CREDIT DATASET Model 1: Generalized Linear Model. A generalized linear model is fitted to the training dataset using a logit link. The binary response (default/no default) is modelled as a binomial outcome, so the model predicts a probability of default for each applicant. The following graphs assess the fit by comparing the fitted values with the residuals, the scale of the residuals, the normality of the residuals, and leverage against residuals. The model is then validated on the testing data, and the confusion matrix, misclassification rate and area under the ROC curve are obtained. Confusion Matrix & ROC Curve (Out of Sample):

                   Predicted 0    Predicted 1
      Truth 0      34             27
      Truth 1      3              36

    Cost (5:1). Out of sample: AUC = 0.867; MR = 0.42. In-sample: AUC = 0.829; MR = 0.40.
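The area under the ROC curve quoted above can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A minimal sketch of that pairwise definition is given below; the function name `auc` and the toy scores are illustrative and are not taken from the report.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one observation of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # tie
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities and true classes, for illustration only.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of observations, which is acceptable for a 100-row test set; a rank-based formula would be the usual choice for larger data.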
  • 3. Model 2: Generalized additive model. Unlike the linear model, the generalized additive model can capture non-linear relationships between the predictors and the response. Splines are fitted to the continuous predictor variables only, and the resulting smooth terms are used to predict the response. The degrees of freedom of each spline depend on the combination of covariate values within the variable. The model is trained on the training data and the testing data is used to validate it. For each of the variables, the spline plot is shown in Figure b2. The fitted splines show that the variables enter the model non-linearly. Once the additive model is fitted, it is first evaluated on the training dataset to obtain the in-sample ROC curve, AUC and misclassification rate. Figure b2: Spline plots of variables
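The report does not say which spline basis or software was used. Purely as an illustration of how a spline lets a single continuous predictor act non-linearly, the sketch below expands one value into a truncated-power cubic basis; the function name `cubic_spline_basis` and the knot positions are hypothetical.

```python
def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis: [1, x, x^2, x^3, (x - k1)+^3, (x - k2)+^3, ...]."""
    row = [1.0, x, x ** 2, x ** 3]
    # Each knot adds a term that is zero to its left and cubic to its right.
    row.extend(max(x - k, 0.0) ** 3 for k in knots)
    return row

# Hypothetical knots for a "duration in months" predictor.
knots = [12.0, 24.0, 36.0]
row = cubic_spline_basis(18.0, knots)
print(row)
```

A model that is linear in these columns is a piecewise cubic in the original variable, and the number of columns plays the role of the degrees of freedom of the spline.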
  • 4. The model is then validated on the testing data, and the confusion matrix, misclassification rate and area under the ROC curve are obtained. Confusion Matrix & ROC Curve (Out of Sample):

                   Predicted 0    Predicted 1
      Truth 0      36             25
      Truth 1      3              36

    From the confusion matrix, the misclassification rate is calculated. With the 5:1 cost, the reported rate corresponds to (number of false positives + 5 × number of false negatives) / total number of observations; here (25 + 5 × 3) / 100 = 0.400. The cost of a false negative is higher than that of a false positive, and hence the model is tuned towards reducing false negatives.
    Out of sample: Misclassification rate = 0.400; Area under ROC curve = 0.862 (Cost 5:1)
    In-sample: Misclassification rate = 0.357; Area under ROC curve = 0.832 (Cost 5:1)
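As a check on the figures above, the following sketch computes both the plain error ratio and a 5:1 cost-weighted version from a 2x2 confusion matrix. The reported out-of-sample value of 0.400 matches the cost-weighted version. The function name `misclassification_rates` and the row/column convention (rows are truth, columns are predictions) are assumptions of this sketch.

```python
def misclassification_rates(matrix, fn_cost=5.0, fp_cost=1.0):
    """matrix[truth][predicted] for classes 0 and 1. Returns (plain, cost_weighted)."""
    fp = matrix[0][1]                       # truth 0, predicted 1
    fn = matrix[1][0]                       # truth 1, predicted 0
    total = sum(sum(r) for r in matrix)
    plain = (fp + fn) / total
    weighted = (fp_cost * fp + fn_cost * fn) / total
    return plain, weighted

# Out-of-sample confusion matrix of the generalized additive model, from the slide above.
gam_test = [[36, 25], [3, 36]]
plain, weighted = misclassification_rates(gam_test)
print(plain, weighted)
```

The same function applied to the other three out-of-sample confusion matrices in this report gives 0.42 (GLM), 0.87 (LDA) and 0.77 (classification tree).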
  • 5. Model 3: Linear Discriminant Analysis. Linear discriminant analysis (LDA) is a method for finding a linear combination of features that characterizes or separates two or more classes of objects or events. It is similar to GLMs and GAMs in the sense that the binary response can be predicted by training the model on a training dataset and then validating it on a testing set. The coefficients of the linear discriminant are determined for each of the predictor variables, and the prior probabilities of the two classes of the binary response are estimated. As with the previous models, the model is trained on the training dataset and is first used to predict the binary outcomes for the training dataset. The area under the ROC curve and the misclassification rate are also calculated in order to assess the performance of the model. Confusion Matrix & ROC Curve (Out of Sample):

                   Predicted 0    Predicted 1
      Truth 0      30             37
      Truth 1      10             23

    Out of sample: Misclassification rate = 0.870; Area under ROC curve = 0.867 (Cost 5:1)
    In-sample: Misclassification rate = 0.620; Area under ROC curve = 0.827 (Cost 5:1)
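To make the "coefficients of the discriminant" and "prior probabilities" concrete, here is a minimal two-class LDA for two features, written in plain Python. It is a sketch under simplifying assumptions (two numeric predictors, a pooled covariance matrix, a decision threshold that ignores the 5:1 cost); the names `fit_lda` and `predict_lda` and the toy data are hypothetical and unrelated to the German credit data.

```python
import math

def fit_lda(X, y):
    """Two-class LDA on two features. Returns (w, b): predict 1 when w.x + b > 0."""
    groups = {c: [x for x, t in zip(X, y) if t == c] for c in (0, 1)}
    n = len(X)
    means = {c: [sum(col) / len(rows) for col in zip(*rows)]
             for c, rows in groups.items()}
    # Pooled within-class covariance matrix (2x2), shared by both classes.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for c, rows in groups.items():
        for x in rows:
            d = [x[0] - means[c][0], x[1] - means[c][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j] / (n - 2)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [means[1][k] - means[0][k] for k in range(2)]
    # Discriminant coefficients: inverse covariance times difference of class means.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(means[0][k] + means[1][k]) / 2 for k in range(2)]
    prior1 = len(groups[1]) / n              # prior probability of class 1
    b = -(w[0] * mid[0] + w[1] * mid[1]) + math.log(prior1 / (1 - prior1))
    return w, b

def predict_lda(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Toy data for illustration only.
X = [(1, 2), (2, 1), (2, 3), (3, 2), (6, 5), (7, 7), (8, 6), (7, 5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_lda(X, y)
print([predict_lda(w, b, x) for x in X])
```

With an asymmetric cost such as 5:1, the threshold on `w.x + b` would normally be shifted so that fewer defaulters are missed; that adjustment is left out here.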
  • 6. Model 4: Classification Tree. Another way of modelling the binary output is to use a classification tree. A classification tree is generated with the predictor variables (both continuous and categorical) as input, with each internal node acting as a decision node. The terminal nodes of the classification tree contain the predicted outputs. Pruning the tree: Pruning is necessary to avoid overfitting and to keep the prediction error of the tree at a minimum. The initial tree is generated with a Cp value of 0.005 so as to obtain a large tree. Figure a4: Plot of Cp values with Relative Error. The leftmost point of the graph that lies below the horizontal line (drawn one standard error above the minimum value) is chosen as the optimum Cp value. In this case, the leftmost such point is a Cp of 0.01. The classification tree is regenerated with this new Cp value on the training data set. After pruning, the tree is used to predict outcomes on the testing dataset (out-of-sample testing). From these predictions, the misclassification rate and the area under the ROC curve are calculated.
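The selection rule described above (take the leftmost point under the one-standard-error line) can be sketched as follows, assuming the plot orders Cp from large on the left to small on the right, so that "leftmost" means the largest Cp and therefore the smallest tree. The function name `choose_cp_one_se` and every number in `cp_table` are hypothetical; the report only states the starting Cp of 0.005 and the chosen Cp of 0.01.

```python
def choose_cp_one_se(cp_table):
    """cp_table: list of (cp, xerror, xstd) rows.

    Returns the largest cp whose cross-validated error is within one
    standard error of the minimum cross-validated error.
    """
    best = min(cp_table, key=lambda row: row[1])   # row with the minimum xerror
    threshold = best[1] + best[2]                  # the horizontal line
    eligible = [cp for cp, xerror, _ in cp_table if xerror <= threshold]
    return max(eligible)

# Made-up complexity table for illustration: (cp, cross-validated error, its std. error).
cp_table = [
    (0.050, 1.00, 0.050),
    (0.020, 0.92, 0.048),
    (0.010, 0.88, 0.047),
    (0.007, 0.86, 0.047),
    (0.005, 0.87, 0.047),
]
chosen_cp = choose_cp_one_se(cp_table)
print(chosen_cp)
```

Choosing the largest eligible Cp rather than the Cp with the minimum error favours the simpler tree whenever the difference in error is within sampling noise.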
  • 7. Visual Representation of the tree. Confusion Matrix & ROC Curve (Out of Sample):

                   Predicted 0    Predicted 1
      Truth 0      34             27
      Truth 1      10             29

    Out of sample: Misclassification rate = 0.770; Area under the ROC curve = 0.719
    In-sample: Misclassification rate = 0.758; Area under the ROC curve = 0.819
    Conclusion:
      - Comparing the models, the generalized linear model with a logit link and the Linear Discriminant Analysis model have the best out-of-sample area under the ROC curve (0.867).
      - The Generalized Additive Model has the lowest out-of-sample misclassification rate, 0.40. However, it is hard to separate from the GLM, since their misclassification rates (0.40 and 0.42) and AUCs (0.862 and 0.867) are close.
      - If another stratified sample is chosen, a different model could end up as the best model.