Credit Risk Assessment using Machine Learning Techniques with WEKA
1. German Credit Data
Members
Mehnaz Newaz, mnewaz@ryerson.ca
Mashfiq Shahriar, mshahriar@ryerson.ca
Summary: The goal of this project is to obtain a machine learning model to perform credit scoring. Credit risk assessment of an applicant is vital to the banking sector. There are 20 attributes used in judging a loan applicant (7 numerical, 13 categorical/nominal). The goal is to classify the applicant into one of two categories: Good or Bad. A few algorithms were tested for accuracy, and the best model was chosen; its predictions were then analyzed to see how well the model performed.
Workload Distribution:
Member Name: List of Tasks Performed

Mehnaz Newaz (50%), equal distribution of work:
● Data Preparation
● Predictive Modeling/Classification
● Post-prediction Analysis
● Conclusions and Recommendations

Mashfiq Shahriar (50%), equal distribution of work:
● Data Preparation
● Predictive Modeling/Classification
● Post-prediction Analysis
● Conclusions and Recommendations
2. Data Preparation:
I. Look at the attribute type; e.g., nominal, ordinal or quantitative.
@relation 'german_credit-weka.filters.unsupervised.attribute.Reorder-R2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,1'
◦ @attribute 'Account Balance' {1,2,3,4}
◦ @attribute 'Duration of Credit (month)' numeric
◦ @attribute 'Payment Status of Previous Credit' {0,1,2,3,4}
◦ @attribute Purpose {0,1,2,3,4,5,6,8,9,10}
◦ @attribute 'Credit Amount' numeric
◦ @attribute 'Value Savings/Stocks' {1,2,3,4,5}
◦ @attribute 'Length of current employment' {1,2,3,4,5}
◦ @attribute 'Instalment per cent' numeric
◦ @attribute 'Sex & Marital Status' {1,2,3,4}
◦ @attribute Guarantors {1,2,3}
◦ @attribute 'Duration in Current address' {1,2,3,4}
◦ @attribute 'Most valuable available asset' {1,2,3,4}
◦ @attribute 'Age (years)' numeric
◦ @attribute 'Concurrent Credits' {1,2,3}
◦ @attribute 'Type of apartment' {1,2,3}
◦ @attribute 'No of Credits at this Bank' numeric
◦ @attribute Occupation {1,2,3,4}
◦ @attribute 'No of dependents' numeric
◦ @attribute Telephone {1,2}
◦ @attribute 'Foreign Worker' {1,2}
◦ @attribute Creditability {0,1}
II. Find max, min, mean and standard deviation of attributes.
III. Determine any outlier values (records) for each of the attributes; the attributes under consideration that did have outliers are reported below.
```{r data,echo=FALSE,message=FALSE,warning=FALSE}
## Importing packages
library(MASS)
library(car)
library(caret)
library(randomForest)
library(ROCR)
library(e1071)

## Loading data into environment
data <- read.csv("~/Desktop/FinalProject/CIND119/german_credit_card/german_credit.csv",
                 header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
german <- c("CreditStatus", "Checking_Status", "Duration", "Credit_history", "Purpose",
            "Credit_amount", "Savings_status", "Employment", "Installment_Commitment",
            "Personal_status", "Other_parties", "Residence_since", "Property_Magnitude",
            "Age", "Other_payment_plans", "Housing", "Existing_credits", "Job",
            "Num_dependents", "Own_telephone", "Foreign_worker")
names(data) <- german

## Variable names and structure of the dataset
names(data)
str(data)
```
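Items II and III can be sketched in R as follows, assuming `data` has been loaded and renamed as in the chunk above; the 1.5 × IQR boxplot rule used here is one common outlier criterion, not the only one:

```r
## Max, min, mean and standard deviation of the numeric attributes
num_cols <- c("Duration", "Credit_amount", "Installment_Commitment",
              "Residence_since", "Age", "Existing_credits", "Num_dependents")
stats <- t(sapply(data[num_cols], function(x)
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))))
print(round(stats, 2))

## Flag outlying records per attribute with the 1.5 * IQR rule
outliers <- lapply(data[num_cols], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})
sapply(outliers, length)  # number of outlying records per attribute
```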
11. I. Analyze the distribution of numeric attributes (normal or other). Plot histograms for attributes of concern and analyze whether they have any influence on the class attribute.
1. The correlation function used in R Studio as well as the scatter matrix show which attributes may be correlated to the class.
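As a minimal sketch, the histogram and correlation checks can be reproduced in R (attribute names as renamed above; the Creditability class is `CreditStatus` here):

```r
## Histogram of a numeric attribute of concern
hist(data$Credit_amount, breaks = 30,
     main = "Credit amount", xlab = "Credit amount (DM)")

## Correlation of the numeric attributes with the 0/1 class
num_cols <- c("Duration", "Credit_amount", "Age",
              "Installment_Commitment", "Existing_credits")
cor(data[num_cols], data$CreditStatus)

## Scatterplot matrix of a few attributes, coloured by class
pairs(data[c("Credit_amount", "Age", "Duration")],
      col = data$CreditStatus + 1)
```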
II. Load the dataset in Weka and click on the visualization tab. Which attributes seem to be correlated? Which attributes seem to be most linked to the class attribute? Initial Analysis:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
12. According to the above feature selection methods, the following (in no particular order) seem to be most correlated:
1. Account Balance
2. Credit History
3. Duration
4. Value Savings/Stocks
5. Length of current employment
6. Concurrent Credits
7. Purpose
15. I. Determine whether the dataset has an imbalanced class distribution (same proportion of records of different types or not).
16. • There was an imbalance in the CLASS attribute, with a ratio of up to 40%, hence the classes needed to be balanced. This is common in datasets used for fraud analysis.
◦ SMOTE was used to balance the classes, and the balanced class distribution was visualized.
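The balancing step can be sketched in R; this assumes the `DMwR` package (the `themis` or `smotefamily` packages are alternatives on newer installs), and the oversampling/undersampling percentages shown are illustrative, not the exact settings used in WEKA:

```r
## Balance the classes with SMOTE (sketch; percentages are illustrative)
library(DMwR)
data$CreditStatus <- as.factor(data$CreditStatus)
balanced <- SMOTE(CreditStatus ~ ., data,
                  perc.over = 100,    # synthesize 100% more minority records
                  perc.under = 200)   # keep 2x that many majority records
table(balanced$CreditStatus)          # class counts after balancing
```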
I. Determine whether you need to handle missing values or transform any attributes. Weka filters (on the main tab) can be used for this purpose.
• WEKA shows Missing Values (0%), so these do not need to be dealt with. The CLASS values are already nominal, and normalizing needs to be done.
◦ Normalize the numeric features so the range (scale) is between 0 and 1.
◦ All values are correctly categorized as nominal or numeric.
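The normalization step can be sketched in R (it mirrors WEKA's unsupervised `Normalize` filter; the class column is excluded):

```r
## Min-max normalization of the numeric attributes to [0, 1]
num_cols <- c("Duration", "Credit_amount", "Installment_Commitment",
              "Residence_since", "Age", "Existing_credits", "Num_dependents")
data[num_cols] <- lapply(data[num_cols],
                         function(x) (x - min(x)) / (max(x) - min(x)))
summary(data[num_cols])  # every attribute now spans [0, 1]
```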
17. Predictive Modeling/Classification
I. Determine the right strategy for dataset split: simple training/testing split, 10-fold cross-validation, 3-fold cross-validation, etc.
II. Investigate the use of different parameters present in Weka for Decision Tree and compare your results obtained in different settings. Understand your decision trees generated by Weka.
III. Repeat the same process for Naïve Bayes and the third classification algorithm of your choice.
• The following algorithms were run and tested to determine which gives the most accurate results on this dataset.
IV. Determine your performance measures (accuracy, recall, etc.).
V. Identify which algorithm performs well and in which settings.
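The modelling steps above can be sketched in R with `caret` (the report's actual runs were done in WEKA; the Naïve Bayes method here additionally assumes the `klaR` package is installed):

```r
## Sketch: 10-fold cross-validation for a decision tree and Naive Bayes
library(caret)
library(rpart)
data$CreditStatus <- as.factor(data$CreditStatus)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(1)
tree_fit <- train(CreditStatus ~ ., data = data,
                  method = "rpart", trControl = ctrl)
set.seed(1)
nb_fit <- train(CreditStatus ~ ., data = data,
                method = "nb", trControl = ctrl)  # requires the klaR package

## Compare accuracy/kappa across the 10 folds
summary(resamples(list(DecisionTree = tree_fit, NaiveBayes = nb_fit)))
```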
21. Post Predictive Modelling:
● Investigate the use of the K-Means algorithm to segment the data of the predicted class of importance.
● Analyze each segment (group or cluster) and identify the characteristics of customers (type of records) in each group; e.g., the characteristics of a group/cluster can be determined by finding the majority of attributes in that group.
● Explain your interpretation of the characteristics and state the recommendations for the organization.
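The segmentation step can be sketched in R; k = 3 and the attribute subset are illustrative choices, and the attributes are standardized so none dominates the distance:

```r
## K-means segmentation of the predicted "Good" (Creditability = 1) applicants
good <- subset(data, CreditStatus == 1,
               select = c(Duration, Credit_amount, Age, Installment_Commitment))
good <- scale(good)                           # standardize the attributes
set.seed(42)
seg <- kmeans(good, centers = 3, nstart = 25) # k = 3 is an illustrative choice
seg$centers                                   # per-cluster profile (z-scores)
table(seg$cluster)                            # segment sizes
```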
Here are some suggestions for pattern mining:
● Explore association-rule-based patterns for the records of the class of interest by using the Apriori algorithm in Weka on your dataset.
● You may have to use selected qualitative (categorical or ordinal) attributes to discover patterns.
● Try different values for minimum support and confidence, select the values that provide the appropriate number of rules and justify your selection.
● Identify the frequent and logically correct patterns and state your recommendations for the organization on different types of customers belonging to the predicted class.
Describe your analysis for this section in your report.
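An analogous run of Apriori can be sketched in R with the `arules` package (WEKA's Apriori works on the same principle); the attribute subset and the support/confidence thresholds are illustrative and would be tuned as described above:

```r
## Association rules on selected qualitative attributes, with the
## "Good" class (CreditStatus = 1) fixed on the right-hand side
library(arules)
nominal <- c("CreditStatus", "Checking_Status", "Credit_history", "Purpose",
             "Savings_status", "Employment", "Personal_status", "Housing", "Job")
cats <- data[nominal]
cats[] <- lapply(cats, as.factor)
trans <- as(cats, "transactions")
rules <- apriori(trans,
                 parameter = list(supp = 0.1, conf = 0.8),
                 appearance = list(rhs = "CreditStatus=1", default = "lhs"))
inspect(head(sort(rules, by = "confidence"), 10))
```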
23. Conclusions & Recommendations:
We went through all the algorithms and models in WEKA. The best results we obtained are described above in tabular form. The best solution was SVM with 100% SMOTE and 10-fold cross-validation, but as can be seen in the ROC curve comparison, Logistic Regression with 100% SMOTE and 10-fold cross-validation yields the highest value at 0.8474; hence this was chosen as the best model. It was used to generate the results in the output window, which were fairly close to the actual values. The negatives did not look accurate, but that is because we are dealing with an imbalanced dataset.
5.0 Conclusion
Classification is a form of data analysis that extracts models describing important data classes. We have developed an effective and scalable model using SMOTE in combination with Support Vector Machine, Logistic Regression, Multilayer Perceptron and 10-fold cross-validation. We evaluated the model using several metrics including accuracy, sensitivity, specificity, Mean Absolute Error, Root Mean Squared Error and Relative Absolute Error. 10-fold cross-validation is recommended for accuracy estimation, and significance tests and ROC curves are useful for model selection.