Credit Risk Assessment using Machine Learning Techniques with WEKA
1. German Credit Data
Members
Mehnaz Newaz, mnewaz@ryerson.ca
Mashfiq Shahriar, mshahriar@ryerson.ca
Summary: The goal of this project is to obtain a machine learning model to perform credit scoring. Credit risk assessment of an applicant is vital to the banking sector. There are 20 attributes used in judging a loan applicant (7 numerical, 13 categorical/nominal). The goal is to classify the applicant into one of two categories: Good or Bad. A few algorithms were tested for accuracy, and the best model was chosen; its predictions were then analyzed to see how well the model performed.
Workload Distribution:
Member Name: List of Tasks Performed

Mehnaz Newaz (50%), equal distribution of work:
● Data Preparation
● Predictive Modeling/Classification
● Post-prediction Analysis
● Conclusions and Recommendations

Mashfiq Shahriar (50%), equal distribution of work:
● Data Preparation
● Predictive Modeling/Classification
● Post-prediction Analysis
● Conclusions and Recommendations
2. Data Preparation:
I. Look at the attribute type; e.g., nominal, ordinal or quantitative.
@relation 'german_credit-weka.filters.unsupervised.attribute.Reorder-R2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,1'
◦ @attribute 'Account Balance' {1,2,3,4}
◦ @attribute 'Duration of Credit (month)' numeric
◦ @attribute 'Payment Status of Previous Credit' {0,1,2,3,4}
◦ @attribute Purpose {0,1,2,3,4,5,6,8,9,10}
◦ @attribute 'Credit Amount' numeric
◦ @attribute 'Value Savings/Stocks' {1,2,3,4,5}
◦ @attribute 'Length of current employment' {1,2,3,4,5}
◦ @attribute 'Instalment per cent' numeric
◦ @attribute 'Sex & Marital Status' {1,2,3,4}
◦ @attribute Guarantors {1,2,3}
◦ @attribute 'Duration in Current address' {1,2,3,4}
◦ @attribute 'Most valuable available asset' {1,2,3,4}
◦ @attribute 'Age (years)' numeric
◦ @attribute 'Concurrent Credits' {1,2,3}
◦ @attribute 'Type of apartment' {1,2,3}
◦ @attribute 'No of Credits at this Bank' numeric
◦ @attribute Occupation {1,2,3,4}
◦ @attribute 'No of dependents' numeric
◦ @attribute Telephone {1,2}
◦ @attribute 'Foreign Worker' {1,2}
◦ @attribute Creditability {0,1}
II. Find max, min, mean and standard deviation of attributes.
III. Determine any outlier values (records) for each of the attributes; the attributes under consideration that did have outliers are reported below.
```{r data,echo=FALSE,message=FALSE,warning=FALSE}
## Importing packages
library(MASS)
library(car)
library(caret)
library(randomForest)
library(ROCR)
library(e1071)

## Loading data into environment
data <- read.csv("~/Desktop/FinalProject/CIND119/german_credit_card/german_credit.csv",
                 header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
german <- c("CreditStatus", "Checking_Status", "Duration", "Credit_history", "Purpose",
            "Credit_amount", "Savings_status", "Employment", "Installment_Commitment",
            "Personal_status", "Other_parties", "Residence_since", "Property_Magnitude",
            "Age", "Other_payment_plans", "Housing", "Existing_credits", "Job",
            "Num_dependents", "Own_telephone", "Foreign_worker")
names(data) <- german

## Variable names and structure of the dataset
names(data)
str(data)
```
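Items II and III can be sketched in R as follows, assuming `data` has been loaded and renamed as in the chunk above; the 1.5 × IQR boxplot rule used here is one common outlier criterion, not the only one:

```r
## Max, min, mean and standard deviation of the numeric attributes
num_cols <- c("Duration", "Credit_amount", "Installment_Commitment",
              "Residence_since", "Age", "Existing_credits", "Num_dependents")
stats <- t(sapply(data[num_cols], function(x)
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))))
print(round(stats, 2))

## Flag outlying records per attribute with the 1.5 * IQR rule
outliers <- lapply(data[num_cols], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})
sapply(outliers, length)  # number of outlying records per attribute
```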
11. I. Analyze the distribution of numeric attributes (normal or other). Plot histograms for attributes of concern and analyze whether they have any influence on the class attribute.
1. The correlation function used in R Studio as well as the scatter matrix show which attributes may be correlated to the class.
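As a minimal sketch, the histogram and correlation checks can be reproduced in R (attribute names as renamed above; the Creditability class is `CreditStatus` here):

```r
## Histogram of a numeric attribute of concern
hist(data$Credit_amount, breaks = 30,
     main = "Credit amount", xlab = "Credit amount (DM)")

## Correlation of the numeric attributes with the 0/1 class
num_cols <- c("Duration", "Credit_amount", "Age",
              "Installment_Commitment", "Existing_credits")
cor(data[num_cols], data$CreditStatus)

## Scatterplot matrix of a few attributes, coloured by class
pairs(data[c("Credit_amount", "Age", "Duration")],
      col = data$CreditStatus + 1)
```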
II. Load the dataset in Weka and click on the visualization tab. Which attributes seem to be correlated? Which attributes seem to be most linked to the class attribute? Initial Analysis:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
12. According to the above feature selection methods, the following (in no particular order) seem to be most correlated:
1. Account Balance
2. Credit History
3. Duration
4. Value Savings/Stocks
5. Length of current employment
6. Concurrent Credits
7. Purpose
15. I. Determine whether the dataset has an imbalanced class distribution (same proportion of records of different types or not).
16. • There was an imbalance in the CLASS attribute, with a ratio of up to 40%, hence the classes needed to be balanced. This is common in datasets used for fraud analysis.
◦ SMOTE was used to balance the classes, and the balanced class distribution was visualized.
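The balancing step can be sketched in R; this assumes the `DMwR` package (the `themis` or `smotefamily` packages are alternatives on newer installs), and the oversampling/undersampling percentages shown are illustrative, not the exact settings used in WEKA:

```r
## Balance the classes with SMOTE (sketch; percentages are illustrative)
library(DMwR)
data$CreditStatus <- as.factor(data$CreditStatus)
balanced <- SMOTE(CreditStatus ~ ., data,
                  perc.over = 100,    # synthesize 100% more minority records
                  perc.under = 200)   # keep 2x that many majority records
table(balanced$CreditStatus)          # class counts after balancing
```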
I. Determine whether you need to handle missing values or transform any attributes. Weka filters (on the main tab) can be used for this purpose.
• WEKA shows Missing Values (0%), so these do not need to be dealt with. The CLASS values are already nominal, and normalizing needs to be done.
◦ Normalize the numeric features so the range (scale) is between 0 and 1.
◦ All values are correctly categorized as nominal or numeric.
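The normalization step can be sketched in R (it mirrors WEKA's unsupervised `Normalize` filter; the class column is excluded):

```r
## Min-max normalization of the numeric attributes to [0, 1]
num_cols <- c("Duration", "Credit_amount", "Installment_Commitment",
              "Residence_since", "Age", "Existing_credits", "Num_dependents")
data[num_cols] <- lapply(data[num_cols],
                         function(x) (x - min(x)) / (max(x) - min(x)))
summary(data[num_cols])  # every attribute now spans [0, 1]
```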
17. Predictive Modeling/Classification
I. Determine the right strategy for dataset split: simple training/testing split, 10-fold cross-validation, 3-fold cross-validation, etc.
II. Investigate the use of different parameters present in Weka for Decision Tree and compare your results obtained in different settings. Understand your decision trees generated by Weka.
III. Repeat the same process for Naïve Bayes and the third classification algorithm of your choice.
• The following algorithms were run and tested to determine which gives the most accurate results on this dataset.
IV. Determine your performance measures (accuracy, recall, etc.).
V. Identify which algorithm performs well and in which settings.
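The modelling steps above can be sketched in R with `caret` (the report's actual runs were done in WEKA; the Naïve Bayes method here additionally assumes the `klaR` package is installed):

```r
## Sketch: 10-fold cross-validation for a decision tree and Naive Bayes
library(caret)
library(rpart)
data$CreditStatus <- as.factor(data$CreditStatus)
ctrl <- trainControl(method = "cv", number = 10)

set.seed(1)
tree_fit <- train(CreditStatus ~ ., data = data,
                  method = "rpart", trControl = ctrl)
set.seed(1)
nb_fit <- train(CreditStatus ~ ., data = data,
                method = "nb", trControl = ctrl)  # requires the klaR package

## Compare accuracy/kappa across the 10 folds
summary(resamples(list(DecisionTree = tree_fit, NaiveBayes = nb_fit)))
```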
21. Post Predictive Modelling:
● Investigate the use of the K-Means algorithm to segment the data of the predicted class of importance.
● Analyze each segment (group or cluster) and identify the characteristics of customers (type of records) in each group; e.g., the characteristics of a group/cluster can be determined by finding the majority of attributes in that group.
● Explain your interpretation of the characteristics and state the recommendations for the organization.
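The segmentation step can be sketched in R; k = 3 and the attribute subset are illustrative choices, and the attributes are standardized so none dominates the distance:

```r
## K-means segmentation of the predicted "Good" (Creditability = 1) applicants
good <- subset(data, CreditStatus == 1,
               select = c(Duration, Credit_amount, Age, Installment_Commitment))
good <- scale(good)                           # standardize the attributes
set.seed(42)
seg <- kmeans(good, centers = 3, nstart = 25) # k = 3 is an illustrative choice
seg$centers                                   # per-cluster profile (z-scores)
table(seg$cluster)                            # segment sizes
```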
Here are some suggestions for pattern mining:
● Explore association-rule-based patterns for the records of the class of interest by using the Apriori algorithm in Weka on your dataset.
● You may have to use selected qualitative (categorical or ordinal) attributes to discover patterns.
● Try different values for minimum support and confidence, select the values that provide the appropriate number of rules and justify your selection.
● Identify the frequent and logically correct patterns and state your recommendations for the organization on different types of customers belonging to the predicted class.
Describe your analysis for this section in your report.
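An analogous run of Apriori can be sketched in R with the `arules` package (WEKA's Apriori works on the same principle); the attribute subset and the support/confidence thresholds are illustrative and would be tuned as described above:

```r
## Association rules on selected qualitative attributes, with the
## "Good" class (CreditStatus = 1) fixed on the right-hand side
library(arules)
nominal <- c("CreditStatus", "Checking_Status", "Credit_history", "Purpose",
             "Savings_status", "Employment", "Personal_status", "Housing", "Job")
cats <- data[nominal]
cats[] <- lapply(cats, as.factor)
trans <- as(cats, "transactions")
rules <- apriori(trans,
                 parameter = list(supp = 0.1, conf = 0.8),
                 appearance = list(rhs = "CreditStatus=1", default = "lhs"))
inspect(head(sort(rules, by = "confidence"), 10))
```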
23. Conclusions & Recommendations:
We went through all the algorithms and models in WEKA. The best results we obtained are described above in tabular form. The best solution was SVM with 100% SMOTE and 10-fold cross-validation, but as can be seen in the ROC curve comparison, Logistic Regression with 100% SMOTE and 10-fold cross-validation yields the highest value at 0.8474; hence this was chosen as the best model. It was used to generate the results in the output window, which were fairly close to the actual values. The negatives did not look accurate, but that is because we are dealing with an imbalanced dataset.
5.0 Conclusion
Classification is a form of data analysis that extracts models describing important data classes. We have developed an effective and scalable model using SMOTE in combination with Support Vector Machine, Logistic Regression, Multilayer Perceptron and 10-fold cross-validation. We evaluated the model using several metrics including accuracy, sensitivity, specificity, Mean Absolute Error, Root Mean Squared Error and Relative Absolute Error. 10-fold cross-validation is recommended for accuracy estimation, and significance tests and ROC curves are useful for model selection.