German Credit Data
Members
Mehnaz Newaz, mnewaz@ryerson.ca
Mashfiq Shahriar, mshahriar@ryerson.ca
Summary: The goal of this project is to obtain a machine learning model that performs credit scoring. Credit risk
assessment of an applicant is vital to the banking sector. Twenty attributes are used in judging a loan applicant (7
numerical, 13 categorical/nominal). The goal is to classify the applicant into one of two categories: good or bad. Several
algorithms were tested for accuracy, and the best model was chosen to predict the data; its predictions were then
analyzed to see how well the model performed.
Workload Distribution:
In this section, you need to mention who did what in the project.
Member Name | List of Tasks Performed
Mehnaz Newaz (50%, equal distribution of work) | Data Preparation; Predictive Modeling/Classification; Post-prediction Analysis; Conclusions and Recommendations
Mashfiq Shahriar (50%, equal distribution of work) | Data Preparation; Predictive Modeling/Classification; Post-prediction Analysis; Conclusions and Recommendations
Data Preparation:
I. Look at the attribute type; e.g., nominal, ordinal or quantitative.
@relation 'german_credit-weka.filters.unsupervised.attribute.Reorder-R2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,1'
◦ @attribute 'Account Balance' {1,2,3,4}
◦ @attribute 'Duration of Credit (month)' numeric
◦ @attribute 'Payment Status of Previous Credit' {0,1,2,3,4}
◦ @attribute Purpose {0,1,2,3,4,5,6,8,9,10}
◦ @attribute 'Credit Amount' numeric
◦ @attribute 'Value Savings/Stocks' {1,2,3,4,5}
◦ @attribute 'Length of current employment' {1,2,3,4,5}
◦ @attribute 'Instalment per cent' numeric
◦ @attribute 'Sex & Marital Status' {1,2,3,4}
◦ @attribute Guarantors {1,2,3}
◦ @attribute 'Duration in Current address' {1,2,3,4}
◦ @attribute 'Most valuable available asset' {1,2,3,4}
◦ @attribute 'Age (years)' numeric
◦ @attribute 'Concurrent Credits' {1,2,3}
◦ @attribute 'Type of apartment' {1,2,3}
◦ @attribute 'No of Credits at this Bank' numeric
◦ @attribute Occupation {1,2,3,4}
◦ @attribute 'No of dependents' numeric
◦ @attribute Telephone {1,2}
◦ @attribute 'Foreign Worker' {1,2}
◦ @attribute Creditability {0,1}
II. Find max, min, mean and standard deviation of attributes.
III. Determine any outlier values (records) for each of the attributes or attributes under consideration. The outliers
are as follows, for the attributes that did have them.
```{r data,echo=FALSE,message=FALSE,warning=FALSE}
## Importing packages
library(MASS)
library(car)
library(caret)
library(randomForest)
library(ROCR)
library(e1071)

## Loading data into environment
data <- read.csv("~/Desktop/FinalProject/CIND119/german_credit_card/german_credit.csv",
                 header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))

## Rename the columns (class attribute first, then the 20 predictors)
german <- c("CreditStatus", "Checking_Status", "Duration", "Credit_history", "Purpose",
            "Credit_amount", "Savings_status", "Employment", "Installment_Commitment",
            "Personal_status", "Other_parties", "Residence_since", "Property_Magnitude",
            "Age", "Other_payment_plans", "Housing", "Existing_credits", "Job",
            "Num_dependents", "Own_telephone", "Foreign_worker")
names(data) <- german

## Variable names and structure of the dataset
names(data)
str(data)

## Class distribution: counts and proportions
## (the column is CreditStatus; data$credit.CreditStatus does not exist)
table(data$CreditStatus)
table(data$CreditStatus) / nrow(data)

## Summary statistics
summary(data)
summary(data$Duration)
summary(data$Age)
summary(data$Credit_amount)

## Boxplot of every column, then boxplot statistics (including outliers)
for (col in names(data)) boxplot(data[[col]])
for (col in names(data)) print(boxplot.stats(data[[col]]))

## Correlation matrix of all columns
cor(data)

## Save the renamed dataset
write.csv(data, "final.csv")
```
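The chunk above relies on `summary()`, which reports the minimum, maximum and mean but not the standard deviation asked for in item II. A minimal sketch of one way to collect all four, assuming a data frame of numeric columns; the toy values and the helper name `describe_numeric` are illustrative and not taken from the project:

```r
# Toy stand-in for the numeric attributes (values are made up for illustration).
toy <- data.frame(
  Duration      = c(6, 12, 24, 48),
  Credit_amount = c(1000, 2000, 3000, 6000),
  Age           = c(22, 35, 41, 60)
)

# Hypothetical helper: min, max, mean and sample standard deviation per column.
describe_numeric <- function(df) {
  t(sapply(df, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

# One row per attribute, one column per statistic.
stats_table <- describe_numeric(toy)
print(round(stats_table, 2))
```

These statistics are only meaningful for the numeric attributes; for the nominal codes a frequency table is usually the more informative summary.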
> boxplot.stats(data$Duration)
$stats
[1] 4 12 18 24 42
$n
[1] 1000
$conf
[1] 17.40043 18.59957
$out
[1] 48 48 48 48 48 48 48 48 47 48 60 54 48 48 60 48 60 48 48 60 48 48 48 48 48 48 60 60 45
[30] 48 48 48 72 48 60 60 48 60 60 48 60 48 48 48 60 45 48 48 48 48 48 48 48 60 48 48 48 48
[59] 48 48 45 48 54 48 45 48 45 48 48 48
> boxplot.stats(data$Purpose)
$stats
[1] 0 1 2 3 6
$n
[1] 1000
$conf
[1] 1.900072 2.099928
$out
[1] 9 10 10 9 9 9 9 9 9 9 9 9 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[30] 8 9 9 9 8 9 9 9 8 9 9 9 10 9 9 9 9 9 9 9 9 9 9 10 9 9 8 9 9
[59] 9 9 9 9 8 9 9 9 9 9 10 10 8 9 9 10 9 9 8 9 9 9 9 9 9 9 9 9 9
[88] 9 9 9 9 9 10 10 10 9 9 9 9 10 9 9 9 9 8 9 9 9 9 9 9 9 9 9 9 10
[117] 9 9
> boxplot.stats(data$Credit_amount)
$stats
[1] 250.0 1365.0 2319.5 3972.5 7882.0
$n
[1] 1000
$conf
[1] 2189.219 2449.781
$out
[1] 10875 8858 12749 8072 8487 12169 10722 8613 8588 10366 8133 9436 10477 13756
[15] 11760 14179 10974 9566 8358 9857 10222 9055 7966 12204 8229 10623 9277 15857
[29] 10144 15653 8335 8471 8947 11054 9157 9283 14555 9271 8386 14318 15672 10961
[43] 7980 11560 11328 11938 14782 12612 9398 9572 8065 9034 14027 9629 12976 10297
[57] 14421 8086 10127 12389 11590 15945 9960 8648 8318 11816 11998 18424 14896 8978
[71] 12579 12680
> boxplot.stats(data$Other_parties)
$stats
[1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1
$out
[1] 3 2 3 3 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 3 3 2 3 3 3 2 2 3 3 3 3 3 2 3 3 2 2 3
[45] 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 3 3 3 2 2 2 2 2 2 2 3 2 2 2 3 2 2 3 3
[89] 3 2 2 3 2
> boxplot.stats(data$Age)
$stats
[1] 19 27 33 42 64
$n
[1] 1000
$conf
[1] 32.25054 33.74946
$out
[1] 65 74 74 74 65 66 68 66 66 70 67 65 75 66 75 67 65 67 65 66 74 68 68
> boxplot.stats(data$Other_payment_plans)
$stats
[1] 3 3 3 3 3
$n
[1] 1000
$conf
[1] 3 3
$out
[1] 1 1 2 2 2 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[44] 1 1 2 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 2 2 2 1 1 2 2 1 1
[87] 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 1 1 1 2 1 1 1 1 1 2 2 1 1 2 2 1
[130] 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 2 1 1
[173] 1 1 1 2 1 1 1 2 1 1 1 1 2 1
> boxplot.stats(data$Housing)
$stats
[1] 2 2 2 2 2
$n
[1] 1000
$conf
[1] 2 2
$out
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 3 1 1 1 1 1 1 1 3 3 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1
[44] 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 3 1 1 3 1 1 1 1 1 3 1 1 3 3
[87] 1 1 3 1 3 3 1 1 1 3 1 3 1 3 3 3 3 3 1 1 3 1 3 1 3 1 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 3 1
[130] 3 1 1 1 3 3 3 1 3 1 1 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 1 3 1 1 1 1 3 3 1 1 3
[173] 1 3 1 1 3 3 3 1 1 3 1 1 3 3 3 3 3 1 1 3 3 1 3 1 3 1 3 1 3 1 3 1 1 1 1 3 1 3 3 1 1 3 1
[216] 1 1 1 3 3 3 1 3 3 1 1 1 3 1 1 1 1 3 1 3 1 1 1 3 1 3 1 1 1 3 3 3 1 1 1 3 3 1 1 1 1 3 1
[259] 1 1 3 3 1 3 1 3 1 1 1 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 3
> boxplot.stats(data$Existing_credits)
$stats
[1] 1 1 1 2 3
$n
[1] 1000
$conf
[1] 0.950036 1.049964
$out
[1] 4 4 4 4 4 4
> boxplot.stats(data$Job)
$stats
[1] 3 3 3 3 3
$n
[1] 1000
$conf
[1] 3 3
$out
[1] 2 2 2 2 2 2 1 1 4 2 2 2 2 2 2 2 4 2 4 2 2 4 2 2 4 2 4 4 2 2 2 4 2 2 4 4 4 2 4 4 2 4 2
[44] 2 2 4 4 2 2 2 4 2 4 4 2 4 4 4 4 4 4 2 2 4 4 4 2 1 1 2 1 2 2 4 2 2 2 2 2 2 4 2 4 4 2 2
[87] 2 2 4 2 4 2 2 2 4 2 2 2 2 4 2 2 2 2 4 2 4 2 1 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 4 4
[130] 2 4 2 2 2 2 2 4 4 4 2 4 2 2 1 2 2 1 1 2 1 4 4 2 4 2 4 1 4 4 2 2 4 2 2 2 4 4 2 1 4 2 2
[173] 2 4 4 1 4 4 4 4 4 4 2 4 2 2 4 2 2 2 2 2 4 4 4 2 2 2 2 4 4 2 4 2 4 4 4 4 4 2 1 2 2 4 2
[216] 4 2 4 4 4 2 2 4 4 2 4 4 4 4 4 4 4 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 1 2 2 2 2
[259] 2 2 4 1 2 4 4 4 2 4 4 4 4 4 4 2 2 2 2 2 4 2 1 2 2 4 4 2 4 4 4 4 2 2 2 1 2 4 4 2 4 4 2
[302] 2 4 2 4 4 4 4 2 2 4 4 2 4 4 2 2 2 2 4 2 2 4 4 4 1 4 4 4 4 4 4 4 2 2 2 2 2 4 2 2 2 4 2
[345] 4 2 2 2 2 1 1 2 2 2 2 4 4 2 4 4 4 4 4 2 2 1 4 2 4 4
> boxplot.stats(data$Num_dependents)
$stats
[1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1
$out
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[44] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[87] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[130] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> boxplot.stats(data$Foreign_worker)
$stats
[1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1$out [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
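By default, `boxplot.stats()` flags a value as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. For Duration, the hinges reported above are 12 and 24, so the upper fence is 24 + 1.5 × 12 = 42, and every listed value (45 and higher) falls outside it. A small sketch of the same rule on made-up values; `iqr_outliers` is a hypothetical helper name:

```r
# Flag values lying more than coef * (upper hinge - lower hinge) beyond the
# hinges, mirroring the default rule of boxplot.stats() (coef = 1.5).
iqr_outliers <- function(x, coef = 1.5) {
  five   <- fivenum(x)          # min, lower hinge, median, upper hinge, max
  spread <- five[4] - five[2]
  lower  <- five[2] - coef * spread
  upper  <- five[4] + coef * spread
  x[x < lower | x > upper]
}

# Made-up durations in months, for illustration only.
toy_duration <- c(6, 12, 12, 18, 24, 24, 36, 72)
out <- iqr_outliers(toy_duration)
print(out)
```

For nominal codes such as Housing or Job, where all five statistics coincide, this rule flags every record outside the dominant category, so those values are probably better read as minority categories than as true outliers.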
I. Analyze the distribution of numeric attributes (normal or other). Plot histograms for
attributes of concern and analyze whether they have any influence on the class
attribute.
1. The correlation function used in RStudio, as well as the scatter matrix, shows which attributes may be
correlated with the class.
II. Load the dataset in Weka and click on the visualization tab. Which attributes seem to be
correlated? Which attributes seem to be most linked to the class attribute? Initial analysis:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
According to the above feature selection methods, the following (in no particular order) seem to be most correlated:
1. Account Balance
2. Credit History
3. Duration
4. Value Savings/Stocks
5. Length of current employment
6. Concurrent Credits
7. Purpose
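One way to turn the correlation output into a ranking is to sort the attributes by the absolute value of their correlation with the class. The sketch below assumes a 0/1 class column and uses made-up records; it is not the feature selection procedure used in the project, only an illustration of the idea:

```r
# Made-up records: class label (1 = good, 0 = bad) and two numeric attributes.
toy <- data.frame(
  Creditability = c(1, 1, 1, 0, 0, 0),
  Duration      = c(6, 12, 18, 36, 48, 60),
  Age           = c(30, 45, 25, 40, 28, 35)
)

# Pearson correlation of each attribute with the class column.
class_cor <- cor(toy[, c("Duration", "Age")], toy$Creditability)[, 1]

# Rank attributes by absolute correlation, strongest first.
ranked <- class_cor[order(abs(class_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Pearson correlation treats the category codes of nominal attributes as numbers, so for those attributes the ranking should be read with caution.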
I. Determine whether the dataset has an imbalanced class distribution (same proportion of records of different
types or not).
• There was an imbalance in the class attribute, with a ratio of up to 40%, hence the classes need to be balanced. This is
common in datasets used for fraud analysis.
◦ SMOTE was used to balance the classes, and the balanced classes were then visualized.
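SMOTE balances classes by adding synthetic minority records, each placed on the line segment between a minority record and one of its nearest minority neighbours. The sketch below is a simplified base-R illustration of that idea, not Weka's SMOTE filter; the helper name `smote_sketch`, the toy records, and the reading of "100%" as one synthetic record per original minority record are assumptions:

```r
set.seed(1)  # reproducible synthetic points

# Made-up minority-class records with two numeric attributes.
minority <- matrix(c(1.0, 2.0,
                     1.5, 1.8,
                     2.0, 2.2,
                     1.2, 2.5), ncol = 2, byrow = TRUE)

# Hypothetical helper: create n_new synthetic rows, each on the segment between
# a random minority row and one of its k nearest minority neighbours.
smote_sketch <- function(x, n_new, k = 2) {
  d <- as.matrix(dist(x))                  # pairwise Euclidean distances
  synth <- matrix(NA_real_, n_new, ncol(x))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x), 1)
    nbrs <- order(d[base, ])[2:(k + 1)]    # position 1 is the row itself
    nb   <- nbrs[sample(k, 1)]
    gap  <- runif(1)                       # interpolation weight in [0, 1]
    synth[i, ] <- x[base, ] + gap * (x[nb, ] - x[base, ])
  }
  synth
}

# One synthetic row per original minority row.
synthetic <- smote_sketch(minority, n_new = nrow(minority))
balanced_minority <- rbind(minority, synthetic)
print(nrow(balanced_minority))
```

Because each synthetic row is a weighted average of two real rows, it always stays within the range of the original minority records.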
I. Determine whether you need to handle missing values or transform any attributes. Weka filters (on the main
tab) can be used for this purpose.
• WEKA shows Missing Values (0%), hence we do not need to deal with those. The class values are already nominal,
and normalizing needs to be done.
◦ Normalize the numeric features so that their range or scale is between 0 and 1.
◦ All values are correctly categorized as nominal or numeric.
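Scaling a numeric feature to the range 0 to 1 is commonly done with min-max normalization, (x - min) / (max - min). A minimal sketch, assuming this is the intended normalization; the helper name `min_max` and the sample amounts are illustrative:

```r
# Rescale a numeric vector linearly so its minimum maps to 0 and its maximum to 1.
min_max <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up credit amounts, for illustration only.
toy_amount <- c(250, 1000, 4000, 8000)
scaled <- min_max(toy_amount)
print(scaled)
```

The same function can be applied column by column to the numeric attributes, leaving the nominal ones untouched.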
Predictive Modeling/Classification
I. Determine the right strategy for dataset split: simple training or testing, 10-fold cross validation, 3-fold cross
validation, etc.
II. Investigate the use of different parameters present in Weka for Decision Tree and compare your results obtained
in different settings. Understand your decision trees generated by Weka.
III. Repeat the same process for Naïve Bayes and the third classification algorithm of your choice.
I. The following algorithms were run and tested to determine which algorithm gives the most accurate
results for the dataset.
IV. Determine your performance measures (accuracy, recall, etc.).
V. Identify which algorithm performs well and in which settings.
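The two split strategies named in item I can be sketched in a few lines. The example below assumes 1000 records, as reported in the boxplot statistics above, and only shows how the record indices are partitioned; the variable names are illustrative and no model is trained:

```r
set.seed(42)  # reproducible shuffling

n <- 1000     # number of records
k <- 10       # number of folds

# Assign each record to one of k folds of equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold 1: records in the fold are held out, the rest are used for training.
test_idx  <- which(fold_id == 1)
train_idx <- which(fold_id != 1)

# Simple 80-20 percentage split for comparison.
n_train     <- round(0.8 * n)
shuffled    <- sample(n)
split_train <- shuffled[seq_len(n_train)]
split_test  <- shuffled[-seq_len(n_train)]

print(table(fold_id))
```

In 10-fold cross-validation every record is used for testing exactly once, whereas the 80-20 split evaluates on a single held-out fifth of the data.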
The highlighted row in the table indicates the best algorithm and parameters. Its results were pretty close to those of
the Support Vector Machine with 100% SMOTE, but according to the ROC curve results, Logistic Regression with
100% SMOTE and 10 folds is the best.
SMOTE | Percentage Split | Algorithm | Accuracy | Sensitivity | Specificity | Mean Abs Err | RMS Err | Rel Abs Err
NO | 80-20 | Logistic Regression | 73 | 69.4 | 74.83 | 32.55 | 42.28 | 73.97
NO | 80-20 | Multilayer Perceptron | 72 | 64 | 74.6 | 27.3 | 48.1 | 62.1
NO | 80-20 | Decision Tree | 73.5 | 69.38 | 74.6 | 32.33 | 48.57 | 73.48
NO | 80-20 | Support Vector Machine | 75 | 72.9 | 75.6 | 25 | 50 | 56.82
100% | 10FCV | Logistic Regression | 76.3 | 74.66 | 77.68 | 30.11 | 39.88 | 60.57
100% | 10FCV | Support Vector Machine | 76.84 | 74.8 | 78.5 | 23.15 | 48.12 | 46.58
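Accuracy, sensitivity and specificity all follow from the four cells of a two-class confusion matrix. The sketch below uses made-up counts (not the project's results) and assumes one class is treated as the positive class:

```r
# Made-up confusion-matrix counts for a two-class problem (not project results).
tp <- 63; fn <- 7    # actual positives: predicted positive / predicted negative
tn <- 18; fp <- 12   # actual negatives: predicted negative / predicted positive

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all records classified correctly
sensitivity <- tp / (tp + fn)                   # true positive rate (recall)
specificity <- tn / (tn + fp)                   # true negative rate

metrics <- c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
print(round(100 * metrics, 2))
```

Reporting sensitivity and specificity alongside accuracy matters on imbalanced data, because a model can reach high accuracy while performing poorly on the smaller class.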
Post Predictive Modelling:
● Investigate the use of K-Means algorithm to segment the data of the predicted class
of importance.
● Analyze each segment (group or cluster) and identify the characteristics of customers
(type of records) in each group; e.g., the characteristics of a group/cluster can be
determined by finding the majority of attributes in that group.
● Explain your interpretation of characteristics and state the recommendations for the
organization.
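The segmentation steps above can be sketched with base R's `kmeans()`. The example assumes two clusters and uses made-up records; the helper name `majority` and the choice of attributes are illustrative, not the project's actual setup:

```r
set.seed(7)  # k-means starts from random centres

# Made-up records of the predicted class: two numeric attributes, one nominal code.
toy <- data.frame(
  Age           = c(22, 25, 24, 23, 26, 55, 58, 60, 57, 61),
  Credit_amount = c(1200, 1500, 1100, 1300, 1400, 8000, 8500, 9000, 8200, 8800),
  Housing       = c(1, 1, 2, 1, 1, 2, 2, 2, 3, 2)
)

# Cluster on the scaled numeric attributes so that neither dominates the distance.
fit <- kmeans(scale(toy[, c("Age", "Credit_amount")]), centers = 2, nstart = 10)

# Hypothetical helper: most frequent value of a nominal attribute.
majority <- function(x) names(which.max(table(x)))

# Characterise each cluster: means of numeric attributes, majority nominal value.
profile <- data.frame(
  cluster      = sort(unique(fit$cluster)),
  mean_age     = as.numeric(tapply(toy$Age, fit$cluster, mean)),
  mean_amount  = as.numeric(tapply(toy$Credit_amount, fit$cluster, mean)),
  housing_mode = as.character(tapply(toy$Housing, fit$cluster, majority))
)
print(profile)
```

Cluster numbers are arbitrary labels, so the profile table, not the label, is what describes each customer segment.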
Here are some suggestions for pattern mining:
● Explore association rules based patterns for the records of the class of interest by
using the Apriori algorithm in Weka on your dataset.
● You may have to use selected qualitative (categorical or ordinal) attributes to
discover patterns.
● Try different values for minimum support and confidence, select the values that
provide the appropriate number of rules and justify your selection.
● Identify the frequent and logically correct patterns and state your recommendations
for the organization on different types of customers belonging to the predicted class.
Describe your analysis for this section in your report.
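Minimum support and confidence are the two thresholds that decide which rules Apriori keeps. The sketch below does not implement Apriori itself; it only computes both measures for a single candidate rule on made-up nominal records, with threshold values chosen purely for illustration:

```r
# Made-up nominal records of one class (values are illustrative only).
toy <- data.frame(
  Housing   = c("own", "own", "rent", "own", "own", "rent"),
  Telephone = c("yes", "yes", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Candidate rule: Housing = own  =>  Telephone = yes
lhs <- toy$Housing == "own"
rhs <- toy$Telephone == "yes"

support    <- mean(lhs & rhs)            # share of all records matching both sides
confidence <- sum(lhs & rhs) / sum(lhs)  # share of left-hand-side records that also match the right

# Keep the rule only if it clears both thresholds (threshold values are assumptions).
min_support    <- 0.3
min_confidence <- 0.7
keep_rule <- support >= min_support && confidence >= min_confidence
print(c(support = support, confidence = confidence))
```

Raising either threshold shrinks the rule set, which is why trying several values is needed to arrive at a manageable number of rules.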
Conclusions & Recommendations:
We went through all the algorithms and models in WEKA. The best results we obtained are described above in tabular form. By
accuracy, the best solution was the SVM with 100% SMOTE and 10-fold cross-validation. However, as can be seen in the ROC curve
comparison, Logistic Regression with 100% SMOTE and 10-fold cross-validation yields the highest value at 0.8474. Hence this was
the best model. It was used to produce the predictions in the output window, which were pretty close to the actual results. The
negatives did not look accurate, but that is because we are dealing with an unbalanced dataset.
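An ROC comparison is usually summarised by the area under the curve, which equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. A minimal sketch of that calculation using ranks, on made-up scores; `auc_rank` is a hypothetical helper, and this is not how the 0.8474 figure was produced:

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r     <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and labels (1 = positive class), not project output.
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
label <- c(1,   1,   0,   1,   0,    0)
auc <- auc_rank(score, label)
print(round(auc, 4))
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation, so models can be compared on this scale regardless of the classification threshold.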
5.0 Conclusion
Classification is a form of data analysis that extracts models describing important data classes. We have developed an
effective and scalable model using SMOTE in combination with Support Vector Machine, Logistic Regression, Multilayer
Perceptron and 10-fold cross-validation (10 FCV). We have evaluated the model using several metrics, including accuracy,
sensitivity, specificity, Mean Absolute Error, Root Mean Square Error and Relative Absolute Error. 10-fold cross-validation is
recommended for accuracy estimation, and significance tests and ROC curves are useful for model selection.

ICT role in 21st century education and its challenges
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 

Credit Risk Assessment using Machine Learning Techniques with WEKA

  • 1. German Credit Data
    Members: Mehnaz Newaz, mnewaz@ryerson.ca; Mashfiq Shahriar, mshahriar@ryerson.ca
    Summary: The goal of this project is to obtain a machine learning model that performs credit scoring. Credit risk assessment of an applicant is vital to the banking sector. There are 20 attributes used in judging a loan applicant (7 numerical, 13 categorical/nominal). The goal is to classify each applicant into one of two categories, Good or Bad. A few algorithms were tested for accuracy, and the best model was chosen to predict the data; its predictions were then analyzed to see how well the model performed.
    Workload Distribution: In this section, you need to mention who did what in the project.
    Member Name: List of Tasks Performed
    Mehnaz Newaz (50%, equal distribution of work): Data Preparation; Predictive Modeling/Classification; Post-prediction Analysis; Conclusions and Recommendations
    Mashfiq Shahriar (50%, equal distribution of work): Data Preparation; Predictive Modeling/Classification; Post-prediction Analysis; Conclusions and Recommendations
  • 2. Data Preparation:
    I. Look at the attribute type; e.g., nominal, ordinal or quantitative.
    @relation 'german_credit-weka.filters.unsupervised.attribute.Reorder-R2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,1'
    ◦ @attribute 'Account Balance' {1,2,3,4}
    ◦ @attribute 'Duration of Credit (month)' numeric
    ◦ @attribute 'Payment Status of Previous Credit' {0,1,2,3,4}
    ◦ @attribute Purpose {0,1,2,3,4,5,6,8,9,10}
    ◦ @attribute 'Credit Amount' numeric
    ◦ @attribute 'Value Savings/Stocks' {1,2,3,4,5}
    ◦ @attribute 'Length of current employment' {1,2,3,4,5}
    ◦ @attribute 'Instalment per cent' numeric
    ◦ @attribute 'Sex & Marital Status' {1,2,3,4}
    ◦ @attribute Guarantors {1,2,3}
    ◦ @attribute 'Duration in Current address' {1,2,3,4}
    ◦ @attribute 'Most valuable available asset' {1,2,3,4}
    ◦ @attribute 'Age (years)' numeric
    ◦ @attribute 'Concurrent Credits' {1,2,3}
    ◦ @attribute 'Type of apartment' {1,2,3}
    ◦ @attribute 'No of Credits at this Bank' numeric
    ◦ @attribute Occupation {1,2,3,4}
    ◦ @attribute 'No of dependents' numeric
    ◦ @attribute Telephone {1,2}
    ◦ @attribute 'Foreign Worker' {1,2}
    ◦ @attribute Creditability {0,1}
    II. Find max, min, mean and standard deviation of attributes.
    III. Determine any outlier values (records) for each of the attributes. The box-plot statistics for the attributes under consideration that did have outliers are listed below.
```{r data,echo=FALSE,message=FALSE,warning=FALSE}
## Importing packages
library(MASS)
library(car)
library(caret)
library(randomForest)
library(ROCR)
library(e1071)

## Loading data into environment
data <- read.csv("~/Desktop/FinalProject/CIND119/german_credit_card/german_credit.csv",
                 header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))

## Renaming the 21 columns (class attribute first)
german <- c("CreditStatus", "Checking_Status", "Duration", "Credit_history",
            "Purpose", "Credit_amount", "Savings_status", "Employment",
            "Installment_Commitment", "Personal_status", "Other_parties",
            "Residence_since", "Property_Magnitude", "Age", "Other_payment_plans",
            "Housing", "Existing_credits", "Job", "Num_dependents",
            "Own_telephone", "Foreign_worker")
names(data) <- german

# Variable names in dataset
names(data)
str(data)

## Class distribution: counts and proportions.
## The column is CreditStatus after renaming (credit.CreditStatus no longer exists).
table(data$CreditStatus)
table(data$CreditStatus) / nrow(data)

## Summary statistics
summary(data)
summary(data$Duration)
summary(data$Age)
summary(data$Credit_amount)

## Box plot and box-plot statistics for every attribute
for (col in names(data)) boxplot(data[[col]], main = col)
for (col in names(data)) print(boxplot.stats(data[[col]]))

## Correlation matrix, then save the renamed data
cor(data)
write.csv(data, "final.csv")
```
  • 3. > boxplot.stats(data$Duration)
$stats
[1] 4 12 18 24 42
$n
  • 4. [1] 1000
$conf
[1] 17.40043 18.59957
$out
[1] 48 48 48 48 48 48 48 48 47 48 60 54 48 48 60 48 60 48 48 60 48 48 48 48 48 48 60 60 45
[30] 48 48 48 72 48 60 60 48 60 60 48 60 48 48 48 60 45 48 48 48 48 48 48 48 60 48 48 48 48
[59] 48 48 45 48 54 48 45 48 45 48 48 48
> boxplot.stats(data$Purpose)
$stats
[1] 0 1 2 3 6
$n
[1] 1000
$conf
[1] 1.900072 2.099928
$out
[1] 9 10 10 9 9 9 9 9 9 9 9 9 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[30] 8 9 9 9 8 9 9 9 8 9 9 9 10 9 9 9 9 9 9 9 9 9 9 10 9 9 8 9 9
[59] 9 9 9 9 8 9 9 9 9 9 10 10 8 9 9 10 9 9 8 9 9 9 9 9 9 9 9 9 9
[88] 9 9 9 9 9 10 10 10 9 9 9 9 10 9 9 9 9 8 9 9 9 9 9 9 9 9 9 9 10
[117] 9 9
> boxplot.stats(data$Credit_amount)
$stats
[1] 250.0 1365.0 2319.5 3972.5 7882.0
$n
[1] 1000
$conf
[1] 2189.219 2449.781
$out
[1] 10875 8858 12749 8072 8487 12169 10722 8613 8588 10366 8133 9436 10477 13756
[15] 11760 14179 10974 9566 8358 9857 10222 9055 7966 12204 8229 10623 9277 15857
[29] 10144 15653 8335 8471 8947 11054 9157 9283 14555 9271 8386 14318 15672 10961
[43] 7980 11560 11328 11938 14782 12612 9398 9572 8065 9034 14027 9629 12976 10297
[57] 14421 8086 10127 12389 11590 15945 9960 8648 8318 11816 11998 18424 14896 8978
[71] 12579 12680
> boxplot.stats(data$Other_parties)
$stats
[1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1
$out
[1] 3 2 3 3 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 3 3 2 3 3 3 2 2 3 3 3 3 3 2 3 3 2 2 3
[45] 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 3 3 3 2 2 2 2 2 2 2 3 2 2 2 3 2 2 3 3
[89] 3 2 2 3 2
> boxplot.stats(data$Age)
$stats
[1] 19 27 33 42 64
$n
[1] 1000
$conf
[1] 32.25054 33.74946
$out
  • 5. [1] 65 74 74 74 65 66 68 66 66 70 67 65 75 66 75 67 65 67 65 66 74 68 68
> boxplot.stats(data$Other_payment_plans)
$stats
[1] 3 3 3 3 3
$n
[1] 1000
$conf
[1] 3 3
$out
[1] 1 1 2 2 2 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[44] 1 1 2 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 2 2 2 1 1 2 2 1 1
[87] 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 1 1 1 2 1 1 1 1 1 2 2 1 1 2 2 1
[130] 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 2 1 1
[173] 1 1 1 2 1 1 1 2 1 1 1 1 2 1
> boxplot.stats(data$Housing)
$stats
[1] 2 2 2 2 2
$n
[1] 1000
$conf
[1] 2 2
$out
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 3 1 1 1 1 1 1 1 3 3 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1
[44] 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 3 1 1 3 1 1 1 1 1 3 1 1 3 3
[87] 1 1 3 1 3 3 1 1 1 3 1 3 1 3 3 3 3 3 1 1 3 1 3 1 3 1 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 3 1
[130] 3 1 1 1 3 3 3 1 3 1 1 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 1 1 1 3 1 1 1 1 3 3 1 1 3
[173] 1 3 1 1 3 3 3 1 1 3 1 1 3 3 3 3 3 1 1 3 3 1 3 1 3 1 3 1 3 1 3 1 1 1 1 3 1 3 3 1 1 3 1
[216] 1 1 1 3 3 3 1 3 3 1 1 1 3 1 1 1 1 3 1 3 1 1 1 3 1 3 1 1 1 3 3 3 1 1 1 3 3 1 1 1 1 3 1
[259] 1 1 3 3 1 3 1 3 1 1 1 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 3
> boxplot.stats(data$Existing_credits)
$stats
[1] 1 1 1 2 3
$n
[1] 1000
$conf
[1] 0.950036 1.049964
$out
[1] 4 4 4 4 4 4
> boxplot.stats(data$Job)
$stats
[1] 3 3 3 3 3
$n
[1] 1000
$conf
[1] 3 3
$out
[1] 2 2 2 2 2 2 1 1 4 2 2 2 2 2 2 2 4 2 4 2 2 4 2 2 4 2 4 4 2 2 2 4 2 2 4 4 4 2 4 4 2 4 2
[44] 2 2 4 4 2 2 2 4 2 4 4 2 4 4 4 4 4 4 2 2 4 4 4 2 1 1 2 1 2 2 4 2 2 2 2 2 2 4 2 4 4 2 2
[87] 2 2 4 2 4 2 2 2 4 2 2 2 2 4 2 2 2 2 4 2 4 2 1 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 4 4
[130] 2 4 2 2 2 2 2 4 4 4 2 4 2 2 1 2 2 1 1 2 1 4 4 2 4 2 4 1 4 4 2 2 4 2 2 2 4 4 2 1 4 2 2
[173] 2 4 4 1 4 4 4 4 4 4 2 4 2 2 4 2 2 2 2 2 4 4 4 2 2 2 2 4 4 2 4 2 4 4 4 4 4 2 1 2 2 4 2
[216] 4 2 4 4 4 2 2 4 4 2 4 4 4 4 4 4 4 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 1 2 2 2 2
[259] 2 2 4 1 2 4 4 4 2 4 4 4 4 4 4 2 2 2 2 2 4 2 1 2 2 4 4 2 4 4 4 4 2 2 2 1 2 4 4 2 4 4 2
[302] 2 4 2 4 4 4 4 2 2 4 4 2 4 4 2 2 2 2 4 2 2 4 4 4 1 4 4 4 4 4 4 4 2 2 2 2 2 4 2 2 2 4 2
[345] 4 2 2 2 2 1 1 2 2 2 2 4 4 2 4 4 4 4 4 2 2 1 4 2 4 4
> boxplot.stats(data$Num_dependents)
$stats
  • 6. [1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1
$out
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[44] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[87] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[130] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> boxplot.stats(data$Foreign_worker)
$stats
[1] 1 1 1 1 1
$n
[1] 1000
$conf
[1] 1 1
$out
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
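The `$out` values above follow from the default box-plot rule: a value is reported when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The following is a minimal base R sketch of that rule, run on made-up values rather than the project data; the helper name `iqr_outliers` is hypothetical. It may also help explain why integer-coded nominal attributes such as Housing or Job report so many "outliers": when both hinges coincide, every other code falls outside the fences.

```r
# Minimal sketch of the 1.5 * IQR rule behind boxplot.stats()$out.
# `iqr_outliers` is a hypothetical helper; the inputs below are made up.
iqr_outliers <- function(x, coef = 1.5) {
  h <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
  spread <- h[4] - h[2]           # distance between the hinges
  lower <- h[2] - coef * spread
  upper <- h[4] + coef * spread
  x[x < lower | x > upper]
}

# Numeric attribute: only the extreme values are flagged.
duration_like <- c(4, 6, 12, 12, 12, 18, 18, 18, 24, 24, 24, 48, 60)
out_numeric <- iqr_outliers(duration_like)

# Integer-coded nominal attribute: both hinges equal 2, so every other
# code is flagged even though it is a legitimate category.
housing_like <- c(rep(2, 14), 1, 1, 1, 3, 3)
out_coded <- iqr_outliers(housing_like)

print(out_numeric)
print(out_coded)
```

Under this reading, the flagged values for nominal attributes are category codes rather than anomalous records, so they would not normally be removed.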
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. I. Analyze the distribution of numeric attributes (normal or other). Plot histograms for attributes of concern and analyze whether they have any influence on the class attribute.
       1. The correlation function used in RStudio, as well as the scatter matrix, shows which attributes may be correlated with the class.
    II. Load the dataset in Weka and click on the visualization tab. Which attributes seem to be correlated? Which attributes seem to be most linked to the class attribute?
    Initial Analysis:
       1. Credit_amount
       2. Age
       3. Job
       4. Savings_status
       5. Existing_credits
       6. Installment_commitment
       7. Property_magnitude
  • 12. According to the above feature selection methods, the following (in no particular order) seem to be most correlated with the class:
       1. Account Balance
       2. Credit History
       3. Duration
       4. Value Savings/Stocks
       5. Length of current employment
       6. Concurrent Credits
       7. Purpose
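As a rough illustration of the correlation step described in slides 11 and 12, the sketch below ranks attributes by the absolute value of their correlation with a 0/1 class column. The data frame is synthetic and its column names only mimic the report; it does not reproduce the project's results. Pearson correlation is only meaningful for numeric attributes, so integer-coded nominal attributes would need a different measure.

```r
# Sketch: rank attributes by absolute correlation with a 0/1 class column.
# The data frame is synthetic; column names only mimic the report.
set.seed(1)
n <- 200
toy <- data.frame(
  Duration      = round(runif(n, 4, 72)),
  Credit_amount = round(runif(n, 250, 18000)),
  Age           = round(runif(n, 19, 75))
)
# Synthetic class: shorter durations are made more likely to be "good" (1).
toy$CreditStatus <- as.integer(toy$Duration + rnorm(n, sd = 15) < 40)

predictors <- setdiff(names(toy), "CreditStatus")
class_cor <- sapply(toy[predictors], function(x) cor(x, toy$CreditStatus))
ranked <- sort(abs(class_cor), decreasing = TRUE)
print(round(ranked, 3))
```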
  • 13.
  • 14.
  • 15. I. Determine whether the dataset has an imbalanced class distribution (same proportion of records of different types or not).
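A quick way to check the class distribution in R is to tabulate the class column and convert the counts to proportions. The sketch below uses a made-up 0/1 vector named `status` in place of the Creditability column, so the numbers are illustrative only.

```r
# Sketch: inspect class balance. `status` is a made-up 0/1 vector
# standing in for the class column.
status <- c(rep(1, 70), rep(0, 30))
counts <- table(status)
shares <- prop.table(counts)                 # proportion of each class
imbalance_ratio <- min(counts) / max(counts) # minority size relative to majority
print(counts)
print(round(shares, 2))
print(round(imbalance_ratio, 3))
```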
  • 16. • There was an imbalance in the class attribute, with a ratio of up to 40%, hence the classes need to be balanced. This is common in datasets used for fraud analysis.
       ◦ SMOTE was used to balance the classes, and the balanced classes were then visualized.
    I. Determine whether you need to handle missing values or transform any attributes. Weka filters (on the main tab) can be used for this purpose.
    • WEKA shows Missing Values (0%), hence missing values do not need to be handled. The class values are already nominal, but the numeric attributes need to be normalized.
       ◦ Normalize the numeric features so that the range or scale is between 0 and 1.
       ◦ All values are correctly categorized as nominal or numeric.
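The two preprocessing steps named above, scaling numeric features to the range 0 to 1 and oversampling the minority class with SMOTE, can be sketched in base R as follows. This is a simplified illustration for numeric features only, not Weka's SMOTE filter: `min_max` and `smote_sketch` are hypothetical helpers, the minority sample is synthetic, and "100% SMOTE" is read here as creating one synthetic record per minority record.

```r
# Sketch only: min-max scaling followed by a simplified SMOTE for numeric
# features. Not Weka's SMOTE filter; helper names and data are made up.
min_max <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column
  (x - rng[1]) / (rng[2] - rng[1])
}

smote_sketch <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- as.matrix(dist(X))          # pairwise Euclidean distances
  diag(d) <- Inf                   # a record is not its own neighbour
  k <- min(k, n - 1)
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    a <- sample.int(n, 1)                      # random minority record
    neighbours <- order(d[a, ])[seq_len(k)]    # its k nearest neighbours
    b <- neighbours[sample.int(k, 1)]
    gap <- runif(1)                            # position on the segment a -> b
    synth[i, ] <- X[a, ] + gap * (X[b, ] - X[a, ])
  }
  synth
}

set.seed(42)
minority <- data.frame(
  Duration      = c(12, 24, 36, 48, 60, 18, 30),
  Credit_amount = c(1500, 3200, 5000, 9000, 12000, 2100, 4300)
)
scaled <- as.data.frame(lapply(minority, min_max))
# Assumed reading of "100% SMOTE": as many synthetic records as minority records.
synthetic <- smote_sketch(scaled, n_new = nrow(scaled), k = 3)
balanced_minority <- rbind(as.matrix(scaled), synthetic)
print(dim(balanced_minority))
```

Scaling before oversampling matters in this sketch because the nearest-neighbour search is distance based; without it, Credit_amount would dominate Duration.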
  • 17. Predictive Modeling/Classification
    I. Determine the right strategy for the dataset split: simple training/testing split, 10-fold cross-validation, 3-fold cross-validation, etc.
    II. Investigate the use of different parameters present in Weka for Decision Tree and compare the results obtained in different settings. Understand the decision trees generated by Weka.
    III. Repeat the same process for Naïve Bayes and a third classification algorithm of your choice.
    The following algorithms were run and tested in order to determine which one gives the most accurate results for the dataset.
    IV. Determine your performance measures (accuracy, recall, etc.).
    V. Identify which algorithm performs well and in which settings.
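The project ran its experiments in Weka, but the idea of 10-fold cross-validation can be sketched in base R as well. The example below fits a logistic regression with `glm` on synthetic data and reports the mean accuracy over the folds; `cv_accuracy`, the column names, and the 0.5 decision threshold are assumptions made for illustration.

```r
# Sketch: 10-fold cross-validation of a logistic regression in base R.
# Data are synthetic; `cv_accuracy` is a hypothetical helper.
set.seed(7)
n <- 300
toy <- data.frame(Duration = runif(n, 4, 72), Age = runif(n, 19, 75))
toy$Good <- as.integer(60 - toy$Duration + rnorm(n, sd = 20) > 20)

cv_accuracy <- function(df, k = 10) {
  # Assign every record to one of k folds at random.
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))
  acc <- numeric(k)
  for (i in seq_len(k)) {
    train <- df[fold != i, ]
    test  <- df[fold == i, ]
    fit <- glm(Good ~ Duration + Age, data = train, family = binomial)
    prob <- predict(fit, newdata = test, type = "response")
    acc[i] <- mean(as.integer(prob > 0.5) == test$Good)
  }
  acc
}

fold_acc <- cv_accuracy(toy, k = 10)
print(round(mean(fold_acc), 3))
```

An 80-20 percentage split, the other strategy used in the results table, corresponds to a single train/test partition instead of the loop over folds.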
  • 18.
  • 19.
  • 20. The highlighted row in the table indicates the algorithm and parameters that gave the best results. The Support Vector Machine with 100% SMOTE came very close, but according to the ROC curve results, Logistic Regression with 100% SMOTE and 10-fold cross-validation is the best.

    SMOTE | Split | Algorithm | Accuracy | Sensitivity | Specificity | Mean Abs Err | RMS Err | Rel Abs Err
    NO | 80-20 | Logistic Regression | 73 | 69.4 | 74.83 | 32.55 | 42.28 | 73.97
    NO | 80-20 | Multilayer Perceptron | 72 | 64 | 74.6 | 27.3 | 48.1 | 62.1
    NO | 80-20 | Decision Tree | 73.5 | 69.38 | 74.6 | 32.33 | 48.57 | 73.48
    NO | 80-20 | Support Vector Machine | 75 | 72.9 | 75.6 | 25 | 50 | 56.82
    100% | 10FCV | Logistic Regression | 76.3 | 74.66 | 77.68 | 30.11 | 39.88 | 60.57
    100% | 10FCV | Support Vector Machine | 76.84 | 74.8 | 78.5 | 23.15 | 48.12 | 46.58
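For reference, the accuracy, sensitivity and specificity columns can be derived from the counts of a confusion matrix. The sketch below shows the arithmetic on made-up labels; `classification_metrics` is a hypothetical helper, and which class counts as "positive" is an assumption that should match the convention used in Weka's output.

```r
# Sketch: accuracy, sensitivity and specificity from predictions.
# `classification_metrics` is hypothetical; the labels are made up.
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(actual == positive & predicted == positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),   # true positive rate (recall)
    specificity = tn / (tn + fp))   # true negative rate
}

actual    <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 1)
m <- classification_metrics(actual, predicted)
print(round(100 * m, 2))
```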
  • 21. Post Predictive Modelling:
    ● Investigate the use of the K-Means algorithm to segment the data of the predicted class of importance.
    ● Analyze each segment (group or cluster) and identify the characteristics of customers (type of records) in each group; e.g., the characteristics of a group/cluster can be determined by finding the majority value of each attribute in that group.
    ● Explain your interpretation of the characteristics and state the recommendations for the organization.
    Here are some suggestions for pattern mining:
    ● Explore association-rule-based patterns for the records of the class of interest by using the Apriori algorithm in Weka on your dataset.
    ● You may have to use selected qualitative (categorical or ordinal) attributes to discover patterns.
    ● Try different values for minimum support and confidence, select the values that provide an appropriate number of rules, and justify your selection.
    ● Identify the frequent and logically correct patterns and state your recommendations for the organization on different types of customers belonging to the predicted class.
    Describe your analysis for this section in your report.
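The segmentation step can be sketched with base R's `kmeans`. The example below clusters synthetic applicants and then profiles each cluster by its attribute means; the data, the column names and the choice of three clusters are all assumptions for illustration, and the report's own clustering was done in Weka.

```r
# Sketch: segment records with k-means and profile each cluster.
# Synthetic data; the choice of 3 clusters is arbitrary.
set.seed(3)
applicants <- data.frame(
  Duration      = c(rnorm(30, 12, 3), rnorm(30, 36, 4), rnorm(30, 60, 5)),
  Credit_amount = c(rnorm(30, 1500, 300), rnorm(30, 5000, 600), rnorm(30, 12000, 900))
)
scaled <- scale(applicants)                    # k-means is sensitive to scale
km <- kmeans(scaled, centers = 3, nstart = 20)
applicants$cluster <- km$cluster

# Cluster profile: mean of each attribute per segment.
profile <- aggregate(cbind(Duration, Credit_amount) ~ cluster,
                     data = applicants, FUN = mean)
print(profile)
print(table(applicants$cluster))
```

For nominal attributes, the per-cluster profile would use the most frequent category instead of the mean, which matches the "majority" description in the slide.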
  • 22.
  • 23. Conclusions & Recommendations:
    We went through all the algorithms and models in WEKA. The best results we obtained are described above in tabular form. By accuracy, the best solution was the SVM with 100% SMOTE and 10-fold cross-validation. However, as can be seen in the ROC curve comparison, Logistic Regression with 100% SMOTE and 10-fold cross-validation yields the highest value, at 0.8474. Hence this was the best model. It was used to generate predictions in the output window, and these were pretty close to the actual results. The negatives did not look accurate, but that is because we are dealing with an unbalanced dataset.
    5.0 Conclusion
    Classification is a form of data analysis that extracts models describing important data classes. We have developed an effective and scalable model using SMOTE in combination with Support Vector Machine, Logistic Regression, Multilayer Perceptron and 10-fold cross-validation. We have evaluated the model using several metrics, including accuracy, sensitivity, specificity, Mean Absolute Error, Root Mean Square Error and Relative Absolute Error. 10-fold cross-validation is recommended for accuracy estimation, and significance tests and ROC curves are useful for model selection.
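Since the final model choice rests on the ROC comparison, it may help to see how an area under the ROC curve can be computed. The sketch below uses the rank-based (Mann-Whitney) identity on made-up scores and labels; `auc_rank` is a hypothetical helper and the numbers are unrelated to the 0.8474 reported above.

```r
# Sketch: area under the ROC curve via the rank (Mann-Whitney) identity.
# `auc_rank` is a hypothetical helper; scores and labels are made up.
auc_rank <- function(labels, scores) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)                       # average ranks handle ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 1, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
auc_value <- auc_rank(labels, scores)
print(auc_value)
```

The value is the fraction of (positive, negative) pairs in which the positive record receives the higher score, so 1 means perfect ranking and 0.5 means no better than chance.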