CREDIT RISK PREDICTION MODEL
Submitted by:
Rituparna Sarkar
Outline
1. Project Objective
2. Process Approach
3. Data Source and Variables
4. Data Analysis
5. Data Pre-processing
6. Exploratory Analysis
7. Model development
i. Training the model
ii. Validation
8. Conclusion & Limitations
Project Objective
To develop a prediction model to assess credit risk to
borrowers
• Do all borrowers have an equal probability to default?
• Is there a way to determine risk of defaulting before
processing a credit request?
• Can we classify customers into two groups, i.e., Risky and
Non-Risky based on the nature of their financial data?
• Which are the key factors to be considered to assess risk
of lending to an individual based on historic data?
Process Approach
Business Understanding
1. Develop a predictive model to assess the credit risk to borrowers
2. Develop business understanding of the data, the relationships between variables and the data sources to be used
Data Understanding
1. Get data from relevant data sources
2. Explore data for missing values, outliers and invalid data through descriptive statistics and visualization techniques
3. Understand the business relevance of outliers, missing values and invalid data and formulate the approach to treat them accordingly
Data Preparation
1. Data splitting for training and test
2. Data clean up for missing values, outliers and invalid data
3. Data binning and imputation for outlier treatment
4. Binning independent variables as per business needs
5. Data exploration for patterns and collinearity test
Modeling
1. Develop a logistic regression model to classify customers into two groups based on credit risk probability
2. Train the model using 80% of the training data
Evaluation
1. Validate the trained model using the remaining 20% of the training data
2. If satisfied with the accuracy percentages, proceed to testing using the test dataset; otherwise go back to the previous step (modeling) and train the model again
Deployment
When satisfied with the test results, deploy the model to help the business take decisions based on the predictions given by the model
* Software Used – Excel & SPSS
Data Source and Variables
• The data source is a dataset of 2,50,000 records taken from the Kaggle website. The dataset
was split into two parts: 1,50,000 cases for training and validation and the remaining
1,00,000 cases for testing the model.
• Data Dictionary for variables in dataset:
Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer
DATA ANALYSIS
Descriptive Statistics
• There are 1,50,000 cases in the training dataset
• Out of the 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed
• MonthlyIncome has a large number of missing values. NumberOfDependents also has some missing values
• There are many extreme values (outliers) in RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome, as indicated by their high standard deviations.
Missing Value Analysis
• About 2.6% of NumberOfDependents values are missing (less than 5%), so these cases can be removed
• Around 20% of MonthlyIncome values are missing, which is quite high, so they need to be imputed
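The rule applied above (remove cases when fewer than 5% of values are missing, impute otherwise) can be sketched in a few lines. This is only an illustration: the 5% threshold comes from the slide, while the function names and the toy records are assumptions and are not part of the original Excel/SPSS workflow.

```python
from typing import Optional

def missing_share(values: list[Optional[float]]) -> float:
    """Return the fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def missing_value_plan(columns: dict[str, list[Optional[float]]],
                       drop_threshold: float = 0.05) -> dict[str, str]:
    """Suggest a treatment per column based on its share of missing values."""
    plan = {}
    for name, values in columns.items():
        share = missing_share(values)
        if share == 0:
            plan[name] = "complete"
        elif share < drop_threshold:
            plan[name] = "drop cases"   # few missing values: removing the cases loses little data
        else:
            plan[name] = "impute"       # many missing values: dropping would lose too much data
    return plan

# Toy data only (not the Kaggle dataset): 1 of 40 dependents missing, 8 of 40 incomes missing.
toy = {
    "NumberOfDependents": [0.0] * 39 + [None],
    "MonthlyIncome": [5000.0] * 32 + [None] * 8,
    "age": [45.0] * 40,
}
plan = missing_value_plan(toy)
print(plan)
```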
DATA PRE-PROCESSING
Data Cleaning Steps
Invalid data identified below, to be removed in the Excel sheet:
• Age variable: one case showing 0
• Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate
and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96
and 98, which indicate ‘Don’t know’ and ‘Refused to Say’. These cases are very few in
number and are common to all three variables.
Data Formatting in Excel
Variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from
General to Number format
Imputation in SPSS:
• Imputation for missing values in MonthlyIncome
• 5 imputations done using all independent variables and 5th imputation results
taken for training
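The invalid-data rules listed above can be expressed as a simple record filter. This is a minimal sketch in Python rather than the Excel steps the slides describe; the helper names `is_valid` and `make_record` and the sample records are hypothetical.

```python
# Codes the slides describe as 'Don't know' / 'Refused to Say'.
INVALID_CODES = {96, 98}
PAST_DUE_COLUMNS = (
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
)

def is_valid(record: dict) -> bool:
    """Keep a record only if age is positive and no past-due count is a 96/98 code."""
    if record["age"] <= 0:
        return False
    return all(record[column] not in INVALID_CODES for column in PAST_DUE_COLUMNS)

def make_record(age: int, d30: int = 0, d60: int = 0, d90: int = 0) -> dict:
    """Build a toy record with only the fields the filter looks at."""
    return {
        "age": age,
        PAST_DUE_COLUMNS[0]: d30,
        PAST_DUE_COLUMNS[1]: d60,
        PAST_DUE_COLUMNS[2]: d90,
    }

# Toy records: one valid, one with age 0, two with the 98 / 96 codes.
records = [
    make_record(45, d30=1),
    make_record(0),
    make_record(38, 98, 98, 98),
    make_record(52, 96, 96, 96),
]
cleaned = [r for r in records if is_valid(r)]
print(len(records), "->", len(cleaned))
```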
Descriptive Statistics After Data Cleaning
• After data cleaning, the total number of cases is down to 145837
• Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be removed through binning
Variable Binning
Binning done for following variables:
• Age: Age_Binning containing bins for age group:
Age Group | Bin
21-30 | 1
31-40 | 2
41-50 | 3
51-60 | 4
>60 | 5
• DebtRatio & RevolvingUtilizationOfUnsecuredLines: Created variables
DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with
following cut off values:
Group | Bin | Remark
<=0.25 | 1 | Good
0.25 - 0.50 | 2 | Low Risk
> 0.50 | 3 | High Risk
• MonthlyIncome: Variable MonthlyIncome_Binning with 5 equal width bins
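The age and ratio cut-offs above translate directly into lookup functions. The sketch below uses the bin edges and remarks from the slide; the function names are illustrative, and two choices are assumptions because the slide does not specify them: ages below 21 fall into bin 1, and a value exactly on a boundary (0.25 or 0.50) goes to the lower bin.

```python
from bisect import bisect_left

AGE_UPPER_BOUNDS = [30, 40, 50, 60]      # bins 1-4; anything above 60 is bin 5
RATIO_UPPER_BOUNDS = [0.25, 0.50]        # bins 1-2; anything above 0.50 is bin 3
RATIO_REMARKS = {1: "Good", 2: "Low Risk", 3: "High Risk"}

def age_bin(age: int) -> int:
    """Map an age in years to the Age_Binning code (1-5)."""
    return bisect_left(AGE_UPPER_BOUNDS, age) + 1

def ratio_bin(ratio: float) -> tuple[int, str]:
    """Map DebtRatio or RevolvingUtilizationOfUnsecuredLines to (bin, remark)."""
    bin_number = bisect_left(RATIO_UPPER_BOUNDS, ratio) + 1
    return bin_number, RATIO_REMARKS[bin_number]

for age in (25, 30, 31, 60, 61):
    print(age, "->", age_bin(age))
for ratio in (0.10, 0.25, 0.40, 0.50, 0.90):
    print(ratio, "->", ratio_bin(ratio))
```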
EXPLORATORY ANALYSIS
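Exploratory Analysis (Using SPSS)
Delinquency over different categories
a) Age
Age_Binning * SeriousDlqin2yrs Crosstabulation
Age_Binning | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Total | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
21 - 30 | 7374 | 940 | 8314 | 5.42% | 9.68% | 5.70%
31 - 40 | 20562 | 2285 | 22847 | 15.11% | 23.53% | 15.67%
41 - 50 | 31130 | 2828 | 33958 | 22.87% | 29.12% | 23.28%
51 - 60 | 32334 | 2213 | 34547 | 23.75% | 22.79% | 23.69%
60 + | 44725 | 1446 | 46171 | 32.86% | 14.89% | 31.66%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• The dependent variable is heavily imbalanced (only 9712 of 145837 cases have SeriousDlqin2yrs = 1). Sampling of the training dataset is required to remove bias in model development
• The largest number of customers is in the 60+ age group
• The 41-50 age group accounts for the largest share of delinquencies (29.12%) and the 21-30 group for the smallest (9.68%). Within each group, however, the delinquency rate is highest for 21-30 (940 of 8314, about 11.3%) and lowest for 60+ (1446 of 46171, about 3.1%)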
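The age crosstab supports two different readings: each age group's share of all delinquencies, and the delinquency rate within each group. The short sketch below computes both from the counts in the table; the variable names are illustrative.

```python
# (count with SeriousDlqin2yrs = 0, count with SeriousDlqin2yrs = 1) per age bin,
# copied from the crosstab.
age_counts = {
    "21-30": (7374, 940),
    "31-40": (20562, 2285),
    "41-50": (31130, 2828),
    "51-60": (32334, 2213),
    "60+": (44725, 1446),
}

total_delinquent = sum(bad for _, bad in age_counts.values())

# Share: what fraction of all delinquent cases sits in this age group.
shares = {group: bad / total_delinquent for group, (_, bad) in age_counts.items()}
# Rate: what fraction of this age group is delinquent.
rates = {group: bad / (good + bad) for group, (good, bad) in age_counts.items()}

for group in age_counts:
    print(f"{group}: share of delinquencies {shares[group]:.2%}, rate within group {rates[group]:.2%}")

largest_share_group = max(shares, key=shares.get)
highest_rate_group = max(rates, key=rates.get)
lowest_rate_group = min(rates, key=rates.get)
```

The two measures point at different groups, which is why it matters to state which one a "highest risk" claim refers to.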
Exploratory Analysis (Contd.)
b) Number of Dependents
NumberOfDependents * SeriousDlqin2yrs Crosstabulation
NumberOfDependents | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Total | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
0 | 81722 | 4992 | 86714 | 60.03% | 51.40% | 59.46%
1 | 24372 | 1921 | 26293 | 17.90% | 19.78% | 18.03%
2 | 17930 | 1571 | 19501 | 13.17% | 16.18% | 13.37%
3 | 8646 | 833 | 9479 | 6.35% | 8.58% | 6.50%
4 | 2564 | 296 | 2860 | 1.88% | 3.05% | 1.96%
5 | 677 | 68 | 745 | 0.50% | 0.70% | 0.51%
6 | 134 | 24 | 158 | 0.10% | 0.25% | 0.11%
7 | 46 | 5 | 51 | 0.03% | 0.05% | 0.03%
8 | 22 | 2 | 24 | 0.02% | 0.02% | 0.02%
9 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
10 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
13 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
20 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• Around 60% of cases have 0 dependents; this group also has the highest delinquency count and the largest share of delinquencies
• Cases with more than 3 dependents make up only about 2.6% of the data
Exploratory Analysis (Contd.)
c) Debt Ratio
DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation
DebtRatio (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Total | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 24825 | 1472 | 26297 | 36.47% | 30.31% | 36.06%
0.26 - 0.50 | 19181 | 1256 | 20437 | 28.18% | 25.86% | 28.03%
0.51+ | 24057 | 2128 | 26185 | 35.35% | 43.82% | 35.91%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
• Around 44% of delinquencies come from the group with DebtRatio > 0.5
d) RevolvingUtilizationOfUnsecuredLines
RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation
RevolvingUtilizationOfUnsecuredLines (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Total | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 41954 | 912 | 42866 | 61.64% | 18.78% | 58.79%
0.26 - 0.50 | 9680 | 573 | 10253 | 14.22% | 11.80% | 14.06%
0.51+ | 16429 | 3371 | 19800 | 24.14% | 69.42% | 27.15%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
• Around 69% of delinquencies come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5
Exploratory Analysis (Contd.)
e) Monthly Income
MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation
MonthlyIncome (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Total | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 3100.00 | 26699 | 2494 | 29193 | 19.61% | 25.68% | 20.02%
3100.01 - 5000.00 | 29083 | 2518 | 31601 | 21.36% | 25.93% | 21.67%
5000.01 - 7083.00 | 25214 | 1766 | 26980 | 18.52% | 18.18% | 18.50%
7083.01 - 10823.00 | 27435 | 1461 | 28896 | 20.15% | 15.04% | 19.81%
10823.01+ | 27694 | 1473 | 29167 | 20.34% | 15.17% | 20.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• More than 50% of defaulters (51.61%) come from the two lowest income bins, which hold about 42% of the cases
• The other 3 groups have more or less the same percentage of defaulters
Exploratory Analysis (Contd.)
MonthlyIncome vs. Other Financial Variables
All parameters below show a similar pattern: low income ranges are associated with high values of the debt indicators
i) RevolvingUtilizationOfUnsecuredLines,
ii) DebtRatio,
iii) NumberOfTime30-59DaysPastDueNotWorse,
iv) NumberOfTimes90DaysLate,
v) NumberOfTime60-89DaysPastDueNotWorse,
vi) NumberOfOpenCreditLinesAndLoans,
vii) NumberRealEstateLoansOrLines
Collinearity Diagnostics
Sample collinearity diagnostic results for Age vs. the other 9 independent variables are shown here
Similar diagnostics were performed for each of the 10 variables against the other variables
The Condition Index was always less than 15, indicating no collinearity between the independent variables
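For readers without SPSS, the condition index check can be approximated as follows. This is a sketch under stated assumptions: it uses the common Belsley-style definition (add an intercept, scale each column to unit length, then divide the largest singular value by each singular value), which may differ in detail from the SPSS output the slides refer to. The data is synthetic, and the threshold of 15 is the one quoted above.

```python
import numpy as np

def condition_indices(X: np.ndarray) -> np.ndarray:
    """Condition indices for a predictor matrix X (n_samples x n_features)."""
    X = np.column_stack([np.ones(len(X)), X])    # add an intercept column
    X = X / np.linalg.norm(X, axis=0)            # scale every column to unit length
    singular_values = np.linalg.svd(X, compute_uv=False)
    return singular_values.max() / singular_values

# Synthetic predictors only (not the credit dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
independent = np.column_stack([a, b])
collinear = np.column_stack([a, a + 1e-4 * rng.normal(size=500)])  # near-duplicate column

max_independent = condition_indices(independent).max()
max_collinear = condition_indices(collinear).max()
print("independent predictors, largest condition index:", max_independent)
print("near-duplicate predictors, largest condition index:", max_collinear)
```

Unrelated predictors give indices close to 1, while a near-duplicate column pushes the largest index far above the threshold.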
MODEL DEVELOPMENT
Logistic Regression Model
• The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0
  • 1 indicates risk of defaulting
  • 0 indicates no risk
• As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7% of the total, a 50:50 stratified sampling approach is followed to build the model
• The pre-processed training dataset is used to draw samples for training and validation of the model
• 80% random samples drawn from the training dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, are used for developing and training the model
• 20% random samples drawn from the same dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, are used for validation
• Final model tested using the test dataset given
• Logistic regression models were developed and compared using three different approaches:
  • With binned variables (Model 1)
  • Binned as in Model 1, but with missing data binned into a separate category instead of clean up/imputation, wherever applicable (Model 2)
  • A model without binning, using the variables directly (Model 3)
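One way to read the 50:50 sampling described above is to draw the same number of cases from each class and then split each class 80/20. The sketch below follows that reading. The function name, the seed, the toy records and the choice to keep the training and validation samples disjoint are all assumptions; the slides do not describe the SPSS procedure in this detail.

```python
import random

def balanced_split(records, n_per_class, label_key="SeriousDlqin2yrs",
                   train_fraction=0.8, seed=42):
    """Draw n_per_class cases of each class, then split each class into train/validation."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    n = min(n_per_class, len(positives), len(negatives))
    sampled_pos = rng.sample(positives, n)
    sampled_neg = rng.sample(negatives, n)
    cut = int(n * train_fraction)
    train = sampled_pos[:cut] + sampled_neg[:cut]
    validation = sampled_pos[cut:] + sampled_neg[cut:]
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation

# Toy population with 7% positives, similar in spirit to the imbalance noted above.
records = [{"id": i, "SeriousDlqin2yrs": 1 if i < 70 else 0} for i in range(1000)]
train, validation = balanced_split(records, n_per_class=50)

train_positives = sum(r["SeriousDlqin2yrs"] for r in train)
validation_positives = sum(r["SeriousDlqin2yrs"] for r in validation)
print(len(train), train_positives, len(validation), validation_positives)
```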
MODEL 1 – WITH BINNING
• The model has been developed considering business needs, and therefore the bins have been created using business cut offs.
• In the current model, cases with missing values for NoOfDependents, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and missing values in MonthlyIncome have been imputed.
• RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, so bins have been created for them. Bins were created for the Age variable as well.
• Dummy variables were created for the categories in the binned variables, clubbing insignificant bins together to have better control of the model.
• The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and 4500 with SeriousDlqin2yrs = 0).
• The model comprises 10 variables, including 4 dummy variables.
Model 1 – Output
• The logit function equation for the model is:
logit(p) = -0.595
  + 0.597 * NumberOfTime3059DaysPastDueNotWorse
  + 1.029 * NumberOfTimes90DaysLate
  + 0.072 * NumberRealEstateLoansOrLines
  + 0.862 * NumberOfTime6089DaysPastDueNotWorse
  + 0.030 * NumberOfOpenCreditLinesAndLoans
  - 0.025 * Age
  + 0.825 * RU_0_.25(1)
  + 0.689 * RU_0(1)
  - 0.783 * RU_GT_.5(1)
  + 0.129 * DebtRatio_GT0.25_0.5(1)
• A cut off value of 0.5 gave optimal results
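To show how the logit equation and the 0.5 cut off turn into a prediction, here is a minimal scoring sketch using the coefficients from the slide. It is not the SPSS model itself. One point is an assumption that should be checked against the SPSS output: how the `(1)` suffix on the dummy terms maps to a 0/1 value depends on the categorical coding, which the slides do not state, so the sketch simply treats each `(1)` term as a 0/1 input supplied by the caller. The function names and the two example applicants are hypothetical.

```python
import math

INTERCEPT = -0.595
COEFFICIENTS = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    "RU_0_.25(1)": 0.825,               # indicator terms exactly as named in the equation
    "RU_0(1)": 0.689,
    "RU_GT_.5(1)": -0.783,
    "DebtRatio_GT0.25_0.5(1)": 0.129,
}

def default_probability(features: dict) -> float:
    """Apply the logistic function to the linear predictor; absent features count as 0."""
    z = INTERCEPT + sum(coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def classify(features: dict, cutoff: float = 0.5) -> int:
    """Return 1 (Risky) when the predicted probability reaches the cut off, else 0."""
    return int(default_probability(features) >= cutoff)

# Hypothetical applicants, for illustration only.
quiet_history = {"Age": 40}
late_history = {"Age": 40, "NumberOfTimes90DaysLate": 3}

p_quiet = default_probability(quiet_history)
p_late = default_probability(late_history)
print(round(p_quiet, 3), classify(quiet_history))
print(round(p_late, 3), classify(late_history))
```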
Model 1 - Variables Used
• Variables used
  • Age
  • NumberOfTime3059DaysPastDueNotWorse
  • NumberOfTime6089DaysPastDueNotWorse
  • NumberOfTimes90DaysLate
  • NumberOfOpenCreditLinesAndLoans
  • NumberRealEstateLoansOrLines
  • DebtRatio – dummy variable used for the range DebtRatio >= 0.25 & < 0.5
  • RevolvingUtilizationOfUnsecuredLines – used 3 dummy variables: RU_0 (where RU = 0), RU_0_.25 (where RU > 0 but < 0.25) and RU_GT_.5 (where RU >= 0.5).
• Observations
  • MonthlyIncome was a significant variable but had a Beta coefficient of 0, and was therefore dropped from the model.
  • MonthlyIncome and DebtRatio were affecting each other
  • RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated.
  • Although bins were created for the Age variable, all the bins contributed equally to the model, so the Age variable was used as such.
  • NoOfDependents was initially thought to be a significant variable but turned out to be insignificant. Bins created for the NoOfDependents variable were insignificant too.
Model 1 - Validation
• Validated the developed model on a non-stratified random sample of 40% of the data (29168 records).
• Overall accuracy: 78.62% and misclassification rate: 21.38%
• Prediction accuracy for Risky (= 1) is 75.9%
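The three validation figures above are all derived from a confusion matrix: overall accuracy, misclassification rate (one minus accuracy) and accuracy on the Risky class (correctly predicted 1s divided by actual 1s). A small sketch with made-up labels shows the arithmetic; the function name and the toy labels are illustrative, and the slide's own confusion matrix is not reproduced here.

```python
def validation_metrics(actual: list[int], predicted: list[int]) -> dict[str, float]:
    """Compute overall accuracy, misclassification rate and accuracy on class 1."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 1 and p == 1 for a, p in pairs)
    true_neg = sum(a == 0 and p == 0 for a, p in pairs)
    false_neg = sum(a == 1 and p == 0 for a, p in pairs)
    accuracy = (true_pos + true_neg) / len(pairs)
    return {
        "accuracy": accuracy,
        "misclassification": 1.0 - accuracy,
        "risky_accuracy": true_pos / (true_pos + false_neg),
    }

# Toy labels only: 4 risky and 6 non-risky cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = validation_metrics(actual, predicted)
print(metrics)
```

Because the Risky class is rare in a non-stratified sample, overall accuracy alone can hide poor detection of defaulters, which is why the class-level figure is reported alongside it.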
Model 1 – Pros and Cons
• Missing values were imputed for 17% of the data and only 2% was removed, so data loss is minimal.
• The model has been developed taking into consideration widely used business cut offs and significant parameters.
• Since the model has been built on data where missing values were treated, the accuracy of the model may drop on data where missing values are present.
• Analyzing Top 10% (customers who are prone to default)
  • 67.4% of defaulters are in the age group 30-50
  • 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5
  • 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
  • 70% of the defaulters had Monthly Income less than or equal to 7466 USD and 73.3% of the defaulters did not have any dependents.
• Analyzing Bottom 10% (customers who are safe)
  • 80% of non-defaulters are more than 40 years of age.
  • 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5
  • 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
  • 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD and 50.4% of the non-defaulters did not have any dependents.
MODEL 2 – CONSIDERING MISSING VALUES
• Missing values have not been imputed here; instead, an extra category has been added to
the binned variables so that a missing value is treated as another category. (Example:
NoOfDependents_Binned shown below)
• Selection of variables has been based on B, Exp(B) and Sig values
• Optimal Binning has been used based on the SeriousDlqin2yrs variable.
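Treating "missing" as its own bin, as described above, can be sketched like this. The cut-offs `[0, 2]` are purely hypothetical (the slides do not list the optimal-binning cut-offs), and folding the 96/98 codes into the missing category is an assumption based on the later remark that this model handles them. The function names are illustrative.

```python
from bisect import bisect_left

MISSING = "Missing"

def bin_with_missing(value, cutoffs, invalid_codes=frozenset({96, 98})):
    """Return a bin label, sending None and invalid codes to a separate 'Missing' bin."""
    if value is None or value in invalid_codes:
        return MISSING
    return f"bin_{bisect_left(cutoffs, value) + 1}"

def dummy_encode(category, categories):
    """One 0/1 indicator per category, so 'Missing' becomes an ordinary model input."""
    return {f"is_{c}": int(c == category) for c in categories}

# Hypothetical cut-offs for a dependents-style variable: 0 | 1-2 | 3 or more.
cutoffs = [0, 2]
categories = ["bin_1", "bin_2", "bin_3", MISSING]

for raw in (0, 1, 2, 5, None, 98):
    label = bin_with_missing(raw, cutoffs)
    print(raw, "->", label, dummy_encode(label, categories))
```

Because no record is dropped or imputed, the same function can be applied unchanged to new data that still contains gaps.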
Model 2 – Output
• Final Model
(1.311*Age_1) + (1.107*Age_2) + (0.898*Age_3) + (0.479*Age_4)
+ (1.802*NoOf30_1) + (2.971*NoOf30_2) + (3.445*NoOf30_3) + (3.858*NoOf30_4) + (4.001*NoOf30_5)
+ (-1.784*NoOf60_1) + (-0.362*NoOf60_2)
+ (-3.125*NoOf90_1) + (-1.311*NoOf90_2) + (-0.549*NoOf90_3)
+ 1.442
• Training Set – Stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another
4000 with SeriousDlqin2yrs = 0
• A cut off value of 0.4 gave optimal results
Model 2 - Variables Used
• Variables used
  • Age_OptimalBin
  • NumberOfTime3059DaysPastDueNotWorse_OptimalBin
  • NumberOfTime6089DaysPastDueNotWorse_OptimalBin
  • NumberOfTimes90DaysLate_OptimalBin
• Possible reasons why a few other variables are not significant
  • Age has a non-linear relationship with MonthlyIncome
  • The other 3 variables in the equation indicate the number of defaults committed by the customer, which is related to NumberOfOpenCreditLinesAndLoans and RevolvingUtilizationOfUnsecuredLines
  • MonthlyIncome will affect the DebtRatio
Model 2 - Validation
• Multiple test runs have been performed on different sample sizes
• The validation figures given here are for a random sample of 90000 records.
• Overall accuracy: 72.62% and misclassification: 27.37%
• Risky (= 1) prediction accuracy of 75.1%
Model 2 – Pros and Cons
• Capable of handling missing values (including the codes 98 and 96)
• Intermediate processing required is minimal (only binning required)
• The model uses only 4 variables
• Optimal binning is used rather than the industry standard binning
• Other insights
  • Analyzing top 10% (most risky customer segment)
    84% of the customers are below 56 years of age
    72% have 1 or more past 30 days defaults
  • Analyzing bottom 10% (safest customer segment)
    All of them are 64 years of age or above
    Almost all of them have 0 defaults of any kind.
MODEL 3 – USING VARIABLES DIRECTLY
• Final model has the following equation:
0.754 + (0.031*Age) + (0.766*NumberOfTime3059DaysPastDueNotWorse) + (1.179*NumberOfTime6089DaysPastDueNotWorse) + (1.417*NumberOfTimes90DaysLate)
• This is the simplest model, but business considerations were not accounted for, so robustness on deployment cannot be assured
• It cannot handle missing values
CONCLUSION & LIMITATIONS
• Model 1 and Model 2 give similar accuracy levels. Model 3 is not
recommended. The choice of final model is left to the business, based on the
pros and cons mentioned
• These models need to be further validated for scalability and robustness
• The test dataset given did not have delinquency values; hence, after
validation with 20% random samples from the training dataset, no further
accuracy check could be performed using the test dataset on a totally
new set of data.
• Assumptions made when binning financial variables could change the
significance of different variables in the final model. This aspect needs to be
validated with the business
THANK YOU

More Related Content

What's hot

project on credit-risk-management
project on credit-risk-managementproject on credit-risk-management
project on credit-risk-management
Shanky Rana
 
Best Practice EAD Modelling Methodologies v1.4
Best Practice EAD Modelling Methodologies v1.4Best Practice EAD Modelling Methodologies v1.4
Best Practice EAD Modelling Methodologies v1.4
David Ong
 
Study on credit risk management of SBI Cochi
Study on credit risk management of SBI CochiStudy on credit risk management of SBI Cochi
Study on credit risk management of SBI Cochi
Sreelakshmi_S
 
A study of credit risk management in commercial banks
A study of credit risk management in commercial banksA study of credit risk management in commercial banks
A study of credit risk management in commercial banks
WriteKraft Dissertations
 
"Credit Risk-Probabilities Of Default"
"Credit Risk-Probabilities Of Default""Credit Risk-Probabilities Of Default"
"Credit Risk-Probabilities Of Default"
Arun Singh
 

What's hot (20)

Credit Risk Management ppt
Credit Risk Management pptCredit Risk Management ppt
Credit Risk Management ppt
 
Expert Judgement Credit Rating for SME & Commercial Customers
Expert Judgement Credit Rating for SME & Commercial CustomersExpert Judgement Credit Rating for SME & Commercial Customers
Expert Judgement Credit Rating for SME & Commercial Customers
 
EAD Model
EAD ModelEAD Model
EAD Model
 
Credit defaulter analysis
Credit defaulter analysisCredit defaulter analysis
Credit defaulter analysis
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients
 
Predicting Delinquency-Give me some credit
Predicting Delinquency-Give me some creditPredicting Delinquency-Give me some credit
Predicting Delinquency-Give me some credit
 
Credit scoring
Credit scoringCredit scoring
Credit scoring
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overview
 
project on credit-risk-management
project on credit-risk-managementproject on credit-risk-management
project on credit-risk-management
 
Safeguarding Bank Assets with an Early Warning System
Safeguarding Bank Assets with an Early Warning SystemSafeguarding Bank Assets with an Early Warning System
Safeguarding Bank Assets with an Early Warning System
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Credit Risk Management
Credit Risk ManagementCredit Risk Management
Credit Risk Management
 
Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1
 
Predicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning AlgorithmsPredicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning Algorithms
 
CREDIT APPRESIAL
CREDIT APPRESIALCREDIT APPRESIAL
CREDIT APPRESIAL
 
CREDIT RISK MANAGEMENT IN BANKING: A CASE FOR CREDIT FRIENDLINESS
CREDIT RISK MANAGEMENT IN BANKING: A CASE FOR CREDIT FRIENDLINESSCREDIT RISK MANAGEMENT IN BANKING: A CASE FOR CREDIT FRIENDLINESS
CREDIT RISK MANAGEMENT IN BANKING: A CASE FOR CREDIT FRIENDLINESS
 
Best Practice EAD Modelling Methodologies v1.4
Best Practice EAD Modelling Methodologies v1.4Best Practice EAD Modelling Methodologies v1.4
Best Practice EAD Modelling Methodologies v1.4
 
Study on credit risk management of SBI Cochi
Study on credit risk management of SBI CochiStudy on credit risk management of SBI Cochi
Study on credit risk management of SBI Cochi
 
A study of credit risk management in commercial banks
A study of credit risk management in commercial banksA study of credit risk management in commercial banks
A study of credit risk management in commercial banks
 
"Credit Risk-Probabilities Of Default"
"Credit Risk-Probabilities Of Default""Credit Risk-Probabilities Of Default"
"Credit Risk-Probabilities Of Default"
 

Similar to Credit risk scoring model final

Similar to Credit risk scoring model final (20)

Business and Data Analytics Collaborative April Meetup
Business and Data Analytics Collaborative April MeetupBusiness and Data Analytics Collaborative April Meetup
Business and Data Analytics Collaborative April Meetup
 
Maintaining Credit Quality in Banks and Credit Unions
Maintaining Credit Quality in Banks and Credit UnionsMaintaining Credit Quality in Banks and Credit Unions
Maintaining Credit Quality in Banks and Credit Unions
 
Forward-Looking ALLL: Computing Qualitative Adjustments
Forward-Looking ALLL: Computing Qualitative AdjustmentsForward-Looking ALLL: Computing Qualitative Adjustments
Forward-Looking ALLL: Computing Qualitative Adjustments
 
Choosing The Right Credit Decisioning Model
Choosing The Right Credit Decisioning ModelChoosing The Right Credit Decisioning Model
Choosing The Right Credit Decisioning Model
 
Moody's ---How Social Performance Impacts Financial Resilience and Default Pr...
Moody's ---How Social Performance Impacts Financial Resilience and Default Pr...Moody's ---How Social Performance Impacts Financial Resilience and Default Pr...
Moody's ---How Social Performance Impacts Financial Resilience and Default Pr...
 
Prediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom IndustryPrediction of customer propensity to churn - Telecom Industry
Prediction of customer propensity to churn - Telecom Industry
 
Evaluation of transport safety policies in commercial motorcycle operation in...
Evaluation of transport safety policies in commercial motorcycle operation in...Evaluation of transport safety policies in commercial motorcycle operation in...
Evaluation of transport safety policies in commercial motorcycle operation in...
 
Personal Loan Risk Assessment
Personal Loan Risk Assessment Personal Loan Risk Assessment
Personal Loan Risk Assessment
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
 
Barclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National FinalistBarclays - Case Study Competition | ISB | National Finalist
Barclays - Case Study Competition | ISB | National Finalist
 
Credit Risk and Monetary Pass-through. Evidence from Chile
Credit Risk and Monetary Pass-through. Evidence from ChileCredit Risk and Monetary Pass-through. Evidence from Chile
Credit Risk and Monetary Pass-through. Evidence from Chile
 
Reduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage IndustryReduction in customer complaints - Mortgage Industry
Reduction in customer complaints - Mortgage Industry
 
Loan Risk Assessment & Scoring Model
Loan Risk Assessment & Scoring ModelLoan Risk Assessment & Scoring Model
Loan Risk Assessment & Scoring Model
 
Asa wisconsin chapter april 2015 meeting presentation: residual values for ma...
Asa wisconsin chapter april 2015 meeting presentation: residual values for ma...Asa wisconsin chapter april 2015 meeting presentation: residual values for ma...
Asa wisconsin chapter april 2015 meeting presentation: residual values for ma...
 
RMCPWSM_GCM_2015
RMCPWSM_GCM_2015RMCPWSM_GCM_2015
RMCPWSM_GCM_2015
 
Cas rpm 2015 claim liability estimation
Cas rpm 2015   claim liability estimationCas rpm 2015   claim liability estimation
Cas rpm 2015 claim liability estimation
 
[DSC Adria 23] Mirjana Pejic Bach Data mining approach to internal fraud in a...
[DSC Adria 23] Mirjana Pejic Bach Data mining approach to internal fraud in a...[DSC Adria 23] Mirjana Pejic Bach Data mining approach to internal fraud in a...
[DSC Adria 23] Mirjana Pejic Bach Data mining approach to internal fraud in a...
 
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
What is Hierarchical Clustering and How Can an Organization Use it to Analyze...
 
Ac Sjzh92177
Ac Sjzh92177Ac Sjzh92177
Ac Sjzh92177
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 

More from Ritu Sarkar (9)

Google analytics
Google analyticsGoogle analytics
Google analytics
 
Candy score score
Candy score scoreCandy score score
Candy score score
 
Simulation model sortation system
Simulation model sortation systemSimulation model sortation system
Simulation model sortation system
 
La liga 2013 2014 analysis
La liga 2013 2014 analysisLa liga 2013 2014 analysis
La liga 2013 2014 analysis
 
Driver profile caused accident
Driver profile caused accidentDriver profile caused accident
Driver profile caused accident
 
Kaggel cab serivce
Kaggel cab serivceKaggel cab serivce
Kaggel cab serivce
 
Big Data solution for multi-national Bank
Big Data solution for multi-national BankBig Data solution for multi-national Bank
Big Data solution for multi-national Bank
 
Data mining to improve e-mail marketing
Data mining to improve e-mail marketing Data mining to improve e-mail marketing
Data mining to improve e-mail marketing
 
Best analytics tool
 Best analytics tool Best analytics tool
Best analytics tool
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict

Credit risk scoring model final

  • 2. Outline 1. Project Objective 2. Process Approach 3. Data Source and Variables 4. Data Analysis 5. Data Pre-processing 6. Exploratory Analysis 7. Model development i. Training the model ii. Validation 8. Conclusion & Limitations
  • 3. Project Objective To develop a prediction model to assess the credit risk of borrowers • Do all borrowers have an equal probability of default? • Is there a way to determine the risk of default before processing a credit request? • Can we classify customers into two groups, i.e., Risky and Non-Risky, based on the nature of their financial data? • Which key factors should be considered to assess the risk of lending to an individual, based on historic data?
  • 4. Process Approach (* Software used: Excel & SPSS)
      Business Understanding: 1. Develop a predictive model to assess the credit risk of borrowers 2. Develop a business understanding of the data, the relationships between variables and the data sources to be used
      Data Understanding: 1. Get data from relevant data sources 2. Explore the data for missing values, outliers and invalid data through descriptive statistics and visualization techniques 3. Understand the business relevance of outliers, missing values and invalid data, and formulate the approach to treat them accordingly
      Data Preparation: 1. Split the data for training and test 2. Clean up missing values, outliers and invalid data 3. Apply binning and imputation for outlier treatment 4. Bin independent variables as per business needs 5. Explore the data for patterns and test for collinearity
      Modeling: 1. Develop a logistic regression model to classify customers into two groups based on credit risk probability 2. Train the model using 80% of the training data
      Evaluation: 1. Validate the trained model using the remaining 20% of the training data 2. If satisfied with the accuracy percentages, proceed to testing using the test dataset; otherwise go back to the previous step (modeling) and train the model again
      Deployment: When satisfied with the test results, deploy the model to help the business take decisions based on the predictions given by the model
  • 5. Data Source and Variables • The data source is a dataset of 2,50,000 records taken from the Kaggle website. The dataset was split into two parts: 1,50,000 cases for training and validation, and the remaining 1,00,000 cases for testing the model. • Data dictionary for the variables in the dataset (variable name, type, description):
      SeriousDlqin2yrs (Y/N): Person experienced 90 days past due delinquency or worse
      RevolvingUtilizationOfUnsecuredLines (percentage): Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans, divided by the sum of credit limits
      age (integer): Age of borrower in years
      NumberOfTime30-59DaysPastDueNotWorse (integer): Number of times borrower has been 30-59 days past due but no worse in the last 2 years
      DebtRatio (percentage): Monthly debt payments, alimony, living costs divided by monthly gross income
      MonthlyIncome (real): Monthly income
      NumberOfOpenCreditLinesAndLoans (integer): Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards)
      NumberOfTimes90DaysLate (integer): Number of times borrower has been 90 days or more past due
      NumberRealEstateLoansOrLines (integer): Number of mortgage and real estate loans including home equity lines of credit
      NumberOfTime60-89DaysPastDueNotWorse (integer): Number of times borrower has been 60-89 days past due but no worse in the last 2 years
      NumberOfDependents (integer): Number of dependents in family excluding themselves (spouse, children etc.)
  • 7. Descriptive Statistics • There are 1,50,000 cases in the training dataset. • Of the 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed. • MonthlyIncome has a large number of missing values. NumberOfDependents also has some missing values. • There are many extreme values (outliers) in RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome, as indicated by their high standard deviations.
  • 8. Missing Value Analysis • About 2.6% of NumberOfDependents values are missing (less than 5%), hence these cases could be removed. • Around 20% of MonthlyIncome values are missing, which is quite high, so they need to be imputed.
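The decision rule on this slide (drop cases when fewer than 5% are affected, impute otherwise) can be sketched as follows. This is a minimal illustration, not the SPSS procedure the authors used; the function name `missing_value_action` is hypothetical, and only the 5% threshold and the two missing shares come from the slide.

```python
# Sketch of the missing-value rule described on the slide.
# Assumption: the function name and its interface are illustrative only.

def missing_value_action(missing_share, drop_threshold=0.05):
    """Return 'drop' when few cases are affected, otherwise 'impute'."""
    if not 0.0 <= missing_share <= 1.0:
        raise ValueError("missing_share must be a fraction between 0 and 1")
    return "drop" if missing_share < drop_threshold else "impute"

# Missing shares as reported on the slide
actions = {
    "NumberOfDependents": missing_value_action(0.026),  # about 2.6% missing
    "MonthlyIncome": missing_value_action(0.20),        # around 20% missing
}
print(actions)
```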
  • 10. Data Cleaning Steps
      Invalid data identified below to be removed in the Excel sheet: • Age variable: one case showing 0 • Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96 and 98, which indicate ‘Don’t know’ and ‘Refused to Say’. These cases are very few in number and common to all three variables.
      Data formatting in Excel: variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from General to Number format.
      Imputation in SPSS: • Imputation for missing values in MonthlyIncome • 5 imputations done using all independent variables, with the results of the 5th imputation taken for training
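The invalid-data rules above (age of 0, and the codes 96 and 98 in the three past-due counters) could be expressed as a simple row filter. The sketch below is an assumption-laden illustration in Python rather than the Excel steps the authors describe; the sample rows and the helper `is_valid` are hypothetical.

```python
# Sketch of the invalid-data filter described on the slide.
INVALID_CODES = {96, 98}  # 'Don't know' / 'Refused to Say', per the slide
PAST_DUE_COLUMNS = (
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
)

def is_valid(record):
    """Keep a record only if age is non-zero and no past-due counter is 96 or 98."""
    if record["age"] == 0:
        return False
    return all(record[col] not in INVALID_CODES for col in PAST_DUE_COLUMNS)

# Hypothetical rows, made up for illustration
sample = [
    {"age": 45, "NumberOfTime30-59DaysPastDueNotWorse": 1,
     "NumberOfTime60-89DaysPastDueNotWorse": 0, "NumberOfTimes90DaysLate": 0},
    {"age": 0, "NumberOfTime30-59DaysPastDueNotWorse": 0,
     "NumberOfTime60-89DaysPastDueNotWorse": 0, "NumberOfTimes90DaysLate": 0},
    {"age": 33, "NumberOfTime30-59DaysPastDueNotWorse": 98,
     "NumberOfTime60-89DaysPastDueNotWorse": 98, "NumberOfTimes90DaysLate": 98},
]
cleaned = [row for row in sample if is_valid(row)]
print(len(cleaned), "of", len(sample), "rows kept")
```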
  • 11. Descriptive Statistics After Data Cleaning • After data cleaning, the total number of cases is down to 145837 • Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be handled through binning
  • 12. Variable Binning Binning done for the following variables: • Age: Age Binning, containing bins for each age group • DebtRatio & RevolvingUtilizationOfUnsecuredLines: created variables DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with the cut-off values below • MonthlyIncome: variable MonthlyIncome_Binning with 5 equal width bins
      Age Group   Bin
      21-30       1
      31-40       2
      41-50       3
      51-60       4
      >60         5

      Group         Bin   Remark
      <=0.25        1     Good
      0.25 - 0.50   2     Low Risk
      > 0.50        3     High Risk
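The cut-offs in the two binning tables translate directly into small lookup functions. The sketch below is illustrative only (the authors did the binning in SPSS); the function names are hypothetical, and treating ages below 21 as bin 1 is an assumption, since the slide does not list such a bin.

```python
# Sketch of the age and ratio binning tables from the slide.

def age_bin(age):
    """Map age in years to bins 1-5 (21-30, 31-40, 41-50, 51-60, >60)."""
    # Assumption: ages below 21, which the slide does not list, fall into bin 1.
    if age <= 30:
        return 1
    if age <= 40:
        return 2
    if age <= 50:
        return 3
    if age <= 60:
        return 4
    return 5

RATIO_REMARKS = {1: "Good", 2: "Low Risk", 3: "High Risk"}

def ratio_bin(value):
    """Bin DebtRatio or RevolvingUtilizationOfUnsecuredLines by the slide's cut-offs."""
    if value <= 0.25:
        return 1
    if value <= 0.50:
        return 2
    return 3

print(age_bin(45), ratio_bin(0.62), RATIO_REMARKS[ratio_bin(0.62)])
```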
  • 14. Exploratory Analysis (Using SPSS) Delinquency over different categories a) Age
      Age_Binning * SeriousDlqin2yrs Crosstabulation (counts, then each group's share of the SeriousDlqin2yrs = 0 cases, of the = 1 cases, and of all cases)
      Age_Binning   Count 0   Count 1    Total   % of 0    % of 1    % of Total
      21 - 30          7374       940     8314    5.42%     9.68%     5.70%
      31 - 40         20562      2285    22847   15.11%    23.53%    15.67%
      41 - 50         31130      2828    33958   22.87%    29.12%    23.28%
      51 - 60         32334      2213    34547   23.75%    22.79%    23.69%
      60 +            44725      1446    46171   32.86%    14.89%    31.66%
      Total          136125      9712   145837  100.00%   100.00%   100.00%
      • The dependent variable is heavily imbalanced: only 9712 of 145837 cases have SeriousDlqin2yrs = 1. Sampling of the training dataset is required to remove bias in model development • The largest number of customers is in the 60+ age group • The 41-50 age group accounts for the largest share of delinquent cases (29.12%) and the 21-30 group for the smallest (9.68%). Within each group, however, the delinquency rate falls steadily with age, from about 11.3% (940 of 8314) for 21-30 to about 3.1% (1446 of 46171) for 60+
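The crosstabulation reports each age group's share of all delinquent cases, which is a different quantity from the delinquency rate within a group. A minimal sketch, using only the counts from the table above (the variable names are illustrative), computes both so they can be compared side by side:

```python
# Counts from the Age_Binning crosstab: (SeriousDlqin2yrs = 0, SeriousDlqin2yrs = 1)
age_counts = {
    "21 - 30": (7374, 940),
    "31 - 40": (20562, 2285),
    "41 - 50": (31130, 2828),
    "51 - 60": (32334, 2213),
    "60 +": (44725, 1446),
}

total_delinquent = sum(bad for _, bad in age_counts.values())

# Rate within a group: delinquent cases divided by all cases in that group
rate_within_group = {g: bad / (good + bad) for g, (good, bad) in age_counts.items()}

# Share of delinquents: the group's delinquent cases divided by all delinquent cases
share_of_delinquents = {g: bad / total_delinquent for g, (good, bad) in age_counts.items()}

for group in age_counts:
    print(f"{group:>8}  rate={rate_within_group[group]:.1%}  "
          f"share={share_of_delinquents[group]:.1%}")
```

The share is driven partly by how many customers fall in each group, so a large group can contribute many delinquent cases even when its rate is modest.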
  • 15. Exploratory Analysis (Contd.) b) Number of Dependents
      NumberOfDependents * SeriousDlqin2yrs Crosstabulation
      NumberOfDependents   Count 0   Count 1    Total   % of 0    % of 1    % of Total
      0                      81722      4992    86714   60.03%    51.40%    59.46%
      1                      24372      1921    26293   17.90%    19.78%    18.03%
      2                      17930      1571    19501   13.17%    16.18%    13.37%
      3                       8646       833     9479    6.35%     8.58%     6.50%
      4                       2564       296     2860    1.88%     3.05%     1.96%
      5                        677        68      745    0.50%     0.70%     0.51%
      6                        134        24      158    0.10%     0.25%     0.11%
      7                         46         5       51    0.03%     0.05%     0.03%
      8                         22         2       24    0.02%     0.02%     0.02%
      9                          5         0        5    0.00%     0.00%     0.00%
      10                         5         0        5    0.00%     0.00%     0.00%
      13                         1         0        1    0.00%     0.00%     0.00%
      20                         1         0        1    0.00%     0.00%     0.00%
      Total                 136125      9712   145837  100.00%   100.00%   100.00%
      • Around 60% of cases have 0 dependents; this group also has the highest delinquency count and the largest share of delinquent cases (51.40%) • Cases with more than 3 dependents make up only about 2.6% of the data
  • 16. Exploratory Analysis (Contd.) c) Debt Ratio
      DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation
      DebtRatio (Binned)   Count 0   Count 1    Total   % of 0    % of 1    % of Total
      <= 0.25                24825      1472    26297   36.47%    30.31%    36.06%
      0.26 - 0.50            19181      1256    20437   28.18%    25.86%    28.03%
      0.51+                  24057      2128    26185   35.35%    43.82%    35.91%
      Total                  68063      4856    72919  100.00%   100.00%   100.00%
      • Around 44% of delinquent cases come from the group with DebtRatio > 0.5
      d) RevolvingUtilizationOfUnsecuredLines
      RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation
      RevolvingUtilization (Binned)   Count 0   Count 1    Total   % of 0    % of 1    % of Total
      <= 0.25                           41954       912    42866   61.64%    18.78%    58.79%
      0.26 - 0.50                        9680       573    10253   14.22%    11.80%    14.06%
      0.51+                             16429      3371    19800   24.14%    69.42%    27.15%
      Total                             68063      4856    72919  100.00%   100.00%   100.00%
      • Around 69% of delinquent cases come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5
  • 17. Exploratory Analysis (Contd.) e) Monthly Income
      MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation
      MonthlyIncome (Binned)   Count 0   Count 1    Total   % of 0    % of 1    % of Total
      <= 3100.00                 26699      2494    29193   19.61%    25.68%    20.02%
      3100.01 - 5000.00          29083      2518    31601   21.36%    25.93%    21.67%
      5000.01 - 7083.00          25214      1766    26980   18.52%    18.18%    18.50%
      7083.01 - 10823.00         27435      1461    28896   20.15%    15.04%    19.81%
      10823.01+                  27694      1473    29167   20.34%    15.17%    20.00%
      Total                     136125      9712   145837  100.00%   100.00%   100.00%
      • The two lowest income bins, which hold about 40% of the cases, account for more than 50% of defaulters (25.68% + 25.93%) • The other 3 groups each account for a broadly similar share of defaulters (15% to 18%)
  • 19. Exploratory Analysis (Contd.) All parameters below show a similar pattern: low income ranges are associated with high values of the debt indicators i) RevolvingUtilizationOfUnsecuredLines, ii) DebtRatio, iii) NumberOfTime30-59DaysPastDueNotWorse, iv) NumberOfTimes90DaysLate, v) NumberOfTime60-89DaysPastDueNotWorse, vi) NumberOfOpenCreditLinesAndLoans, vii) NumberRealEstateLoansOrLines
  • 20. Collinearity Diagnostics • Sample collinearity diagnostic results for Age vs. the other 9 independent variables are shown here • Similar diagnostics were performed for each of the 10 variables against the other variables • The Condition Index was always less than 15, indicating no collinearity between the independent variables
  • 22. Logistic Regression Model
      • The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0: 1 indicates risk of defaulting, 0 indicates no risk
      • As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7% of the total, a 50:50 stratified sampling approach is followed to build the model
      • The pre-processed training dataset is used to draw samples for training and validation of the model
      • 80% random samples are drawn from the training dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, and used for developing and training the model
      • 20% random samples are drawn from the same dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, and used for validation
      • The final model is tested using the test dataset given
      • Logistic regression models were developed and compared using three different approaches:
          • With binned variables (Model 1)
          • Binned as in Model 1, but with missing data binned into a separate category instead of clean up/imputation, wherever applicable (Model 2)
          • Without binning, using the variables directly (Model 3)
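One possible reading of the 50:50 stratified sampling with an 80/20 split is sketched below: the majority class is undersampled to the size of the minority class, and each class is then split 80/20 so both the training and validation samples stay balanced. The function `balanced_split`, the seed and the toy data are assumptions for illustration; the authors drew their samples in SPSS and the slide does not give the exact procedure.

```python
import random

def balanced_split(records, label_key="SeriousDlqin2yrs", train_frac=0.8, seed=0):
    """Undersample to a 50:50 class mix, then split each class into train/validation."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]

    # Draw the same number of cases from each class (size of the smaller class)
    n = min(len(positives), len(negatives))
    positives = rng.sample(positives, n)
    negatives = rng.sample(negatives, n)

    # Split each class separately so both parts keep the 50:50 proportion
    cut = int(n * train_frac)
    train = positives[:cut] + negatives[:cut]
    validation = positives[cut:] + negatives[cut:]
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation

# Toy data with roughly the imbalance reported on the slide (about 6.7% positives)
toy = [{"id": i, "SeriousDlqin2yrs": 1 if i < 67 else 0} for i in range(1000)]
train, validation = balanced_split(toy)
print(len(train), len(validation))
```

Undersampling discards most of the majority class, which is why the training samples quoted later in the deck (a few thousand records per class) are much smaller than the cleaned dataset.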
  • 23. MODEL 1 – WITH BINNING
  • 24. Model 1 – With binning • The model has been developed considering business needs, and the bins have therefore been created using business cut-offs. • In the current model, cases with missing or invalid values for NumberOfDependents, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and missing values in MonthlyIncome have been imputed. • RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, so bins have been created for them. Bins have been created for the Age variable as well. • Dummy variables were created for the categories in the binned variables, clubbing insignificant bins together for better control of the model. • The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and 4500 with SeriousDlqin2yrs = 0). • The model comprises 10 variables, including 4 dummy variables.
  • 25. Model 1 – Output • The logit function equation for the model is:
      logit = -0.595
              + 0.597 * NumberOfTime3059DaysPastDueNotWorse
              + 1.029 * NumberOfTimes90DaysLate
              + 0.072 * NumberRealEstateLoansOrLines
              + 0.862 * NumberOfTime6089DaysPastDueNotWorse
              + 0.030 * NumberOfOpenCreditLinesAndLoans
              - 0.025 * Age
              + 0.825 * RU_0_.25(1)
              + 0.689 * RU_0(1)
              - 0.783 * RU_GT_.5(1)
              + 0.129 * DebtRatio_GT0.25_0.5(1)
      • A cut-off value of 0.5 gave optimal results
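To show how the logit equation and the 0.5 cut-off turn into a classification, the sketch below plugs the Model 1 coefficients into the standard logistic function p = 1 / (1 + exp(-logit)). It assumes that the logit models the probability of SeriousDlqin2yrs = 1 and that each dummy variable is 1 when the customer falls in the named range; the slide does not state the SPSS coding, and the function names and example applicant are hypothetical.

```python
import math

MODEL1_INTERCEPT = -0.595
MODEL1_COEFFICIENTS = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    "RU_0_.25": 0.825,              # dummy: 0 < RU < 0.25
    "RU_0": 0.689,                  # dummy: RU = 0
    "RU_GT_.5": -0.783,             # dummy: RU >= 0.5
    "DebtRatio_GT0.25_0.5": 0.129,  # dummy: 0.25 <= DebtRatio < 0.5
}

def model1_probability(features):
    """Logistic transform of the Model 1 logit; missing features count as 0."""
    logit = MODEL1_INTERCEPT + sum(
        coef * features.get(name, 0) for name, coef in MODEL1_COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))

def model1_class(features, cutoff=0.5):
    """Return 1 (risky) when the predicted probability reaches the cut-off."""
    return 1 if model1_probability(features) >= cutoff else 0

# Hypothetical applicant: 40 years old, twice 30-59 days late, once 90+ days late
applicant = {
    "Age": 40,
    "NumberOfTime3059DaysPastDueNotWorse": 2,
    "NumberOfTimes90DaysLate": 1,
}
print(model1_class(applicant))
```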
  • 26. Model 1 - Variables Used
      • Variables used: Age; NumberOfTime3059DaysPastDueNotWorse; NumberOfTime6089DaysPastDueNotWorse; NumberOfTimes90DaysLate; NumberOfOpenCreditLinesAndLoans; NumberRealEstateLoansOrLines; DebtRatio (one dummy variable for the range DebtRatio >= 0.25 & < 0.5); RevolvingUtilizationOfUnsecuredLines (3 dummy variables: RU_0 where RU = 0, RU_0_.25 where RU > 0 but < 0.25, and RU_GT_.5 where RU >= 0.5)
      • Observations:
          • MonthlyIncome was a significant variable but had a beta coefficient of 0 and was therefore dropped from the model.
          • MonthlyIncome and DebtRatio were affecting each other.
          • RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated.
          • Bins were created for the Age variable, but all the bins contributed equally to the model, so the Age variable was used as such.
          • NumberOfDependents was initially thought to be a significant variable but turned out to be insignificant. Bins were created for NumberOfDependents, but the bins were insignificant too.
  • 27. Model 1 - Validation • The developed model was validated on a non-stratified random sample of 40% of the data (29168 records). • Overall accuracy: 78.62%; misclassification rate: 21.38% • Prediction accuracy for Risky (= 1) is 75.9%
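The three validation figures are related through the confusion matrix: overall accuracy and misclassification rate sum to 100% (78.62% + 21.38%), while the prediction accuracy for Risky is the share of actual SeriousDlqin2yrs = 1 cases that the model flags. The slide does not give the confusion matrix itself, so the counts in this sketch are hypothetical and serve only to show the arithmetic.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute validation metrics from confusion-matrix counts (1 = Risky)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        # Share of actual Risky cases predicted as Risky
        "risky_accuracy": tp / (tp + fn),
    }

# Hypothetical counts, NOT the values behind the slide's figures
metrics = classification_metrics(tp=150, fp=400, tn=1400, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

On a non-stratified sample most cases are non-risky, so overall accuracy is dominated by how well the model handles the 0 class; the Risky accuracy is the figure that reflects how many defaulters are caught.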
  • 28. Model 1 – Pros and Cons
      • 17% of the data had missing values imputed and only 2% was removed, so data loss is minimal.
      • The model has been developed taking into consideration widely used business cut-offs and significant parameters.
      • Since the model has been built on data where missing values were treated, its accuracy may drop on data where missing values are present.
      • Analyzing the top 10% (customers who are prone to default):
          • 67.4% of defaulters are in the age group 30-50
          • 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5
          • 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
          • 70% of the defaulters had Monthly Income less than or equal to 7466 USD, and 73.3% of the defaulters did not have any dependents.
      • Analyzing the bottom 10% (customers who are safe):
          • 80% of non-defaulters are more than 40 years of age.
          • 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5
          • 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
          • 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD, and 50.4% of the non-defaulters did not have any dependents.
  • 29. MODEL 2 – CONSIDERING MISSING VALUES
  • 30. Model 2 – Considering Missing Values • Missing values have not been imputed here; instead, an extra category has been added to the binned variables so that a missing value is treated as another category. (Example: NoOfDependents_Binned shown below) • Selection of variables has been based on the B, Exp(B) and Sig values • Optimal binning has been used, based on the SeriousDlqin2yrs variable.
  • 31. Model 2 – Output • Final model:
      logit = 1.442
              + 1.311 * Age_1 + 1.107 * Age_2 + 0.898 * Age_3 + 0.479 * Age_4
              + 1.802 * NoOf30_1 + 2.971 * NoOf30_2 + 3.445 * NoOf30_3 + 3.858 * NoOf30_4 + 4.001 * NoOf30_5
              - 1.784 * NoOf60_1 - 0.362 * NoOf60_2
              - 3.125 * NoOf90_1 - 1.311 * NoOf90_2 - 0.549 * NoOf90_3
      • Training set: stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another 4000 with SeriousDlqin2yrs = 0 • A cut-off value of 0.4 gave optimal results
  • 32. Model 2 - Variables Used
      • Variables used: Age_OptimalBin; NumberOfTime3059DaysPastDueNotWorse_OptimalBin; NumberOfTime6089DaysPastDueNotWorse_OptimalBin; NumberOfTimes90DaysLate_OptimalBin
      • Possible reasons why the other variables are not significant:
          • Age has a non-linear relationship with MonthlyIncome
          • The other 3 variables in the equation are indicators of the number of defaults committed by the customer, which is related to NumberOfOpenCreditLinesAndLoans and RevolvingUtilizationOfUnsecuredLines
          • MonthlyIncome affects the DebtRatio
  • 33. Model 2 - Validation • Multiple test runs were performed on different sample sizes • The validation table shown is for a random sample of 90000 records. • Overall accuracy: 72.62%; misclassification: 27.37% • Risky (= 1) prediction accuracy of 75.1%
  • 34. Model 2 – Pros and Cons
      • Capable of handling missing values (including the codes 98 and 96)
      • Intermediate processing required is minimal (only binning required)
      • The model uses only 4 variables
      • Optimal binning is used, not the industry standard binning
      • Other insights:
          • Analyzing the top 10% (most risky customer segment): 84% of the customers are below 56 years of age; 72% have 1 or more past 30 days defaults
          • Analyzing the bottom 10% (safest customer segment): all of them are 64 years of age or above; almost all of them have 0 defaults under any case
  • 35. MODEL 3 – USING VARIABLES DIRECTLY
  • 36. Model 3 – Using Variables Directly • The final model has the following equation:
      logit = 0.754
              + 0.031 * Age
              + 0.766 * NumberOfTime3059DaysPastDueNotWorse
              + 1.179 * NumberOfTime6089DaysPastDueNotWorse
              + 1.417 * NumberOfTimes90DaysLate
      • This model is the simplest, but business considerations were not accounted for, hence robustness on deployment cannot be assured • It cannot handle missing values
  • 38. Conclusion & Limitations • Model 1 and Model 2 give similar accuracy levels. Model 3 is not recommended. The choice of final model is left to the business, based on the pros and cons mentioned • These models need to be further validated for scalability and robustness • The test dataset given did not have delinquency values; hence, after validation with 20% random samples from the training dataset, no further accuracy check could be performed on a totally new set of data using the test dataset. • The assumptions made when binning financial variables could change the significance of different variables in the final model. This aspect needs to be validated with the business