Home Credit Default Risk
Can you predict how capable
each applicant is of repaying
a loan?
Cha ho seong
CONTENTS
01 About Home Credit
02 Data
03 Data analysis
04 Model evaluation
05 Model improvements
06 Conclusion
About Home Credit
01
01 About Home Credit
Founded in 1997,
Home Credit Group is an international consumer finance provider with operations in 11 countries.
They focus on responsible lending primarily to people with little or no credit history.
Their services are simple, easy and fast.
Data
02
02 Dataset
• The dataset contains categorical, numeric, and character data.
• There are many variables.
02 Label data
An imbalanced class problem exists!
02 EDA – integer type
Many columns contain extreme values.
02 EDA – integer type
• HOUR_APPR_PROCESS_START and TARGET
• CNT_CHILDREN and TARGET
The not-repaid rate appears to be higher for loans processed in the morning hours.
As CNT_CHILDREN increases, the not-repaid ratio tends to increase.
02 EDA – character type
• FLAG_OWN_CAR and TARGET
• FLAG_OWN_REALTY and TARGET
Ownership of a car seems to have more impact on the repayment rate than property ownership.
• NAME_INCOME_TYPE and TARGET
- Maternity leave
- Unemployed
• Economic variables
02 EDA – character type
• Demographic variables
• CODE_GENDER and TARGET
• NAME_FAMILY_STATUS and TARGET
• NAME_TYPE_SUITE and TARGET
02 EDA – character type
• Social status variables
• NAME_EDUCATION_TYPE and TARGET
• OCCUPATION_TYPE and TARGET
• etc.
• EMERGENCYSTATE_MODE and TARGET
• NAME_CONTRACT_TYPE and TARGET
On Cash loans, the repaid rate is much higher.
02 EDA – numeric type
• Correlation between variables
Some variables are highly correlated.
02 NULL values
> mean(is.na(application))
[1] 0.2454355
On average, 24.5% of the values in application are missing.
We assume that variables with more than 50% null values carry too little information to represent the data, so we decided to exclude those variables.
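The deck's analysis is in R; as a minimal Python sketch of the same rule (the column names and values below are hypothetical stand-ins for the application table, not the real data):

```python
# Drop any column whose fraction of missing values exceeds 50%.
application = {
    "AMT_CREDIT":   [406597.5, 1293502.5, None, 312682.5],  # 25% missing
    "OWN_CAR_AGE":  [None, None, None, 9.0],                # 75% missing
    "CNT_CHILDREN": [0, 0, 0, 1],                           #  0% missing
}

def na_fraction(col):
    # Fraction of entries in a column that are missing (None).
    return sum(v is None for v in col) / len(col)

kept = {name: col for name, col in application.items()
        if na_fraction(col) <= 0.5}
# OWN_CAR_AGE is excluded; the other two columns survive.
```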
Data analysis
03
03 Data analysis
Binary logistic regression
Which model would be appropriate for credit default prediction?
03 Data analysis
Why binary logistic regression?
• Few constraints on the explanatory variables
• This is a classification problem
03 Data analysis
1. Split data into Train, validation, test
(60%) (20%) (20%)
2. Model selection
Model 1: all variables used
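Step 1 above can be sketched as follows (an illustrative Python version of the 60/20/20 split; the deck's actual split was done in R):

```python
import random

def split_60_20_20(rows, seed=42):
    # Shuffle once with a fixed seed, then slice into
    # train / validation / test in a 60/20/20 ratio.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = split_60_20_20(list(range(100)))
```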
03 Data analysis
Multicollinearity is suspected
Model evaluation
04
04 Model evaluation
Prediction
• Confusion matrix
Accuracy : 0.9248
High estimated accuracy!
But, since there is a class imbalance problem, it is necessary to check measurement criteria other than accuracy.
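To see why high accuracy is misleading here: with the class counts from the train table shown later in the deck, a model that always predicts "repaid" is already about 93% accurate without learning anything.

```python
# Accuracy of the trivial majority-class baseline on the
# train counts reported later in the deck (12521 repaid, 950 not).
n_repaid, n_default = 12521, 950
baseline_accuracy = n_repaid / (n_repaid + n_default)
# baseline_accuracy is roughly 0.93, close to the model's 0.9248.
```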
04 Model evaluation
Prediction
• Confusion matrix
Sensitivity : 0.0101801
"Sensitivity" measures the rate of positive samples that are accurately classified:
Sensitivity = True Positives / Positives = 13 / 1277 ≈ 0.0102
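The sensitivity figure above, recomputed from the confusion-matrix counts on the slide:

```python
# 13 true positives out of 1277 actual positives (defaults)
# in the evaluation set.
tp, positives = 13, 1277
sensitivity = tp / positives
# Matches the slide's reported value of 0.0101801.
```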
04 Model evaluation
Prediction
• ROC curve and AUC
AUC : 0.6956109
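The AUC reported above can be understood via its rank interpretation; a minimal sketch (not the deck's R code) using the Mann-Whitney formulation:

```python
def auc(labels, scores):
    # AUC = probability that a randomly chosen positive sample
    # outscores a randomly chosen negative one (ties count 1/2).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Tiny hypothetical example: 2 positives, 2 negatives.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```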
Model
improvements
05
05 Model improvements
Task 1. Imbalanced class problem
Task 2. Multicollinearity
→ Model 2
05 Model improvements
Task 1. Imbalanced class problem
Try "SMOTE", another sampling method.
It considers each sample's k nearest neighbors (in feature space) and creates synthetic data points between them.
> table(train$TARGET)
    0     1
12521   950
> new_train <- SMOTE(TARGET ~ ., train, perc.over = 600, perc.under = 140)
> table(new_train$TARGET)
   0    1
7979 6650
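The slide uses the SMOTE implementation from R's DMwR package. As a minimal Python sketch of the core idea only (not the DMwR algorithm, and with hypothetical toy data):

```python
import math
import random

def smote(minority, k=5, n_new=100, seed=0):
    # For each synthetic point: pick a random minority sample,
    # choose one of its k nearest minority neighbours, and
    # interpolate between the two in feature space.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class: four points in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_new=10)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the region occupied by the minority class.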
05 Model improvements
Task 2. Multicollinearity
AMT_GOODS_PRICE, AMT_CREDIT : 0.9868
AMT_ANNUITY, AMT_GOODS_PRICE : 0.7643
YEARS_BEGINEXPLUATATION_AVG, YEARS_BEGINEXPLUATATION_MODE : 0.9482
YEARS_BEGINEXPLUATATION_AVG, YEARS_BEGINEXPLUATATION_MEDI : 0.9844
FLOORSMAX_MODE, FLOORSMAX_AVG : 0.9858
FLOORSMAX_MEDI, FLOORSMAX_AVG : 0.9971
YEARS_BEGINEXPLUATATION_MEDI, YEARS_BEGINEXPLUATATION_MODE : 0.9242
FLOORSMAX_MODE, FLOORSMAX_MEDI : 0.9883
FLOORSMAX_MODE, TOTALAREA_MODE : 0.6232
OBS_60_CNT_SOCIAL_CIRCLE, OBS_30_CNT_SOCIAL_CIRCLE : 0.9987
DEF_60_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE : 0.8658
Excluding highly correlated independent variables
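A greedy sketch of that exclusion step in Python (the 0.75 cutoff is an assumption for illustration; the deck does not state the threshold it used, and only a subset of the pairs above is shown):

```python
def drop_correlated(corr_pairs, threshold=0.75):
    # Scan the reported (var_a, var_b, correlation) pairs and drop
    # the second variable of any pair above the threshold whose
    # members have both survived so far.
    dropped = set()
    for a, b, r in corr_pairs:
        if abs(r) > threshold and a not in dropped and b not in dropped:
            dropped.add(b)
    return dropped

pairs = [
    ("AMT_GOODS_PRICE", "AMT_CREDIT", 0.9868),
    ("AMT_ANNUITY", "AMT_GOODS_PRICE", 0.7643),
    ("OBS_60_CNT_SOCIAL_CIRCLE", "OBS_30_CNT_SOCIAL_CIRCLE", 0.9987),
]
to_drop = drop_correlated(pairs)
```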
05 Model improvements
Binary logistic regression maximizes the likelihood, which is equivalent to minimizing the deviance (the sum of squared deviance residuals); it does not optimize classification accuracy directly.
As the following slides show, this is one factor behind the drop in accuracy.
05 Model improvements
Accuracy : 0.9248 -> 0.825 (decrease)
Sensitivity : 0.01 -> 0.33 (increase!)
05 Model improvements
Prediction
• ROC curve and AUC
AUC : 0.6680598
Conclusion
06
06 Conclusion
Task 1. Imbalanced class problem -> SMOTE
Task 2. Multicollinearity -> exclude variables with high correlation
Accuracy decreased
Sensitivity increased
AUC decreased
Despite the decrease in the other criteria, this is why sensitivity matters.
Clearly there is no absolute criterion; sensitivity is often emphasized when the cost of a missed prediction is high.
06 Conclusion
Further study…
- To improve accuracy, apply other prediction models (decision tree, ANN, SVM)
- Handle NA values without excluding them
Thank you
