2. Summary
• Credit risk – The probability of default
• Data Cleansing
• Logistic Regression
• Linear Discriminant Analysis
• Comparison of the LR and LDA
• Factor Analysis
3. Credit Risk
What is it?
• The risk of default on a debt that may arise from a borrower failing to make
required payments.
Impact on the lender?
• Lost principal and interest, disruption to cash flows, and increased collection
costs.
How to estimate it?
• Credit risk arises from the potential that a borrower or counterparty will fail to perform
on an obligation
4. Sources of risk?
• For most banks, loans are the largest and most obvious source of credit risk.
• There are other sources of credit risk both on and off the balance sheet, including
letters of credit, unfunded loan commitments, and lines of credit.
• Other products, activities, and services that expose a bank to credit risk are credit
derivatives, foreign exchange, and cash management services.
5. Credit Scoring vs Risk
Estimation of risk?
• The risk posed by the borrower is inversely related to the credit score: the higher the score, the lower the risk.
• A credit score is a statistically derived numeric expression of a person's creditworthiness that is used by
lenders to assess the likelihood that a person will repay his or her debts.
• A credit score (range 300-850) is based on, among other things, a person's past credit history.
6. Credit Scoring
• Consumers can typically keep their credit scores high by maintaining a long history of
always paying their bills on time and not having too much debt.
• The FICO score is the most widely used credit scoring system.
• A credit score is primarily based on credit report information, typically sourced from
credit bureaus.
16. Data Cleaning
• Monthly Income
• Histogram of Monthly Income after filling in the missing values using Multiple Linear Regression
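The slide does not say which predictors were used or how the regression was fitted, so the following is only a minimal sketch of regression-based imputation. The function name, the two toy predictors, and the income figures are all hypothetical; NumPy's least-squares solver stands in for whatever tool the authors actually ran.

```python
import numpy as np

def impute_with_regression(predictors, target):
    """Fill NaN entries of `target` with predictions from a linear
    regression fitted on the rows where `target` is known."""
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(target, dtype=float)
    known = ~np.isnan(y)

    # Design matrix with an intercept column followed by the predictors.
    design = np.column_stack([np.ones(len(X)), X])

    # Ordinary least squares on the rows with a known target value.
    coef, *_ = np.linalg.lstsq(design[known], y[known], rcond=None)

    filled = y.copy()
    filled[~known] = design[~known] @ coef
    return filled

# Hypothetical toy data: [age, number of open credit lines] per borrower.
predictors = [[30, 2], [40, 5], [50, 3], [35, 4], [45, 1], [60, 2]]
monthly_income = [4400, 6000, 6600, 5300, float("nan"), float("nan")]

filled_income = impute_with_regression(predictors, monthly_income)
print(filled_income)
```

Known values are left untouched; only the missing entries are replaced by the fitted predictions.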
17. Data Cleaning
• Debt Ratio
• We found that the Debt Ratio was extremely high in many cases.
• Upon closer inspection, we found that the high Debt Ratio values occurred in records
whose Monthly Income was unknown.
• From this we inferred that, in those records, the Debt Ratio field most probably holds the debt amount itself.
18. Data Cleaning
• Debt Ratio
• We replaced each high Debt Ratio value with that value divided by the predicted
Monthly Income.
• The new mean after replacement was 0.67.
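The correction described above can be sketched as follows. The slides do not state what counted as a "high" value, so the `high_cutoff` parameter, the record layout, and the field names are assumptions made purely for illustration.

```python
def correct_debt_ratio(records, high_cutoff=1.0):
    """Divide suspiciously high debt ratios by the predicted income.

    Only records with an unknown monthly income are touched, since those
    are the ones where the field appears to hold a raw debt amount.
    `high_cutoff` is a hypothetical threshold, not taken from the slides.
    """
    corrected = []
    for record in records:
        ratio = record["debt_ratio"]
        if record["monthly_income"] is None and ratio > high_cutoff:
            ratio = ratio / record["predicted_income"]
        corrected.append(ratio)
    return corrected

# Hypothetical records: the second one has no reported income.
records = [
    {"debt_ratio": 0.35, "monthly_income": 5000.0, "predicted_income": 5000.0},
    {"debt_ratio": 2400.0, "monthly_income": None, "predicted_income": 6000.0},
]

fixed = correct_debt_ratio(records)
print(fixed)
```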
20. Data Modelling
• Split the dataset into Training data (70%) and Test Data (30%).
• Computed the correlation matrix among the independent variables.
• The variables had very little correlation among themselves.
• Ran Logistic Regression by using Stepwise selection.
• Ran Linear Discriminant Analysis.
• Compared both the models by measuring their accuracy of prediction.
• Ran both models on significant Factors using Factor Analysis.
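The first two steps above can be sketched with the standard library alone. This is an illustration under assumptions: the slides do not say how the split was randomised, so the shuffle and the `seed` parameter are hypothetical, and the Pearson coefficient is assumed to be the correlation measure used.

```python
import random
from math import sqrt

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(train_fraction * len(rows)))
    return rows[:cut], rows[cut:]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

train, test = train_test_split(range(10))
print(len(train), len(test))
```

Applying `pearson` to every pair of independent variables gives the entries of the correlation matrix; values near 0 correspond to the low correlation reported on the slide.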
22. Logistic Regression
• Ran Logistic Regression separately for each variable.
• Computed the ROC curve for each variable and compared the AUC value.
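As a rough illustration of comparing variables by AUC, the sketch below uses the pairwise (rank-based) definition: the AUC is the fraction of (defaulter, non-defaulter) pairs in which the defaulter gets the higher score. The scores are hypothetical stand-ins for the predicted probabilities of two single-variable models; none of the numbers come from the slides.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (default) or 0.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]

# Hypothetical predicted probabilities from two single-variable models.
scores_by_variable = {
    "variable_a": [0.1, 0.4, 0.35, 0.8],
    "variable_b": [0.6, 0.2, 0.3, 0.1],
}

auc_by_variable = {
    name: auc(scores, labels) for name, scores in scores_by_variable.items()
}
ranking = sorted(auc_by_variable, key=auc_by_variable.get, reverse=True)
print(auc_by_variable, ranking)
```

An AUC below 0.5, as for `variable_b` here, means the scores rank defaulters lower than non-defaulters more often than not.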
23. Stepwise Selection
• Overall Model was Significant.
• All the variables were included in the model.
• The model built on the Training data was tested on the Test data.
• Probability of default > 0.7 was coded as 1, and Probability of default < 0.7 was coded as 0.
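The coding rule above can be sketched as below. The intercept and coefficients are hypothetical placeholders, not the fitted model from the slides, and a probability of exactly 0.7 (which the slide leaves undefined) is assumed here to be coded as 0.

```python
from math import exp

def predict_probability(intercept, coefficients, features):
    """Logistic-regression probability for one record."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-z))

def code_default(probabilities, cutoff=0.7):
    """Code a predicted probability above the cutoff as 1, otherwise 0."""
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical model and records, for illustration only.
intercept, coefficients = -1.0, [2.0, 0.5]
records = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

probabilities = [predict_probability(intercept, coefficients, r) for r in records]
predicted_class = code_default(probabilities)
print(probabilities, predicted_class)
```

A cutoff as high as 0.7 flags only the most confident predictions as defaults, which tends to lower the true positive rate in exchange for fewer false alarms.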
24. Logistic Regression on Test Data
Confusion Matrix (predicted values vs. actual values)
Overall Accuracy = (TP+TN) / (TP+TN+FP+FN) = (41374+291)/(41374+291+175+2661) = 93.6%
True Positive Rate = TP / (TP+FN) = 291/(291+2661) ≈ 9.86%
True Negative Rate = TN / (TN+FP) = 41374/(41374+175) ≈ 99.6%
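The three rates can be recomputed from the four counts in the accuracy formula. The confusion matrix itself did not survive in this transcript, so the assignment of counts to cells below (TP = 291, FN = 2661, TN = 41374, FP = 175) is an inference: it is the assignment that reproduces the reported rates up to rounding.

```python
def classification_rates(tp, fn, tn, fp):
    """Accuracy, true positive rate and true negative rate from counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }

# Cell assignment inferred from the reported rates (see note above).
rates = classification_rates(tp=291, fn=2661, tn=41374, fp=175)

for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

The gap between the high accuracy and the low true positive rate reflects the class imbalance: most records are non-defaults, so a model can be accurate overall while missing most defaulters.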
32. Conclusion
• 80% of the time was spent on data cleaning.
• Logistic Regression gives better results than LDA when the data is not normally distributed.
• Factors can be grouped for a logical understanding, with Debt Ratio and age explaining high
variance.
The ROC curve measures how well a binary classifier performs. It plots the true positive rate (how often actual positives are correctly flagged) against the false positive rate (how often actual negatives are wrongly flagged) as the decision threshold varies. The diagonal line in the middle represents a classifier that guesses at random, which is right 50% of the time and wrong the other 50% of the time. Here we have the ROC curve for Monthly Income. From this ROC curve, we can calculate the area under the curve (AUC), in this case 0.8508. The higher the AUC value, the better the model.
On the right, we have the AUC values for all the variables. Monthly Income has the best AUC value, 0.8508. Most of the other variables fall below 0.7, and Debt Ratio does worse than 0.5, that is, worse than random guessing.
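To make the construction of the curve concrete, here is a minimal sketch that sweeps the decision threshold, records the (false positive rate, true positive rate) points, and integrates them with the trapezoid rule. The scores and labels are a hypothetical toy example, not the Monthly Income data.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical scores and true labels (1 = default).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

curve = roc_points(scores, labels)
print(curve, area_under(curve))

# The random-guess diagonal runs from (0, 0) to (1, 1) and has area 0.5.
diagonal_area = area_under([(0.0, 0.0), (1.0, 1.0)])
```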