Exploratory Data Analysis For Credit Risk Assessment
1. THE CREDIT RISK ANALYTICS
EDA Case Study By,
• Mr. Prathmesh Pise
• Mr. Vishal Patil
2. CONTENTS
Problem statement
Flow Chart
Importing and Cleaning1
Importing and Cleaning2
Approach
Data Visualization
Significant Insights
3. PROBLEM STATEMENT:
1. The aim is to identify patterns that indicate whether a client had difficulty paying their installments. This
will help the bank take actions such as:
• Denying the loan
• Reducing the amount of the loan
• Lending (to risky applicants) at a higher interest rate, etc.
2. Identifying the correlation between the independent variables and the target variable
3. Ensuring that consumers capable of repaying the loan are not rejected
5. IMPORTING AND CLEANING1:
1. Imported the pandas, matplotlib and seaborn libraries for loading and
visualizing the data
2. The target variable is a flag indicating whether a client pays installments on time or not
3. Two data frames were created from CSV files, namely:
• Application data - contains all the information about the client at the time of application
• Previous application data - contains information about the client’s previous loans
4. Dropped unnecessary columns, such as those describing the dimensions of the client’s house
5. Achieved a 40% reduction in memory usage by changing the data types of categorical
variables from object to category.
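The object-to-category conversion described in step 5 can be sketched roughly as follows. This is a minimal illustration on a small synthetic frame, not the authors' code: the helper name to_category and the demo values are assumptions, and the saving on the real data will differ from what this toy frame shows.

```python
import pandas as pd


def to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every object-dtype column converted to category."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype("category")
    return out


# Synthetic stand-in for the application data (values are made up).
# The text columns are created with dtype=object on purpose, to mirror
# the "object to category" change described on the slide.
demo = pd.DataFrame({
    "NAME_CONTRACT_TYPE": pd.Series(["Cash loans", "Revolving loans"] * 5000, dtype=object),
    "CODE_GENDER": pd.Series(["F", "M"] * 5000, dtype=object),
    "AMT_INCOME_TOTAL": [150000.0, 200000.0] * 5000,
})

before = demo.memory_usage(deep=True).sum()   # bytes, including string payloads
compact = to_category(demo)
after = compact.memory_usage(deep=True).sum()

print(f"memory reduced by {100 * (1 - after / before):.1f}%")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest for columns with few distinct values.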
6. IMPORTING AND CLEANING2:
1. Imported the required data set:
• Previous application data set as previous_app
2. Cleaned the data by removing columns that were less significant for the
analysis and prone to containing erroneous data, namely:
• WEEKDAY_APPR_PROCESS_START
• HOUR_APPR_PROCESS_START, etc.
3. Achieved a 40% reduction in memory usage by changing the data types
of categorical variables from object to category and dropping
unnecessary columns
7. HANDLING DATA AND MISSING VALUES:
1. Checked for null values in application_data and found that:
• OWN_CAR_AGE had 65.99%, OCCUPATION_TYPE had 31.35% and EXT_SOURCE_1
had 56.38% missing values
• Hence decided to drop these columns
2. We also checked for null values in previous_app and found that:
• RATE_INTEREST_PRIMARY had 99.64% null values
• RATE_INTEREST_PRIVILEGED had 99.64% null values
• Hence we dropped them
3. The external source data had some missing values. We imputed them with zero
because the external agencies had not provided a score for these customers,
which we took to mean that the client's account was not flagged as prone to
default. Hence the score was assumed to be zero.
4. Took the average of the EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 columns
to create the ext_sources column.
5. In previous_app, NAME_TYPE_SUITE had 49% missing values and was judged not to
affect whether the client will default. Hence, we dropped this column.
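The missing-value steps above can be sketched as follows. This is a hedged illustration on synthetic data, not the authors' implementation: the body of null_percentage is an assumption (the deck names the function but does not show it), and the demo values are made up. The deck lists EXT_SOURCE_1 both among the dropped columns and as an input to the average; this sketch keeps it so that the average can be computed.

```python
import numpy as np
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first (assumed body)."""
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)


# Synthetic stand-in for application_data (values are made up)
app = pd.DataFrame({
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 4.0],
    "EXT_SOURCE_1": [0.5, np.nan, 0.3, np.nan],
    "EXT_SOURCE_2": [0.6, 0.4, np.nan, 0.2],
    "EXT_SOURCE_3": [np.nan, 0.7, 0.3, 0.8],
    "TARGET": [0, 1, 0, 0],
})

pct = null_percentage(app)
print(pct)

# Drop a column judged to have too many missing values
app = app.drop(columns=["OWN_CAR_AGE"])

# Impute missing external scores with zero, then average the three scores
ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
app[ext_cols] = app[ext_cols].fillna(0)
app["ext_sources"] = app[ext_cols].mean(axis=1)
```

Note that imputing with zero pulls the average down for clients with missing scores, so a low ext_sources value can reflect missing data rather than a poor external rating.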
8. HANDLING DATA AND MISSING VALUES (CONTINUED):
6. Defined a function, null_percentage, to calculate the percentage of null values
in the columns of both data sets.
7. Since the data is imbalanced, we analysed the proportion of each category
and used stacked bar plots, as they make the comparison easier to
read.
8. Defined a function called stacker. It compares a categorical column
with the Target variable, accounts for the data imbalance by converting each
category into percentages, and plots a stacked chart of the proportions.
9. Merged the previous_app data set with the application data set to compare it with
the Target variable.
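A function like stacker, together with the merge in step 9, might look like the sketch below. This is an assumption-laden illustration rather than the authors' code: the function body, the join key SK_ID_CURR, the column NAME_CONTRACT_STATUS and all demo values are hypothetical. Plotting is optional and relies on matplotlib being installed.

```python
import pandas as pd


def stacker(df: pd.DataFrame, column: str, target: str = "TARGET", plot: bool = True) -> pd.DataFrame:
    """Share (in %) of each target value within every category of `column`."""
    # normalize="index" makes each category's row sum to 1, which removes
    # the effect of categories (and target classes) having very different sizes.
    table = pd.crosstab(df[column], df[target], normalize="index") * 100
    if plot:
        # pandas plotting needs matplotlib to be installed
        ax = table.plot(kind="bar", stacked=True)
        ax.set_ylabel("% of clients in category")
    return table


# Synthetic stand-ins (values and the SK_ID_CURR key are assumptions)
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "CODE_GENDER": ["F", "M", "F", "M"],
    "TARGET": [0, 1, 0, 0],
})
previous_app = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Refused", "Approved", "Approved"],
})

# Attach the target flag to every previous application of the same client
merged = previous_app.merge(app[["SK_ID_CURR", "TARGET"]], on="SK_ID_CURR", how="inner")

by_gender = stacker(app, "CODE_GENDER", plot=False)
by_status = stacker(merged, "NAME_CONTRACT_STATUS", plot=False)
print(by_gender)
print(by_status)
```

Calling stacker with plot=True draws the stacked bar chart; the returned table holds the same percentages, which is convenient for checking the numbers behind a plot.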
9. DATA VISUALIZATION
• Univariate analysis of the following variables:
1. Target
2. Income
3. Children count
• Bivariate analysis of the Target variable against the following:
1. Gender & age
2. Contract type
3. Average external score
4. Income & occupation type
5. Education type, etc.
• Multivariate analysis of the Target variable against the following:
1. Income and education type
2. Income and previous application status
10. TARGET V/S GENDER
Inference:
• The percentage of males who pay installments late is higher than that of females.
• The percentage of females who pay on time is higher than that of males.
11. TARGET V/S CONTRACT TYPE
Inference:
• Clients with Cash loans tend to pay late more often than clients with
Revolving loans.
12. TARGET V/S CAR
Inference:
• The percentage of clients without a car who pay installments late is slightly higher
than that of clients with a car.
13. TARGET V/S AVG_EXT_SCORE
Inference:
• 50% of the clients who delay their installment payments have a low average
external score, ranging from approximately 0.2 to 0.4.
• Clients who pay their installments on time have a moderate average score, ranging
from approximately 0.3 to 0.5.
• Some clients who received a very high score still delay their
installments.
14. TARGET V/S AMT INCOME
Inference:
• Among these classes, clients with an income of less than 2 lakhs p.a. tend to pay
installments late.
• Clients with an income of more than 6 lakhs p.a., i.e. the rich class, are more likely to pay on
time than the other classes.
15. TARGET V/S INCOME TYPE
Inference:
• Amongst all the income types, Others (maternity leave, students, unemployed clients, etc.) are the
ones who tend to pay installments late.
• Clients with the Businessman income type do not pay installments late.
• The working class also has a relatively high percentage of clients paying installments late, at 10%.
16. TARGET V/S FAMILY STATUS
Inference:
• Clients who are Single/not married and those in a Civil marriage tend to
pay installments late.
17. TARGET V/S HOUSING TYPE
Inference:
• Clients who live in rented apartments or with parents tend to pay
installments late.
• Clients who stay in office apartments pay installments on time.
18. TARGET V/S DOCUMENT 2
Inference:
• Clients who do not provide Document 2 tend to pay installments
late. Hence it is advisable to make this document mandatory.
19. TARGET V/S CLIENTS PROVIDING MOBILE NUMBERS
Inference:
• Clients who provide a mobile number tend to pay installments on time.
• Hence it is advisable to collect the mobile numbers of clients.
20. TARGET V/S AGE
Inference:
• Clients below the age of 25 tend to pay installments late.
• Clients aged 65 and above pay installments on time.
• A possible reason is that clients below 25 are less financially stable than
those above 65.
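An age comparison of this kind could be produced along the following lines. This is only a sketch under stated assumptions: it assumes age is stored in DAYS_BIRTH as a negative number of days relative to the application date, the band edges simply mirror the two thresholds named above, and the data is synthetic.

```python
import pandas as pd

# Synthetic stand-in (values are made up); DAYS_BIRTH is assumed to be
# a negative day count, so negating it gives the age in days.
app = pd.DataFrame({
    "DAYS_BIRTH": [-8000, -9500, -15000, -20000, -24000, -25000],
    "TARGET": [1, 0, 0, 1, 0, 0],   # 1 = paid late, 0 = paid on time
})

# Approximate age in whole years
app["AGE_YEARS"] = (-app["DAYS_BIRTH"]) // 365

# Left-closed bands: [0, 25), [25, 65), [65, 120)
bins = [0, 25, 65, 120]
labels = ["below 25", "25 to 64", "65 and above"]
app["AGE_BAND"] = pd.cut(app["AGE_YEARS"], bins=bins, labels=labels, right=False)

# Share of late payers (in %) within each age band
late_share = app.groupby("AGE_BAND", observed=False)["TARGET"].mean() * 100
print(late_share)
```

Because TARGET is a 0/1 flag, its mean within a band is the fraction of late payers in that band, which keeps the comparison fair across bands of different sizes.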
21. TARGET V/S OCCUPATION TYPE
Inference:
• Low-skill laborers, waiters/barmen staff, security staff, cooking staff, cleaning staff, drivers and laborers tend
to pay installments late.
• Most accountants, high-skill tech staff and HR staff pay installments on time.
• A likely reason is that they represent sectors with higher salaries.
22. TARGET V/S CNT_CHILDREN
Inference:
• Clients with more than 5 children tend to pay installments late.
• Most clients with 2 or 3 children pay installments on time.
23. TARGET V/S NAME_EDUCATION_TYPE
Inference:
• Clients with an academic degree pay installments on time.
• Clients with lower secondary education pay installments late.
24. MULTIVARIATE ANALYSIS ON NUMERIC VARIABLES
Inference:
• A high positive correlation is seen between goods price and credit amount
• A high positive correlation is seen between annuity amount and credit amount
• A high positive correlation is seen between annuity amount and goods price
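Pairwise correlations like these can be computed with a correlation matrix. The sketch below is illustrative only: the column names AMT_CREDIT, AMT_GOODS_PRICE and AMT_ANNUITY are assumed names for the three amounts, and the numbers are synthetic.

```python
from itertools import combinations

import pandas as pd

# Synthetic stand-in (values are made up and roughly proportional)
loans = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 260.0, 370.0, 450.0],
    "AMT_ANNUITY": [10.0, 19.0, 31.0, 38.0, 52.0],
})

# Pearson correlation between every pair of numeric columns
corr = loans.corr(method="pearson")

# Keep each unordered pair once instead of reading the full symmetric matrix
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

The same matrix can be passed to a heatmap function such as seaborn's heatmap for a visual overview.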
25. PROPORTIONS OF CLIENTS BASED ON PREVIOUS APPLICATION STATUS
Inference:
• Out of all loan applications, only 63% were approved.
• 17% were refused and 19% were cancelled by the clients.
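Proportions like these come from a normalized frequency count. A minimal sketch, assuming the status is stored in a column named NAME_CONTRACT_STATUS (an assumption) and using made-up data that does not reproduce the percentages above:

```python
import pandas as pd

# Synthetic stand-in (values are made up)
previous_app = pd.DataFrame({
    "NAME_CONTRACT_STATUS": ["Approved"] * 6 + ["Refused"] * 2 + ["Canceled"] * 2,
})

# normalize=True returns fractions of the total instead of raw counts
status_share = previous_app["NAME_CONTRACT_STATUS"].value_counts(normalize=True) * 100
print(status_share.round(1))
```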
26. HANDLING OUTLIERS
Inference:
• Outliers were observed in the annual income variable.
• 99% of clients had an annual income below 4.75 LPA
• Hence, when analyzing annual income, the analysis was limited to clients with an annual
income of less than 4.75 LPA
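Limiting the analysis to incomes below the 99th percentile can be sketched as follows. The data is synthetic, and the use of quantile(0.99) as the cut-off is an assumption about how the 4.75 LPA threshold was obtained.

```python
import pandas as pd

# Synthetic stand-in: 99 ordinary incomes plus one extreme outlier
app = pd.DataFrame({
    "AMT_INCOME_TOTAL": [float(x) for x in range(1, 100)] + [10000.0],
})

# Value below which 99% of the incomes fall (linear interpolation by default)
cap = app["AMT_INCOME_TOTAL"].quantile(0.99)

# Keep only the rows below the cut-off for income plots and summaries
trimmed = app[app["AMT_INCOME_TOTAL"] < cap]

print(f"cut-off: {cap:.2f}, rows kept: {len(trimmed)} of {len(app)}")
```

Filtering into a separate frame, rather than overwriting the original, keeps the high-income clients available for the analyses that do not involve income.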
27. TARGET V/S INCOME V/S EDUCATION TYPE
Inference:
• Clients with an academic degree and an income in the range of
3-3.6 lakhs pay installments late more often than those with a low income
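A two-way view of this kind can be built by binning income and pivoting on education type. The sketch below is an illustration under assumptions: the band edges are placeholders rather than the ones used in the deck, incomes are taken to be in rupees (1 lakh = 100,000), and the data is synthetic.

```python
import pandas as pd

# Synthetic stand-in (values are made up)
app = pd.DataFrame({
    "AMT_INCOME_TOTAL": [120000, 150000, 320000, 350000, 130000, 340000, 310000, 140000],
    "NAME_EDUCATION_TYPE": ["Academic degree"] * 4 + ["Lower secondary"] * 4,
    "TARGET": [0, 0, 1, 0, 1, 0, 0, 1],   # 1 = paid late
})

# Hypothetical income bands
bins = [0, 200000, 400000]
labels = ["up to 2 lakhs", "2 to 4 lakhs"]
app["INCOME_BAND"] = pd.cut(app["AMT_INCOME_TOTAL"], bins=bins, labels=labels)

# Percentage of late payers for every education type / income band pair
late_pct = app.pivot_table(
    index="NAME_EDUCATION_TYPE",
    columns="INCOME_BAND",
    values="TARGET",
    aggfunc="mean",
    observed=False,
) * 100
print(late_pct)
```

Cells backed by very few clients can show extreme percentages, so it is worth checking the group sizes (for example with aggfunc="count") before reading much into a single cell.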
28. TARGET V/S NAME_CASH_LOAN_PURPOSE
Inference:
• Clients who previously took a loan to make payments on other loans pay
installments late.
• They are followed by clients with Home/Office/Land loans and personal
household expenses, who also pay installments late.
29. TARGET V/S INCOME V/S PREVIOUS APPLICATION STATUS
Inference:
• Clients who took a loan for Business Development and have an annual income above
2.6 LPA pay installments late.
30. TARGET V/S PREVIOUS LOAN STATUS
Inference:
• Clients whose previous loan was Refused pay
installments late.
31. KEY INSIGHTS
• The following are strong indicators of default:
1. NAME_HOUSING_TYPE : Clients living in rented apartments
2. NAME_FAMILY_STATUS : Clients in a Civil marriage
and those who are Single/not married
3. NAME_INCOME_TYPE : Maternity leave, students,
unemployed clients
4. FLAG_DOCUMENT_2 : Clients who do not provide
Document 2
5. FLAG_MOBIL : Clients who do not provide a mobile number
6. OCCUPATION_TYPE : Low-skill laborers, laborers, waiters, barmen,
security staff
7. CNT_CHILDREN : Positive correlation between the number of
children and the chance of the client being a defaulter
8. NAME_EDUCATION_TYPE : Clients with lower secondary,
secondary/secondary special and incomplete higher education
9. EDUCATION_TYPE : Clients with an academic degree and an annual
income between 3 and 3.6 lakhs
10. CASH_LOAN_PURPOSE : Clients whose previous loan purpose was
payment on other loans
• The following clients should be targeted:
1. CODE_GENDER : Females
2. NAME_CONTRACT_TYPE : Clients with revolving loans
3. FLAG_CODE_CAR : Clients with a car
4. AVG_EXT_SCORE : Clients with a moderate external score
5. AMT_INCOME_TOTAL : Clients with an annual income
greater than 6 lakhs
6. NAME_INCOME_TYPE : Businessmen and pensioners
7. FLAG_MOBIL : Clients who provide a mobile number
8. DAYS_BIRTH : Clients aged 65 and above
9. OCCUPATION_TYPE : Accountants, high-skill tech staff and
HR staff
10. NAME_EDUCATION_TYPE : Clients with an academic degree
32. CONCLUSION
• Based on the inferences obtained, a credit score can be
set
• Variables that contribute towards the chance of a client
being a defaulter will be given a low score
• Variables that contribute towards the chance of a client paying
the installments on time will be given a high score
• Based on the final credit score, the bank can take the following
decisions:
1. Grant the loan to clients with a healthy overall credit score
2. Grant the loan at a higher interest rate to clients with
comparatively low credit scores
3. Reject the loan for clients with an extremely low credit score