Credit EDA Case Study
2. Problem Statement:
This case study aims to give you an idea of applying EDA in a real business scenario.
By applying the techniques that you have learnt in the EDA module, you will also develop a basic
understanding of risk analytics in banking and financial services.
Understand how data is used to minimize the risk of losing money while lending to customers.
Present the overall approach of the analysis in a presentation.
Mention the problem statement and the analysis approach briefly.
Identify the missing data and use an appropriate method to deal with it (remove columns or replace
the missing values with an appropriate value).
Identify whether there are outliers in the dataset, and mention why you think each one is an outlier.
Remember that for this exercise it is not necessary to remove any data points.
Identify whether there is data imbalance in the data. Find the ratio of data imbalance.
Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
Include visualizations and summarize the most important results in the presentation. You are free
to choose the graphs that explain the numerical/categorical variables.
Insights should explain why the variable is important for differentiating the clients with payment
difficulties from all other cases.
3. Columns having null values:
Number of columns with more than 50% null values: 41.
These columns should be dropped.
Number of columns with less than 15% null values: 13.
These columns shall be imputed with suitable values, which are explained subsequently.
Seven variables were selected for the imputation analysis.
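The null-value check above can be sketched with pandas as follows. This is a minimal illustration, not the code used for the slides: the helper names `null_percentage` and `drop_sparse_columns` are hypothetical, and the toy frame only borrows three column names with made-up values.

```python
import numpy as np
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 50.0) -> pd.DataFrame:
    """Drop every column whose missing-value percentage exceeds `threshold`."""
    pct_missing = null_percentage(df)
    to_drop = pct_missing[pct_missing > threshold].index
    return df.drop(columns=to_drop)


# Toy data standing in for the application dataset (values are invented).
toy = pd.DataFrame({
    "EXT_SOURCE_2": [0.5, np.nan, 0.6, 0.7],          # 25% missing: keep and impute
    "FLOORSMAX_AVG": [np.nan, np.nan, np.nan, 0.1],   # 75% missing: drop
    "AMT_CREDIT": [100000.0, 250000.0, 90000.0, 400000.0],
})

pct = null_percentage(toy)
cleaned = drop_sparse_columns(toy, threshold=50.0)
print(pct)
print(list(cleaned.columns))
```

The 50% cut-off is passed as a parameter so the same helper can also be used to list the columns below 15% that are candidates for imputation.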
4. Data imputation analysis for columns having <15% null values:
• Continuous variables:
'EXT_SOURCE_2'
'AMT_GOODS_PRICE'
• Categorical variables:
'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'NAME_TYPE_SUITE'
• 'EXT_SOURCE_2' has no outliers, so its missing values can be imputed with the mean
or the median (median: 0.565).
• AMT_GOODS_PRICE has a high number of outliers. Hence it is recommended to impute
its missing values with the median, i.e. 450000.
• For categorical variables, the value to impute is the one with the maximum frequency (the mode).
So the values to be imputed are:
NAME_TYPE_SUITE: Unaccompanied
OBS_30_CNT_SOCIAL_CIRCLE: 0.0
DEF_30_CNT_SOCIAL_CIRCLE: 0.0
OBS_60_CNT_SOCIAL_CIRCLE: 0.0
DEF_60_CNT_SOCIAL_CIRCLE: 0.0
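The imputation rule above (median for a continuous variable with outliers, most frequent value for a categorical one) could look like this in pandas. It is a sketch under assumptions: the helpers `impute_median` and `impute_mode` are hypothetical names, and the toy values are invented, so the resulting median differs from the 450000 reported for the real data.

```python
import numpy as np
import pandas as pd


def impute_median(series: pd.Series) -> pd.Series:
    """Fill missing values with the median, which is robust to outliers."""
    return series.fillna(series.median())


def impute_mode(series: pd.Series) -> pd.Series:
    """Fill missing values with the most frequent value.

    mode() may return several values on a tie; the first one is used here.
    """
    return series.fillna(series.mode().iloc[0])


# Toy data (invented values) with one missing entry per column.
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [225000.0, 450000.0, np.nan, 675000.0, 9000000.0],
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", None,
                        "Unaccompanied", "Unaccompanied"],
})

toy["AMT_GOODS_PRICE"] = impute_median(toy["AMT_GOODS_PRICE"])
toy["NAME_TYPE_SUITE"] = impute_mode(toy["NAME_TYPE_SUITE"])
print(toy)
```

The extreme value 9000000 in the toy column pulls the mean far upwards but leaves the median untouched, which is the reason the median is preferred for AMT_GOODS_PRICE.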
5. Checking the outliers for numerical variables:
The first quartile is almost missing for CNT_CHILDREN, which means most of the data
lie in the first quartile.
There is a single high-value data point present as an outlier in AMT_INCOME_TOTAL
and DAYS_EMPLOYED. Removing this point will drastically change the box plot for
further analysis.
The first quartile is slim compared to the third quartile for AMT_CREDIT, AMT_ANNUITY
and DAYS_REGISTRATION. This means the data are skewed towards the first quartile.
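One common way to flag the points that a box plot draws as outliers is the 1.5 × IQR rule. The sketch below assumes that rule (the slides do not state which criterion was used), and the income values are invented to mimic a single extreme point.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Invented incomes with one extreme value, mimicking a single high outlier.
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 202500, 117000000],
    name="AMT_INCOME_TOTAL",
)

outliers = iqr_outliers(income)
print(outliers)
```

Because the exercise does not require removing data points, the function only reports the flagged values and leaves the series unchanged.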
6. Univariate analysis for categorical variables
AMT_INCOME_RANGE:
People with an income of 100000-200000 have a higher number of loans and also a higher
number of defaulters.
The income segment above 500000 has fewer defaulters.
AMT_CREDIT_RANGE:
People with loans below 100000 default less.
Credit ranges above 100000 show an almost equal percentage of loan defaulters.
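A range variable such as AMT_INCOME_RANGE can be derived by binning the raw amount and then summarizing the target per bin. The sketch below is illustrative only: the bin edges, the labels, and the name of the target column (`TARGET`, with 1 meaning payment difficulties) are assumptions, and the rows are invented.

```python
import pandas as pd

# Invented rows; TARGET = 1 is assumed to mean "client with payment difficulties".
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [80000, 120000, 150000, 180000, 250000, 600000],
    "TARGET": [0, 1, 0, 1, 0, 0],
})

# Hypothetical bin edges and labels; the deck only names two of the ranges.
bins = [0, 100000, 200000, 300000, 400000, 500000, float("inf")]
labels = ["<100000", "100000-200000", "200000-300000",
          "300000-400000", "400000-500000", ">500000"]

toy["AMT_INCOME_RANGE"] = pd.cut(toy["AMT_INCOME_TOTAL"], bins=bins, labels=labels)

# Number of loans and share of defaulters in each observed income range.
summary = (
    toy.groupby("AMT_INCOME_RANGE", observed=True)["TARGET"]
       .agg(loans="count", default_rate="mean")
)
summary["default_pct"] = summary["default_rate"] * 100
print(summary)
```

Reporting both the count and the percentage per range keeps apart the two statements made on the slide: which segment takes the most loans, and which segment defaults most often.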
7. Univariate analysis for categorical variables
NAME_INCOME_TYPE:
Students, pensioners and businessmen have a higher percentage of loan repayment.
Working, State servant and Commercial associate clients have a higher
default percentage.
The Maternity category has significantly more problems in repayment.
NAME_CONTRACT_TYPE:
The 'Cash loans' contract type has a higher number of credits than the
'Revolving loans' contract type.
From the graphs we can see that revolving loans are a small amount compared
to cash loans, but the % of non-payment for revolving loans is comparatively high.
8. Univariate analysis for categorical variables
CODE_GENDER:
The % of defaulters is higher among males than females.
FLAG_OWN_CAR:
Clients who own a car have a higher percentage of defaulters.
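Percentages like the ones quoted for CODE_GENDER and FLAG_OWN_CAR can be obtained with a row-normalized cross-tabulation. In this sketch the helper name `default_pct_by` and the target column name `TARGET` are assumptions, and the toy rows are invented, so the numbers do not reproduce the deck's results.

```python
import pandas as pd


def default_pct_by(df: pd.DataFrame, column: str, target: str = "TARGET") -> pd.DataFrame:
    """Percentage of each target class within every category of `column`."""
    return pd.crosstab(df[column], df[target], normalize="index") * 100


# Invented rows: 4 male clients (1 defaulter) and 5 female clients (1 defaulter).
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F", "F"],
    "TARGET":      [1,   0,   0,   0,   1,   0,   0,   0,   0],
})

rates = default_pct_by(toy, "CODE_GENDER")
print(rates)
```

Normalizing by row rather than comparing raw counts matters here, because the categories can differ a lot in size.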
9. Univariate analysis for continuous variables
DAYS_BIRTH: Older people have a higher probability of repayment.
Some outliers are observed in 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_EMPLOYED'
and 'DAYS_LAST_PHONE_CHANGE' in the dataset.
Fewer outliers are observed in DAYS_BIRTH.
DAYS_EMPLOYED: removing the single high-value outlier will drastically change
the box plot for further analysis.
10. Univariate analysis for continuous variables
Fewer outliers are observed in DAYS_ID_PUBLISH.
The first quartile is smaller than the third quartile in 'AMT_ANNUITY',
'AMT_GOODS_PRICE' and 'DAYS_LAST_PHONE_CHANGE'.
DAYS_ID_PUBLISH: people who changed their ID recently are relatively
prone to default.
11. Bivariate analysis for numerical variables – Target 0 (Client having no payment difficulties)
Clients with an Academic degree and a family status of 'civil marriage',
'marriage' or 'separated' have a higher number of credits than others.
Also, clients with Higher education and a family status of 'marriage',
'single' or 'civil marriage' have more outliers.
For Academic degree with civil marriage, most of the credits lie in the
third quartile.
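The box-plot comparison above can also be checked numerically by computing the quartiles of the credit amount per education type and family status. This is a sketch with invented rows; the column names `NAME_EDUCATION_TYPE`, `NAME_FAMILY_STATUS` and `TARGET` are assumptions, since the slides only refer to "Education type", "Family status" and "Target 0".

```python
import pandas as pd

# Invented rows; TARGET = 0 is assumed to mean "no payment difficulties".
toy = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Academic degree"] * 4 + ["Higher education"] * 2,
    "NAME_FAMILY_STATUS": ["Civil marriage"] * 4 + ["Married"] * 2,
    "AMT_CREDIT": [300000.0, 500000.0, 700000.0, 9000000.0, 200000.0, 400000.0],
    "TARGET": [0, 0, 0, 1, 0, 0],
})

# Keep only Target 0 clients, as on the slide.
target0 = toy[toy["TARGET"] == 0]

# Quartiles of the credit amount for every education / family-status pair.
quartiles = (
    target0.groupby(["NAME_EDUCATION_TYPE", "NAME_FAMILY_STATUS"])["AMT_CREDIT"]
           .describe()[["25%", "50%", "75%"]]
)
print(quartiles)
```

Running the same lines on the Target 1 subset gives the table needed for the side-by-side comparison made on the later slides.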
12. Bivariate analysis for numerical variables – Target 0 (Client having no payment difficulties)
For the education type 'Higher education', the income amount is mostly equal
across family statuses. It does contain many outliers.
Academic degree has fewer outliers, but its income amount is a little higher
than that of Higher education.
Lower secondary clients with a civil marriage family status have a lower income
amount than others.
13. Bivariate analysis for numerical variables – Target 1 (Client having payment difficulties)
Observations are quite similar to Target 0.
Clients with an Academic degree and a family status of 'civil marriage',
'marriage' or 'separated' have a higher number of credits than others.
Most of the outliers are from the education types 'Higher education'
and 'Secondary'.
For Academic degree with civil marriage, most of the credits lie in the
third quartile.
14. Bivariate analysis for numerical variables – Target 1 (Client having payment difficulties)
There are also some similarities with Target 0.
For the education type 'Higher education', the income amount is mostly equal
across family statuses.
Academic degree has fewer outliers, but its income amount is a little higher
than that of Higher education.
Lower secondary clients have a lower income amount than others.
15. Correlation
[Figure panels: Target 0, Target 1]
From the correlation analysis it is inferred that the highest correlation (1.0)
is between OBS_60_CNT_SOCIAL_CIRCLE and OBS_30_CNT_SOCIAL_CIRCLE, and between
FLOORSMAX_MEDI and FLOORSMAX_AVG. This is the same for both datasets.
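The most strongly correlated column pairs can be listed programmatically instead of being read off a plot. The sketch below is one possible approach, with a hypothetical helper `top_correlations` and invented values in which the two social-circle columns are identical, so their correlation is 1.0.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n column pairs with the highest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (every column with itself) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Invented numeric columns; the first two are deliberately identical.
toy = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [0.0, 1.0, 2.0, 5.0, 0.0, 3.0],
    "OBS_60_CNT_SOCIAL_CIRCLE": [0.0, 1.0, 2.0, 5.0, 0.0, 3.0],
    "AMT_CREDIT": [500000.0, 100000.0, 300000.0, 200000.0, 400000.0, 250000.0],
})

top = top_correlations(toy, n=3)
print(top)
```

Applying the helper separately to the Target 0 and Target 1 subsets allows the two rankings to be compared, which is how the slide arrives at "the same for both datasets".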
16. Univariate analysis for combined dataset (Distribution of contract status with purpose)
Most loan rejections came from the purpose 'Repairs'. For education purposes we have an equal number of approvals and rejections. Paying other loans and
buying a new car have significantly more rejections than approvals.
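The combined dataset and the status-by-purpose counts can be sketched as a merge followed by a cross-tabulation. Everything named here is an assumption for illustration: the key `SK_ID_CURR`, the columns `NAME_CASH_LOAN_PURPOSE` and `NAME_CONTRACT_STATUS`, the frame names, and the rows themselves are not taken from the slides.

```python
import pandas as pd

# Invented current applications (one row per client).
application = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
})

# Invented previous applications (several rows per client are possible).
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3, 3],
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Education", "Repairs",
                               "Repairs", "Education"],
    "NAME_CONTRACT_STATUS": ["Refused", "Approved", "Refused",
                             "Approved", "Refused"],
})

# Combine both sources on the client identifier.
combined = application.merge(previous, on="SK_ID_CURR", how="inner")

# Count approvals and refusals for every loan purpose.
status_by_purpose = pd.crosstab(
    combined["NAME_CASH_LOAN_PURPOSE"], combined["NAME_CONTRACT_STATUS"]
)
print(status_by_purpose)
```

Replacing `NAME_CONTRACT_STATUS` with `TARGET` in the cross-tabulation gives the purpose-versus-target view used on the following slide.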
17. Univariate analysis for combined dataset (Distribution of the purpose with target)
Loans with the purpose 'Repairs' face more difficulties in payment on time. There are a few purposes where loan repayment is significantly higher than payment
difficulties. They are 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'. Hence we can focus on these purposes, for
which clients have minimal payment difficulties.
18. Bivariate analysis for combined dataset
The credit amount is higher for loan purposes like 'Buying a home',
'Buying a land', 'Buying a new car' and 'Building a house'.
State servants as an income type have a significant amount of credit applied.
'Money for third person' or 'Hobby' has less credit applied.
19. Bivariate analysis for combined dataset
For housing type, office apartment has higher credit for Target 0 and co-op
apartment has higher credit for Target 1.
So we can conclude that the bank should avoid giving loans to clients with the
co-op apartment housing type, as they have difficulties in payment.
The bank can focus mostly on the housing types 'with parents', 'House/apartment'
or 'municipal apartment' for successful payments.
20. Conclusion/Recommendation:
Banks should focus more on the income types 'Student', 'Pensioner' and 'Businessman' with
housing types other than 'Co-op apartment' for successful payments.
Banks should focus less on the income type 'Working', as it has the highest number of
unsuccessful payments.
In loan purpose 'Repairs':
Despite the higher number of rejections for loans with the purpose 'Repairs', difficulties
in on-time payment are still observed.
There are a few places where the delay in loan payment is significantly high.
Banks should continue to exercise caution while giving loans for this purpose.
Banks should avoid giving loans to clients with the co-op apartment housing type, as they have
difficulties in payment.
Banks can focus mostly on the housing types 'with parents', 'House/apartment' and 'municipal
apartment' for successful payments.