Loan EDA Case Study
Prepared by:
Amit Kumar Das
Date: 30/10/2022
Problem Statement:
• Aims to give you an idea of applying EDA in a real business scenario.
• Applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services.
• Understand how data is used to minimize the risk of losing money while lending to customers.
• Present the overall approach of the analysis in a presentation.
• Mention the problem statement and the analysis approach briefly.
• Identify the missing data and use an appropriate method to deal with it (remove the columns or replace the missing values with an appropriate value).
• Identify whether there are outliers in the dataset, and mention why you think each one is an outlier. Remember that for this exercise it is not necessary to remove any data points.
• Identify whether there is data imbalance in the data, and find the ratio of data imbalance.
• Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
• Include visualizations and summarize the most important results in the presentation. You are free to choose the graphs which explain the numerical/categorical variables.
• Insights should explain why the variable is important for differentiating the clients with payment difficulties from all other cases.
Columns having null values:
• Number of columns with more than 50% null values: 41. These columns should be dropped.
• Number of columns with less than 15% null values: 13. These columns will be imputed with suitable values, which are explained subsequently.
• Seven variables were selected for the imputation analysis.
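The null-value screening above can be sketched in pandas as follows. This is a minimal illustration, not the code used for the case study: the helper names, the toy frame and the MOSTLY_EMPTY column are hypothetical, and only the 50% threshold comes from the slide.

```python
import numpy as np
import pandas as pd


def null_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 50.0) -> pd.DataFrame:
    """Drop every column whose null percentage exceeds `threshold`."""
    pct = null_percentage(df)
    return df.drop(columns=pct[pct > threshold].index)


# Toy frame standing in for the application data (all values are made up).
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [450000.0, np.nan, 225000.0, 675000.0],  # 25% null
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0],              # hypothetical, 75% null
    "TARGET": [0, 1, 0, 0],
})

cleaned = drop_sparse_columns(toy)
print(null_percentage(toy))
print(list(cleaned.columns))
```

Columns that survive this step but still contain nulls (here AMT_GOODS_PRICE) are the candidates for imputation.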
Data imputation analysis for columns having <15% null values:
• Continuous variables: 'EXT_SOURCE_2', 'AMT_GOODS_PRICE'
• Categorical variables: 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'NAME_TYPE_SUITE'
• For 'EXT_SOURCE_2' there are no outliers, so missing values can be imputed with the mean or the median (median: 0.565).
• There is a high number of outliers in the AMT_GOODS_PRICE data. Hence it is recommended to impute with the median value, i.e. 450000.
• For categorical variables, the value to impute should be the most frequent value (the mode). So the values to be imputed are:
NAME_TYPE_SUITE: Unaccompanied
OBS_30_CNT_SOCIAL_CIRCLE: 0.0
DEF_30_CNT_SOCIAL_CIRCLE: 0.0
OBS_60_CNT_SOCIAL_CIRCLE: 0.0
DEF_60_CNT_SOCIAL_CIRCLE: 0.0
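A minimal sketch of the two imputation rules described above (median for continuous variables with outliers, most frequent value for categorical variables). The helper names and the toy values are assumptions chosen for illustration; they are not taken from the case-study notebook.

```python
import numpy as np
import pandas as pd


def impute_median(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill nulls in each listed column with that column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out


def impute_mode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill nulls in each listed column with its most frequent value."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data (made up); row 2 is missing both values.
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [225000.0, 450000.0, np.nan, 900000.0],
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", None, "Unaccompanied"],
})

filled = impute_mode(impute_median(toy, ["AMT_GOODS_PRICE"]), ["NAME_TYPE_SUITE"])
print(filled)
```

The median is preferred over the mean for AMT_GOODS_PRICE because a few very large values would pull the mean upwards, while the median is unaffected by them.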
Checking the outliers for numerical variables:
• The first quartile is almost missing for CNT_CHILDREN, which means most of the data lie in the first quartile.
• A single high-value data point is present as an outlier in AMT_INCOME_TOTAL and DAYS_EMPLOYED. Removing this point would drastically change the box plot for further analysis.
• The first quartile is slim compared to the third quartile for AMT_CREDIT, AMT_ANNUITY and DAYS_REGISTRATION. This means the data are skewed towards the first quartile.
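The box-plot reading above can be cross-checked numerically. The sketch below flags points outside 1.5 times the interquartile range, which is the usual box-plot whisker convention; the slides do not state which rule was applied, so treat the rule, the function name and the toy incomes as assumptions.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Illustrative incomes (made up) with one extreme value.
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 117000000],
    name="AMT_INCOME_TOTAL",
)

flagged = iqr_outliers(income)
print(flagged)
```

Nothing is removed here: in line with the problem statement, the points are only identified and reported.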
Univariate analysis for categorical variables
AMT_INCOME_RANGE:
• People with an income of 100000-200000 have a higher number of loans and also a higher number of defaulters.
• The income segment above 500000 has fewer defaulters.
AMT_CREDIT_RANGE:
• People with loans below 100000 default less.
• Segments above 100000 have an almost equal percentage of loan defaulters.
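The range variables above are binned versions of the numeric columns. A minimal sketch of the binning and of the per-range default percentage is given below; only the 100000-200000 and >500000 segments are named on the slide, so the remaining bin edges, the labels and the toy rows are assumptions.

```python
import pandas as pd

# Toy data (made up): TARGET 1 = payment difficulties, 0 = all other cases.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [90000, 150000, 180000, 120000, 600000, 250000],
    "TARGET": [0, 1, 0, 1, 0, 0],
})

# Assumed bin edges; pd.cut uses right-closed intervals, so 100000 itself
# falls in the "<100000" bin.
bins = [0, 100000, 200000, 300000, 400000, 500000, float("inf")]
labels = ["<100000", "100000-200000", "200000-300000",
          "300000-400000", "400000-500000", ">500000"]
toy["AMT_INCOME_RANGE"] = pd.cut(toy["AMT_INCOME_TOTAL"], bins=bins, labels=labels)

# Number of loans and percentage of defaulters per income range.
summary = toy.groupby("AMT_INCOME_RANGE", observed=True)["TARGET"].agg(
    loans="count", default_pct="mean"
)
summary["default_pct"] *= 100
print(summary)
```

Reporting the count next to the percentage matters: a range can hold most of the defaulters simply because it holds most of the loans.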
Univariate analysis for categorical variables
NAME_INCOME_TYPE:
• Students, pensioners and businessmen have a higher percentage of loan repayment.
• Working, State servant and Commercial associate clients have a higher default percentage.
• The Maternity category has significantly more problems with repayment.
NAME_CONTRACT_TYPE:
• The 'Cash loans' contract type has a higher number of credits than the 'Revolving loans' contract type.
• From the graphs we can see that Revolving loans account for a small amount compared to Cash loans, but the percentage of non-payment for Revolving loans is comparatively high.
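Comparing counts with percentages, as done above for the contract types, can be sketched as follows. The helper name is hypothetical and the rows are made-up numbers, not results from the case study; the sketch assumes TARGET is 1 for clients with payment difficulties and 0 otherwise.

```python
import pandas as pd


def default_pct_by(df: pd.DataFrame, column: str, target: str = "TARGET") -> pd.Series:
    """Percentage of rows with target == 1 for each category of `column`."""
    return (df.groupby(column)[target].mean() * 100).sort_values(ascending=False)


# Toy data (made up).
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Revolving loans"] * 2,
    "TARGET": [0, 0, 0, 1, 1, 0],
})

counts = toy["NAME_CONTRACT_TYPE"].value_counts()   # how many credits per type
pct = default_pct_by(toy, "NAME_CONTRACT_TYPE")     # how often each type defaults
print(counts)
print(pct)
```

The same helper applies unchanged to NAME_INCOME_TYPE, CODE_GENDER or FLAG_OWN_CAR.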
Univariate analysis for categorical variables
CODE_GENDER:
• The percentage of defaulters is higher among males than females.
FLAG_OWN_CAR:
• Clients who own a car have a higher percentage of defaulters.
Univariate analysis for continuous variables
• DAYS_BIRTH: older people have a higher probability of repayment.
• Some outliers are observed in 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_EMPLOYED' and 'DAYS_LAST_PHONE_CHANGE'.
• Fewer outliers are observed in DAYS_BIRTH.
• DAYS_EMPLOYED: removing its single high-value outlier would drastically change the box plot for further analysis.
Univariate analysis for continuous variables
• Fewer outliers are observed in DAYS_ID_PUBLISH.
• The first quartile is smaller than the third quartile in 'AMT_ANNUITY', 'AMT_GOODS_PRICE' and 'DAYS_LAST_PHONE_CHANGE'.
• DAYS_ID_PUBLISH: people who changed their ID recently are relatively more prone to default.
Bivariate analysis for numerical variables – Target 0 (Client having no payment difficulties)
• Clients with an Academic degree and a family status of 'civil marriage', 'marriage' or 'separated' have a higher number of credits than others.
• Clients with Higher education and a family status of 'marriage', 'single' or 'civil marriage' have more outliers.
• For civil marriage with an Academic degree, most of the credits lie in the third quartile.
Bivariate analysis for numerical variables – Target 0 (Client having no payment difficulties)
• For education type 'Higher education', the income amount is mostly equal across family statuses, and it contains many outliers.
• There are fewer outliers for Academic degree, but its income amount is a little higher than that of Higher education.
• Lower secondary clients with civil marriage family status have a lower income amount than others.
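The grouped box-plot observations for Target 0 can be backed by a quartile table per education type and family status. This is a sketch under assumptions: the column names NAME_EDUCATION_TYPE and NAME_FAMILY_STATUS are assumed names for the education and family-status variables mentioned on the slides, the helper is hypothetical, and the rows are made up.

```python
import pandas as pd


def quartiles_by_group(df: pd.DataFrame, value: str, by: list) -> pd.DataFrame:
    """Q1, median and Q3 of `value` for every combination of the `by` columns."""
    return df.groupby(by)[value].quantile([0.25, 0.5, 0.75]).unstack()


# Toy data (made up); the last row belongs to Target 1 and is filtered out.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 0, 0, 1],
    "NAME_EDUCATION_TYPE": ["Higher education"] * 3 + ["Academic degree"] * 3
                           + ["Higher education"],
    "NAME_FAMILY_STATUS": ["Married"] * 7,
    "AMT_CREDIT": [100000, 200000, 300000, 400000, 500000, 600000, 900000],
})

target0 = toy[toy["TARGET"] == 0]
table = quartiles_by_group(
    target0, "AMT_CREDIT", ["NAME_EDUCATION_TYPE", "NAME_FAMILY_STATUS"]
)
print(table)
```

Running the same call on the rows with TARGET equal to 1 gives the matching table for clients with payment difficulties.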
Bivariate analysis for numerical variables – Target 1 (Client having payment difficulties)
• Observations are quite similar to Target 0.
• Clients with an Academic degree and a family status of 'civil marriage', 'marriage' or 'separated' have a higher number of credits than others.
• Most of the outliers are from education types 'Higher education' and 'Secondary'.
• For civil marriage with an Academic degree, most of the credits lie in the third quartile.
Bivariate analysis for numerical variables – Target 1 (Client having payment difficulties)
• There is also some similarity with Target 0.
• For education type 'Higher education', the income amount is mostly equal across family statuses.
• There are fewer outliers for Academic degree, but its income amount is a little higher than that of Higher education.
• Lower secondary clients have a lower income amount than others.
Correlation
Figure panels: Target 0, Target 1
• From the correlation analysis it is inferred that the highest correlation (1.0) is between OBS_60_CNT_SOCIAL_CIRCLE and OBS_30_CNT_SOCIAL_CIRCLE, and between FLOORSMAX_MEDI and FLOORSMAX_AVG, which is the same for both datasets.
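A ranked list of the most correlated pairs can be read off the correlation matrix as sketched below. The function name and the toy columns' values are assumptions for illustration; in the analysis the same call would be applied separately to the Target 0 and Target 1 subsets.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n column pairs with the highest absolute correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)


# Toy data (made up): the two OBS columns are identical, so they correlate at 1.0.
toy = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [0, 1, 2, 3, 4],
    "OBS_60_CNT_SOCIAL_CIRCLE": [0, 1, 2, 3, 4],
    "AMT_CREDIT": [5, 1, 4, 2, 3],
})

top = top_correlations(toy, n=1)
print(top)
```

A pair correlated at 1.0 carries the same information twice, so one of the two columns can usually be left out of further analysis.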
Univariate analysis for combined dataset (Distribution of contract status with purpose)
Most loan rejections came from the purpose 'Repairs'. For education purposes, the numbers of approvals and rejections are equal. 'Paying other loans' and 'Buying a new car' have significantly more rejections than approvals.
Univariate analysis for combined dataset (Distribution of the purpose with target)
Loans with the purpose 'Repairs' face more difficulties in on-time payment. For a few purposes, on-time payments are significantly higher than payment difficulties: 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'. Hence we can focus on these purposes, for which clients have minimal payment difficulties.
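The two distributions above (contract status by purpose, and target by purpose) can be sketched with a merge followed by cross-tabulations. The slides do not show how the combined dataset was built, so the join key SK_ID_CURR, the column names NAME_CASH_LOAN_PURPOSE and NAME_CONTRACT_STATUS, and all rows below are assumptions.

```python
import pandas as pd

# Toy stand-ins (made up) for the current and previous application data.
application = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
})
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Education", "Repairs", "Education"],
    "NAME_CONTRACT_STATUS": ["Refused", "Approved", "Refused", "Refused"],
})

# One row per previous application, carrying the client's current TARGET.
combined = application.merge(previous, on="SK_ID_CURR", how="inner")

# Distribution of contract status with purpose.
status_by_purpose = pd.crosstab(
    combined["NAME_CASH_LOAN_PURPOSE"], combined["NAME_CONTRACT_STATUS"]
)
# Distribution of the purpose with target.
target_by_purpose = pd.crosstab(
    combined["NAME_CASH_LOAN_PURPOSE"], combined["TARGET"]
)
print(status_by_purpose)
print(target_by_purpose)
```

Because a client can have several previous applications, the combined frame counts applications rather than clients.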
Bivariate analysis for combined dataset
• The credit amount for loan purposes like 'Buying a home', 'Buying a land', 'Buying a new car' and 'Building a house' is higher.
• The income type 'State servant' has a significant amount of credit applied.
• 'Money for a third person' and 'Hobby' have less credit applied.
Bivariate analysis for combined dataset
• For housing type, office apartment has higher credit for Target 0 and co-op apartment has higher credit for Target 1.
• So we can conclude that the bank should avoid giving loans to clients with the housing type co-op apartment, as they have difficulties in payment.
• The bank can focus mostly on the housing types 'With parents', 'House apartment' or 'Municipal apartment' for successful payments.
Conclusion/Recommendation:
• Banks should focus more on income types 'Student', 'Pensioner' and 'Businessman' with housing types other than 'Co-op apartment' for successful payments.
• Banks should focus less on income type 'Working', as it has the highest number of unsuccessful payments.
• For loan purpose 'Repairs':
  • Despite the higher number of rejections for this purpose, difficulties in on-time payment are still observed.
  • There are a few cases where the delay in loan payment is significantly high.
  • The bank should continue to exercise caution while giving loans for this purpose.
• The bank should avoid giving loans to clients with the housing type co-op apartment, as they have difficulties in payment.
• The bank can focus mostly on the housing types 'With parents', 'House apartment' and 'Municipal apartment' for successful payments.
