Exploratory Data Analysis on German Credit Data

E.D.A
By
Adithi – E19002
Bhaswani – E19009
Neha – E19018

BRIEF OVERVIEW:
• Objective: to identify the attributes that most influence the
decision to accept or reject a loan application.
• Context of the data set: the original dataset contains 1000
entries with 20 categorical/symbolic attributes. Each entry
represents a person who takes credit from a bank. Each person
is classified as a good or bad credit risk according to the
set of attributes.

S.No  Variable (Data type)
      Description - code

1  Credibility (Categorical)
      credit-worthy [good risk] - 1
      not credit-worthy [bad risk] - 0
2  Balance of current account (Categorical)
      no running account - 1
      no balance or debit - 2
      0 <= ... < 200 DM - 3
      ... >= 200 DM or checking account for at least 1 year - 4
3  Duration in months (metric) (Numerical)
      [<= 12] - up to 1 year
      [12 < ... <= 24] - 1-2 years
      [24 < ... <= 36] - 2-3 years
      [36 < ... <= 48] - 3-4 years
      [48 < ... <= 60] - 4-5 years
      [60 < ... <= 72] - 5-6 years
4  Payment of previous credits (Categorical)
      hesitant payment of previous credits - 0
      problematic running account / there are further credits running but at other banks - 1
      no previous credits / paid back all previous credits - 2
      no problems with current credits at this bank - 3
      paid back previous credits at this bank - 4
5  Purpose of credit (Categorical)
      other - 0
      new car - 1
      used car - 2
      items of furniture - 3
      radio / television - 4
      household appliances - 5
      repair - 6
      education - 7
      vacation - 8
      retraining - 9
      business - 10
ATTRIBUTES:

6  Amount of credit in DM (Numerical)
      [<= 1500] - 1
      [1500 < ... <= 4500] - 2
      [4500 < ... <= 7500] - 3
      [7500 < ... <= 10500] - 4
      [10500 < ... <= 13500] - 5
      [13500 < ... <= 16500] - 6
      [> 16500] - 7
7  Value of savings or stocks (Categorical)
      not available / no savings - 1
      [< 100] - 2
      [100 <= ... < 500] - 3
      [500 <= ... < 1000] - 4
      [>= 1000] - 5
8  Has been employed by current employer for (Categorical)
      unemployed - 1
      [< 1] - 2
      [1 <= ... < 4] - 3
      [4 <= ... < 7] - 4
      [>= 7] - 5
9  Instalment rate in % of available income (Categorical)
      [>= 35] - 1
      [25 <= ... < 35] - 2
      [20 <= ... < 25] - 3
      [< 20] - 4
10 Marital status / sex (Categorical)
      male: divorced / living apart - 1
      male: single - 2
      male: married / widowed - 3
      female - 4
11 Further debtors / guarantors (Categorical)
      none - 1
      co-applicant - 2
      guarantor - 3
12 Living in current household for (Categorical)
      [< 1] year - 1
      [1 <= ... < 4] years - 2
      [4 <= ... < 7] years - 3
      [>= 7] years - 4

13 Most valuable available assets (Categorical)
      not available / no assets - 1
      car / other - 2
      savings contract with a building society / life insurance - 3
      ownership of house or land - 4
14 Age in years (categorized) (Numerical)
      [0 <= ... <= 25] - 1
      [26 <= ... <= 39] - 2
      [40 <= ... <= 59] - 3
      [60 <= ... <= 64] - 4
      [>= 65] - 5
15 Further running credits (Categorical)
      at other banks - 1
      at department store or mail order house - 2
      no further running credits - 3
16 Type of apartment (Categorical)
      rented - 1
      owned - 2
      free - 3
17 Number of previous credits at this bank, including the running one (Categorical)
      one - 1
      two or three - 2
      four or five - 3
      six and above - 4
18 Occupation (Categorical)
      unemployed / unskilled with no permanent residence - 1
      unskilled with permanent residence - 2
      skilled worker / skilled employee / minor civil servant - 3
      executive / self-employed / higher civil servant - 4
19 Number of persons entitled to maintenance (Numerical)
      3 and more - 1
      0 to 2 - 2
20 Telephone (Categorical)
      no - 1
      yes - 2
21 Foreign worker (Categorical)
      yes - 1
      no - 2
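
The binned numerical attributes above (amount of credit, age) can be coded with a simple lookup. The following is a minimal sketch, not the preparation step actually used for this deck; the helper name `to_code` and the two bound lists are illustrative, with the bin edges copied from the attribute table.

```python
from bisect import bisect_left

# Inclusive upper bounds of each bin, copied from the attribute table.
AMOUNT_BOUNDS = [1500, 4500, 7500, 10500, 13500, 16500]  # DM, codes 1..7
AGE_BOUNDS = [25, 39, 59, 64]                            # years, codes 1..5


def to_code(value, upper_bounds):
    """Return the 1-based bin code for value (hypothetical helper).

    bisect_left finds the first bound that is >= value, so a value equal
    to a bound stays in the lower bin, matching the '<=' edges in the table.
    """
    return bisect_left(upper_bounds, value) + 1


if __name__ == "__main__":
    for amount in (1500, 1501, 20000):
        print("amount", amount, "-> code", to_code(amount, AMOUNT_BOUNDS))
    for age in (25, 26, 65):
        print("age", age, "-> code", to_code(age, AGE_BOUNDS))
```

Keeping the bin edges in one list per attribute means the coding rule can be checked against the table at a glance.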

From the data:
• The population is distributed in a 70:30 proportion
risk-wise (good : bad).
• We have 4 numeric and 16 categorical features.
• A few non-influencing variables may not contribute
to decision making.
• To find the most influential variables, we adopted
the WOE-IV technique.

WEIGHT OF EVIDENCE - INFORMATION VALUE (WOE - IV)

WOE & IV are simple yet powerful techniques to perform
variable transformation and selection.
They are widely used in credit scoring to measure the
separation of good vs bad customers.
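
In symbols, a sketch of the usual definitions, assuming each attribute is split into groups i, and writing DG_i and DB_i for the shares of all good and all bad loans that fall in group i (the same DG / DB notation as the computation table in this deck):

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{DG_i}{DB_i}\right),
\qquad
\mathrm{IV} = \sum_i \left(DG_i - DB_i\right)\,\mathrm{WOE}_i
```

A positive WOE means the group holds a larger share of the good loans than of the bad loans. Every term of the IV sum is non-negative because (DG_i - DB_i) and WOE_i always have the same sign.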

COMPUTATION & INTERPRETATION

Age Group  Total Loans  Bad Loans  Good Loans  % Bad Loans  Group  DB     DG     WOE     DG - DB  (DG - DB) * WOE
21 - 30    4821         206        4615        4.3%         G1     0.135  0.078  -0.553  -0.057   0.0318
30 - 36    10266        357        9909        3.5%         G2     0.235  0.167  -0.339  -0.067   0.0228
36 - 48    32926        776        32150       2.4%         G3     0.510  0.542   0.062   0.032   0.0020
48 - 60    12788        183        12605       1.4%         G4     0.120  0.213   0.570   0.092   0.0527
Total      60801        1522       59279                                  Information Value -->   0.1093

DB = Distribution Bad (group's share of all bad loans)
DG = Distribution Good (group's share of all good loans)
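
The computation in the table can be reproduced from the loan counts alone. This is a minimal sketch, assuming WOE = ln(DG / DB) as the table's values imply; the function name `woe_iv` and the row layout are illustrative.

```python
from math import log

# (age group, number of bad loans, number of good loans), copied from the table.
ROWS = [
    ("21 - 30", 206, 4615),
    ("30 - 36", 357, 9909),
    ("36 - 48", 776, 32150),
    ("48 - 60", 183, 12605),
]


def woe_iv(rows):
    """Return (per-group details, information value) for a list of groups."""
    total_bad = sum(bad for _, bad, _ in rows)
    total_good = sum(good for _, _, good in rows)
    details = []
    iv = 0.0
    for name, bad, good in rows:
        db = bad / total_bad      # Distribution Bad
        dg = good / total_good    # Distribution Good
        woe = log(dg / db)        # natural logarithm
        contribution = (dg - db) * woe
        iv += contribution
        details.append((name, db, dg, woe, contribution))
    return details, iv


if __name__ == "__main__":
    details, iv = woe_iv(ROWS)
    for name, db, dg, woe, contribution in details:
        print(f"{name}: DB={db:.3f} DG={dg:.3f} WOE={woe:.3f} IV part={contribution:.4f}")
    print(f"Information Value: {iv:.4f}")
```

Note that a group with zero good or zero bad loans would make the logarithm undefined; this sketch does not handle that case.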

Age
Higher the age, higher the credibility.
But above sixty years, i.e., after retirement, the
credibility is reduced.
IV : 0.093
Weak predictive Power
Sex & Marital Status
Females have good credibility.
Among males, married men have high credibility.
IV : 0.045

Balance in Account
Higher the balance in the account, higher the
probability of falling into good risk.
IV :
Savings Account: 0.196 - Medium predictive Power
Current Account: 0.666 - Suspicious predictive power /
too good to rely on
Predictive power: Current Account (CA) > Savings Account (SB)

Duration In Months
Lower the duration, lower the bad risk.
IV : 0.166
Medium predictive Power
Amount of credit
Lower the amount, lower the bad risk.
However, amounts <= 1500 also show a slight
increase in bad risk.
IV : 0.165
Medium predictive Power

PURPOSE OF CREDIT
If the purpose of the loan is to create an asset, good
risk should be high, whereas if the purpose is an
expenditure, bad risk should be high.
But vacation shows high good risk.
On further observation, the number of loans given for
the purpose of vacation is just 9, not even 1% (0.9%).
Hence ignored.
IV : 0.166
PURPOSE        0    1    2    3    4    5    6    8    9    10
NOT CREDIBLE   89   17   58   62   4    8    22   1    34   5
CREDIBLE       145  86   123  218  8    14   28   8    63   7
Grand Total    234  103  181  280  12   22   50   9    97   12
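
The share of each purpose and its bad rate follow directly from the counts in the table. A minimal sketch, with the counts copied from the table and the `summary` layout chosen here for illustration:

```python
# Counts per purpose code, copied from the table above.
PURPOSES = [0, 1, 2, 3, 4, 5, 6, 8, 9, 10]
NOT_CREDIBLE = [89, 17, 58, 62, 4, 8, 22, 1, 34, 5]
CREDIBLE = [145, 86, 123, 218, 8, 14, 28, 8, 63, 7]

grand_total = sum(NOT_CREDIBLE) + sum(CREDIBLE)

summary = {}
for code, bad, good in zip(PURPOSES, NOT_CREDIBLE, CREDIBLE):
    loans = bad + good
    summary[code] = {
        "loans": loans,
        "share_pct": 100 * loans / grand_total,  # share of all loans
        "bad_rate_pct": 100 * bad / loans,       # not-credible share within the purpose
    }

if __name__ == "__main__":
    for code, row in summary.items():
        print(f"purpose {code:>2}: {row['loans']:>3} loans, "
              f"{row['share_pct']:.1f}% of total, bad rate {row['bad_rate_pct']:.1f}%")
```

Purposes with very few loans, such as code 8, give bad rates that rest on a handful of cases, which is the reason given above for ignoring the vacation result.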

Length Of Current Employment
Higher the number of years of employment, higher the
credibility.
IV : 0.086
Most Valuable Available Asset
People with no assets have a high probability of
falling into the credible category.
IV : 0.113

Payment Of Previous Credits
Bad risk is observed in people who are hesitant
to pay previous credits.
IV : 0.293
Instalment Per Cent
Bad risk is observed in people whose instalment
is a lower % of income, which is contrary to
expectation, though the pattern almost resembles
the population.
IV : 0.026

No Of Credits
Higher the number of credits availed, higher the
credibility, but not beyond 6 credit facilities.
IV : 0.013
Not useful for prediction
Concurrent Credit
People with no further running credits have high
credibility.
IV : 0.085

Guarantor / Debtor
If the loan is secured by a guarantor, it shows high
credibility.
IV : 0.032
Foreign Worker
People who work abroad are given high credibility.
IV : 0.087
Housing
People living in rented housing show high credibility.
IV : 0.085
[Chart: housing shares (rented / owned / free): 17.9% / 71.4% / 10.7%]
[Chart: foreign worker shares (yes / no): 96.3% / 3.7%]

Non-influencing variables: they simply reproduce the
population distribution of 70:30.

IV VALUES:

Further analysis…!
ATTRIBUTE                          IV       INTERPRETATION
Current Account Balance            0.666    Suspicious Predictive Power
Payment Status Of Previous Credit  0.293    Medium predictive Power
Value Savings/Stocks               0.196    Medium predictive Power
Purpose                            0.166    Medium predictive Power
Duration Of Credit (Month)         0.165    Medium predictive Power
Credit Amount                      0.119    Medium predictive Power
Most Valuable Available Asset      0.113    Medium predictive Power
Age                                0.093    Weak predictive Power
Foreign Worker                     0.087    Weak predictive Power
Length Of Current Employment       0.086    Weak predictive Power
Housing                            0.085    Weak predictive Power
Concurrent Credit                  0.058    Weak predictive Power
Sex & Marital Status               0.045    Weak predictive Power
Guarantor / Debtor                 0.032    Weak predictive Power
Instalment Per Cent                0.026    Weak predictive Power
No Of Credits                      0.013    Not useful for prediction
Telephone                          0.01     Not useful for prediction
Occupation                         0.009    Not useful for prediction
Duration In Current House          0.004    Not useful for prediction
Dependents                         0.00004  Not useful for prediction
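
The interpretation column is consistent with a commonly quoted rule of thumb for IV. The sketch below assumes these cut-offs: below 0.02 not useful, 0.02 to 0.1 weak, 0.1 to 0.3 medium, 0.3 to 0.5 strong, above 0.5 suspicious. The deck does not state its thresholds, so the cut-offs and the function name `interpret_iv` are assumptions.

```python
from collections import defaultdict

# IV values copied from the table above.
IV_TABLE = {
    "Current Account Balance": 0.666,
    "Payment Status Of Previous Credit": 0.293,
    "Value Savings/Stocks": 0.196,
    "Purpose": 0.166,
    "Duration Of Credit (Month)": 0.165,
    "Credit Amount": 0.119,
    "Most Valuable Available Asset": 0.113,
    "Age": 0.093,
    "Foreign Worker": 0.087,
    "Length Of Current Employment": 0.086,
    "Housing": 0.085,
    "Concurrent Credit": 0.058,
    "Sex & Marital Status": 0.045,
    "Guarantor / Debtor": 0.032,
    "Instalment Per Cent": 0.026,
    "No Of Credits": 0.013,
    "Telephone": 0.01,
    "Occupation": 0.009,
    "Duration In Current House": 0.004,
    "Dependents": 0.00004,
}


def interpret_iv(iv):
    """Map an IV to a label using assumed rule-of-thumb cut-offs."""
    if iv < 0.02:
        return "not useful"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"


# Group attribute names by their label.
by_label = defaultdict(list)
for attribute, iv in IV_TABLE.items():
    by_label[interpret_iv(iv)].append(attribute)

if __name__ == "__main__":
    for label, attributes in by_label.items():
        print(f"{label}: {', '.join(attributes)}")
```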

CHOOSING MODEL
• When a customer applies for a loan, the bank accepts or rejects the
application based on the predicted risk (probability of default) for
the application.
• Since this is an objective segmentation, we need a target/dependent
variable. In this case it is whether a customer is a bad or good risk
for the loan.
• In an objective segmentation problem, the aim is to find conditions
that isolate segments whose members are very similar in target
variable value.
• The decision tree is one of the most commonly used objective
segmentation techniques.
• Based on the WOE - IV we have chosen the variables with good
predictive power for building a decision tree.
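
To make "segments that are very similar in target variable value" concrete, here is a minimal sketch using Gini impurity, one common split criterion for decision trees. The deck does not say which criterion was used, so Gini is an assumption. The example split groups purpose codes 1, 3 and 8 against all others, using counts from the purpose table; that grouping is hypothetical and chosen only for illustration.

```python
def gini(good, bad):
    """Gini impurity of a segment with the given good/bad counts."""
    n = good + bad
    p_good = good / n
    p_bad = bad / n
    return 1.0 - p_good ** 2 - p_bad ** 2


def weighted_gini(segments):
    """Size-weighted average impurity over (good, bad) segments."""
    total = sum(good + bad for good, bad in segments)
    return sum((good + bad) / total * gini(good, bad) for good, bad in segments)


# Whole population: 700 credible, 300 not credible (the 70:30 split).
root_impurity = gini(700, 300)

# Hypothetical split: purpose in {1, 3, 8} versus every other purpose.
split_impurity = weighted_gini([(312, 80), (388, 220)])

if __name__ == "__main__":
    print(f"root impurity:  {root_impurity:.4f}")
    print(f"split impurity: {split_impurity:.4f}")
```

A split is useful when the weighted impurity of the resulting segments is lower than the impurity of the parent; a tree-building algorithm searches for the condition that lowers it the most.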

DECISION TREE:
Interpretation:
• Train-test split: 70:30
• Class 1: credible
• Class 0: not credible
• Depth: 3
• Accuracy: 0.76
• Precision: 0.77
• Sensitivity: 0.92
• Specificity: 0.35
• F1 score: 0.84

Interpretation:
• Train-test split: 70:30
• Class 1: credible
• Class 0: not credible
• Depth: 4
• Accuracy: 0.74
• Precision: 0.77
• Sensitivity: 0.89
• Specificity: 0.37
• F1 score: 0.83
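
The reported metrics are all functions of the confusion matrix on the test set. The sketch below shows the standard definitions with class 1 (credible) as the positive class. The confusion-matrix counts in it are hypothetical and are not the deck's actual results; only the precision and sensitivity values passed to `f1_from` come from the slides.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (class 1 = positive)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall of the credible class
    specificity = tn / (tn + fp)  # recall of the not-credible class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def f1_from(precision, sensitivity):
    """F1 score as the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


if __name__ == "__main__":
    # Consistency check on the reported figures for depth 3 and depth 4.
    print(round(f1_from(0.77, 0.92), 2))
    print(round(f1_from(0.77, 0.89), 2))

    # Hypothetical counts for a 300-row test set, for illustration only.
    for name, value in classification_metrics(tp=190, fp=55, tn=35, fn=20).items():
        print(f"{name}: {value:.2f}")
```

High sensitivity combined with low specificity, as reported for both trees, means most credible applicants are recognised while many non-credible applicants are also classified as credible.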

FURTHER ANALYSIS TO BE CONTD..
THANK YOU…!
Queries..?
