This is a complete end-to-end project for beginners on insurance/banking data: it predicts whether a client is going to pay next month's premium. The project also covers data pre-processing, including univariate analysis, bivariate analysis, outlier detection, and imputation strategies, and finishes with prediction.
1. Problem Statement:
Your client is an insurance company and they need your help in building a model to predict whether the policyholder (customer) will
pay the next premium on time or not.
From the problem statement we can see that this is a classification problem.
Hypothesis Generation
1. Clients with high income have a higher chance of paying the next premium
2. Clients with a high default rate have a higher chance of not paying the next premium
3. Clients with low income have a higher chance of not paying the next premium
4. Clients with medium income have a higher chance of not paying if the premium cost is high
5. Older clients have a higher chance of not paying the premium
In [2]: # Loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]: # Get the current working directory
import os
os.getcwd()
Data Extraction
In [6]: # Load the data sets
Test = pd.read_csv("test.csv")
Train = pd.read_csv("train.csv")
Exploratory Data Analysis
Steps in EDA:
1. Variable Identification
2. Univariate Analysis for Continuous Variables
3. Univariate Analysis for Categorical Variables
4. Bivariate Analysis for Continuous and Categorical Variables
5. Treating Missing Values
6. Outlier Treatment
7. Variable Transformation
In [12]: # Variable Description
# id - Unique ID of the policy
# perc_premium_paid_by_cash_credit - Percentage of premium amount paid by cash or credit card
# age_in_days - Age in days of policy holder
# Income - Monthly income of policy holder
# Count_3-6_months_late - No. of premiums late by 3 to 6 months
# Count_6-12_months_late - No. of premiums late by 6 to 12 months
# Count_more_than_12_months_late - No. of premiums late by more than 12 months
# application_underwriting_score - No applications under a score of 90 are insured
# no_of_premiums_paid - Total premiums paid on time till now
# sourcing_channel - Sourcing channel for application
# residence_area_type - Area type of residence (Urban/Rural)
In [14]: # Generally, int and float data types are continuous variables, but sometimes integer variables are categorical in nature
Train["Count_3-6_months_late"].value_counts()
In [18]: # The age_in_days variable is in days; let's transform it to years
Train['age_in_days'] = Train["age_in_days"]/365
Note:
1. perc_premium_paid_by_cash_credit has no outliers
2. The age distribution is roughly normal; values above 90 are outliers
3. application_underwriting_score is left-skewed; scores below 98 are outliers
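The outlier calls above can be made less ad hoc with the usual IQR rule. A minimal sketch on hypothetical ages (the values and cutoffs here are illustrative, not the notebook's actual data):

```python
import pandas as pd

# Hypothetical ages (in years); 103 is the kind of extreme value flagged above
ages = pd.Series([40, 45, 48, 50, 52, 55, 58, 103])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged; only the 103 falls outside here
outliers = ages[(ages < lower) | (ages > upper)]
```

The same bounds can then feed the capping step used later in the notebook.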
Univariate Analysis for Categorical Variables
Note
1. (Count_3-6_months_late, Count_6-12_months_late, Count_more_than_12_months_late) are 0 most of the time; very
rarely do people miss paying for 3-12 months
2. no_of_premiums_paid: the largest number of clients paid 8 times, and the trend keeps decreasing from there
3. sourcing_channel: 50% of the clients came from channel A
4. residence_area_type: nearly 60% of clients are from urban areas
5. Less than 10% of clients are defaulters
In [41]: # Categorical - Continuous Bivariate Analysis
Train.groupby("Count_3-6_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have progressively higher default rates
# Some clients have low income and a higher default rate
In [42]: Train.groupby("Count_6-12_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have progressively higher default rates
# Some clients have low income and a higher default rate
In [43]: Train.groupby("Count_more_than_12_months_late")['Income'].mean().plot.bar()
# This attribute shows a clearer trend: the higher the income, the higher the default rate
In [44]: Train.groupby("sourcing_channel")['Income'].mean().plot.bar()
# Sourcing channel E has the highest income
In [45]: Train.groupby("residence_area_type")['Income'].mean().plot.bar()
# Incomes of rural and urban clients are almost the same
In [46]: Train.groupby("target")['Income'].mean().plot.bar()
# Income less than 175k has a higher chance of default
In [47]: Train.groupby("target")['Age'].mean().plot.bar()
# Age less than 50 has a higher chance of default
In [48]: fig = plt.figure(figsize=(18, 7))
Train.groupby("no_of_premiums_paid")['Age'].mean().plot.bar()
# As age increases, the number of premiums paid increases
Categorical - Categorical Bivariate Analysis
In [49]: # Create 2-way tables
pd.crosstab(Train['sourcing_channel'], Train["target"])
# With this we can understand that sourcing channel 'A' has low income and a high chance of default
# Overall, percentage-wise, channel B has a higher chance of default
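The "percentage-wise" reading can be checked directly with a row-normalised crosstab rather than eyeballing raw counts. A small sketch on made-up data (the column names mirror the notebook's, the values do not):

```python
import pandas as pd

# Toy data: channel A has 3 clients, channel B has 2 (hypothetical values)
df = pd.DataFrame({
    "sourcing_channel": ["A", "A", "A", "B", "B"],
    "target":           [1,   1,   0,   1,   0],
})

# normalize="index" converts raw counts into per-channel proportions,
# so default rates are directly comparable across channels
rates = pd.crosstab(df["sourcing_channel"], df["target"], normalize="index")
```

Here channel B's default (target 0) rate is 0.5 versus A's 1/3, even though A has more defaulters in absolute terms.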
In [50]: pd.crosstab(Train['residence_area_type'], Train["target"])
# Rural clients have a higher chance of default
Missing Values Treatment
In [51]: Train.isnull().sum()
In [57]: Train.isnull().sum()
In [58]: # Drop unwanted columns
Train = Train.drop(['no_of_premiums_paid'], axis=1)
In [59]: Test = Test.drop(['no_of_premiums_paid'], axis = 1)
Outlier Treatment
In [60]: # Replace any age above 90 with the mean
import numpy as np
Train.loc[Train["Age"] > 90, 'Age'] = np.mean(Train["Age"])
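One subtlety in the cell above: `np.mean(Train["Age"])` is computed with the outliers still included, so the replacement value is slightly pulled upward by the very rows being overwritten. A sketch (on toy numbers, not the notebook's data) of computing the mean from the non-outlier rows first:

```python
import pandas as pd

# Toy ages; the two values above 90 are the "outliers" to be capped
df = pd.DataFrame({"Age": [30.0, 45.0, 60.0, 95.0, 102.0]})

# Mean of non-outlier ages only: (30 + 45 + 60) / 3 = 45.0
clean_mean = df.loc[df["Age"] <= 90, "Age"].mean()
df.loc[df["Age"] > 90, "Age"] = clean_mean
```

With only ~0.1% of rows affected the difference is small here, but the habit matters on dirtier data.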
Data Transformation
In [62]: # We have to convert categorical variables to numbers; editing manually takes a lot of time, so we will use the LabelEncoder function
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
Train["sourcing_channel"] = number.fit_transform(Train["sourcing_channel"].astype('str'))
Test["sourcing_channel"] = number.fit_transform(Test["sourcing_channel"].astype("str"))
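Note that calling `fit_transform` separately on Train and Test, as above, can silently assign different integer codes if either set is missing a channel. A safer pattern (a sketch with hypothetical channel labels) is to fit once on the training column and only transform the test column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_ch = pd.Series(["A", "B", "C", "A"])  # hypothetical channel labels
test_ch = pd.Series(["B", "A", "C"])

enc = LabelEncoder()
train_codes = enc.fit_transform(train_ch)  # learns the mapping A->0, B->1, C->2
test_codes = enc.transform(test_ch)        # reuses the same mapping
```

Be aware that `transform` raises an error on labels never seen during `fit`, which is usually the behaviour you want: it surfaces the mismatch instead of hiding it.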
Model Building
In [65]: # Drop variables uncorrelated with our target variable in both test and train data sets
x_train = Train.drop(['target', 'Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
Out[3]: 'C:\\Users\\dell'
In [4]: # To change the working directory
os.chdir('C:\\Users\\dell\\Downloads')
In [5]: os.getcwd()
Out[5]: 'C:\\Users\\dell\\Downloads'
In [7]: # Check whether the data loaded or not
Test.head()
Out[7]:
       id  perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0 649 0.001 27384 51150 0.0 0.0
1 81136 0.124 23735 285140 0.0 0.0
2 70762 1.000 17170 186030 0.0 0.0
3 53935 0.198 16068 123540 0.0 0.0
4 15476 0.041 10591 200020 1.0 0.0
In [8]: Train.head()
Out[8]:
       id  perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0 110936 0.429 12058 355060 0.0 0.0
1 41492 0.010 21546 315150 0.0 0.0
2 31300 0.917 17531 84140 2.0 3.0
3 19415 0.049 15341 250510 0.0 0.0
4 99379 0.052 31400 198680 0.0 0.0
In [9]: Test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34224 entries, 0 to 34223
Data columns (total 11 columns):
id 34224 non-null int64
perc_premium_paid_by_cash_credit 34224 non-null float64
age_in_days 34224 non-null int64
Income 34224 non-null int64
Count_3-6_months_late 34193 non-null float64
Count_6-12_months_late 34193 non-null float64
Count_more_than_12_months_late 34193 non-null float64
application_underwriting_score 32901 non-null float64
no_of_premiums_paid 34224 non-null int64
sourcing_channel 34224 non-null object
residence_area_type 34224 non-null object
dtypes: float64(5), int64(4), object(2)
memory usage: 2.9+ MB
In [10]: Train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 13 columns):
id 79853 non-null int64
perc_premium_paid_by_cash_credit 79853 non-null float64
age_in_days 79853 non-null int64
Income 79853 non-null int64
Count_3-6_months_late 79756 non-null float64
Count_6-12_months_late 79756 non-null float64
Count_more_than_12_months_late 79756 non-null float64
application_underwriting_score 76879 non-null float64
no_of_premiums_paid 79853 non-null int64
sourcing_channel 79853 non-null object
residence_area_type 79853 non-null object
premium 79853 non-null int64
target 79853 non-null int64
dtypes: float64(5), int64(6), object(2)
memory usage: 7.9+ MB
If you observe carefully, the Test data set has only 11 columns whereas the Train data set has 13 columns. We will remove the premium
column from the Train data set.
In [11]: Train = Train.drop(['premium'], axis=1)
Variable Identification
In [13]: # Identify continuous and categorical variables
Train.dtypes
Out[13]: id int64
perc_premium_paid_by_cash_credit float64
age_in_days int64
Income int64
Count_3-6_months_late float64
Count_6-12_months_late float64
Count_more_than_ 12_months_late float64
application_underwriting_score float64
no_of_premiums_paid int64
sourcing_channel object
residence_area_type object
target int64
dtype: object
From the output above we notice that sourcing_channel and residence_area_type are categorical variables
Out[14]: 0.0 66801
1.0 8826
2.0 2519
3.0 954
4.0 374
5.0 168
6.0 68
7.0 23
8.0 15
9.0 4
11.0 1
12.0 1
13.0 1
10.0 1
Name: Count_3-6_months_late, dtype: int64
In [15]: Train["Count_6-12_months_late"].value_counts()
Out[15]: 0.0 75831
1.0 2680
2.0 693
3.0 317
4.0 130
5.0 46
6.0 26
7.0 11
8.0 5
10.0 4
9.0 4
14.0 2
11.0 2
13.0 2
17.0 1
12.0 1
15.0 1
Name: Count_6-12_months_late, dtype: int64
In [16]: Train["Count_more_than_12_months_late"].value_counts()
Out[16]: 0.0 76038
1.0 2996
2.0 498
3.0 151
4.0 48
5.0 13
6.0 6
7.0 3
8.0 2
11.0 1
Name: Count_more_than_12_months_late, dtype: int64
In [17]: Train["no_of_premiums_paid"].value_counts()
Out[17]: 8 7184
9 7158
10 6873
7 6623
11 6395
6 5635
12 5407
13 4752
5 4215
14 3988
15 3264
4 2907
16 2678
17 2148
18 1799
3 1746
19 1355
20 1134
21 838
2 726
22 713
23 503
24 386
25 305
26 241
27 186
28 152
29 119
30 91
31 61
32 51
33 43
34 38
35 31
36 23
37 14
38 8
42 7
40 6
41 6
39 5
47 5
44 4
43 3
45 3
56 3
48 3
50 3
51 3
58 2
52 2
53 2
54 2
59 1
55 1
49 1
60 1
Name: no_of_premiums_paid, dtype: int64
In this section we noticed that the variables below are categorical in
nature:
Count_3-6_months_late
Count_6-12_months_late
Count_more_than_12_months_late
application_underwriting_score
no_of_premiums_paid
sourcing_channel
residence_area_type
premium
target
In [19]: Test['age_in_days'] = Test["age_in_days"]/365
In [20]: # Now rename the 'age_in_days' variable to 'Age'
Train = Train.rename(columns={"age_in_days": "Age"})
In [21]: Test = Test.rename(columns={"age_in_days": "Age"})
Univariate Analysis of Continuous Variables
In [22]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score"]].describe()
Out[22]:
       perc_premium_paid_by_cash_credit           Age        Income  application_underwriting_score
count                      79853.000000  79853.000000  7.985300e+04                    76879.000000
mean                           0.314288     51.634786  2.088472e+05                       99.067291
std                            0.334915     14.270463  4.965826e+05                        0.739799
min                            0.000000     21.013699  2.403000e+04                       91.900000
25%                            0.034000     41.024658  1.080100e+05                       98.810000
50%                            0.167000     51.027397  1.665600e+05                       99.210000
75%                            0.538000     62.016438  2.520900e+05                       99.540000
max                            1.000000    103.019178  9.026260e+07                       99.890000
In [23]: Train['perc_premium_paid_by_cash_credit'].plot.hist()
Out[23]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc24e2550>
In [24]: Train['perc_premium_paid_by_cash_credit'].plot.box()
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc220d780>
In [25]: Train['Age'].plot.hist()
Out[25]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc228bb00>
In [26]: Train['Age'].plot.box()
Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc230e278>
In [27]: Train['application_underwriting_score'].plot.hist()
Out[27]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc23bcba8>
In [28]: Train['application_underwriting_score'].plot.box()
Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2432c18>
In [29]: # Creating frequency tables and bar plots for the categorical variables
Train['Count_3-6_months_late'].value_counts().plot.bar()
Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2490978>
In [30]: Train['Count_6-12_months_late'].value_counts().plot.bar()
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2a2d940>
In [31]: Train['Count_more_than_12_months_late'].value_counts().plot.bar()
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2acc320>
In [32]: Train['application_underwriting_score'].value_counts().plot.hist()
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2caab70>
In [33]: fig = plt.figure(figsize=(12, 7))
Train['no_of_premiums_paid'].value_counts().plot.bar()
Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2ca04e0>
In [34]: Train['sourcing_channel'].value_counts().plot.bar()
Out[34]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc38047f0>
In [35]: Train['residence_area_type'].value_counts().plot.bar()
Out[35]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc382f128>
In [36]: (Train['residence_area_type'].value_counts() / len(Train['residence_area_type'])).plot.bar()
Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc1ebee10>
In [37]: Train['target'].value_counts().plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2dc1f60>
In [38]: (Train['target'].value_counts() / len(Train['target'])).plot.bar()
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc36f9c18>
Continuous - Continuous Bivariate Analysis
In [39]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score", "target"]].corr()
Out[39]:
                                  perc_premium_paid_by_cash_credit       Age    Income  application_underwriting_score
perc_premium_paid_by_cash_credit                          1.000000 -0.259131 -0.031868                       -0.142670
Age                                                      -0.259131  1.000000  0.029308                        0.049888
Income                                                   -0.031868  0.029308  1.000000                        0.085746
application_underwriting_score                           -0.142670  0.049888  0.085746                        1.000000
target                                                   -0.240980  0.095103  0.016541                        0.068715
In [40]: Train.head()
Out[40]:
       id  perc_premium_paid_by_cash_credit        Age  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0 110936 0.429 33.035616 355060 0.0 0.0
1 41492 0.010 59.030137 315150 0.0 0.0
2 31300 0.917 48.030137 84140 2.0 3.0
3 19415 0.049 42.030137 250510 0.0 0.0
4 99379 0.052 86.027397 198680 0.0 0.0
Out[49]:
target 0 1
sourcing_channel
A 2349 40785
B 1066 15446
C 903 11136
D 634 6925
E 46 563
Out[50]:
target 0 1
residence_area_type
Rural 1998 29672
Urban 3000 45183
Out[51]: id                                   0
         perc_premium_paid_by_cash_credit     0
         Age                                  0
         Income                               0
         Count_3-6_months_late               97
         Count_6-12_months_late              97
         Count_more_than_12_months_late      97
         application_underwriting_score    2974
         no_of_premiums_paid                  0
         sourcing_channel                     0
         residence_area_type                  0
         target                               0
         dtype: int64
In [52]: # Replace missing application_underwriting_score values with 99 (approximately the mean)
Train['application_underwriting_score'].fillna(99, inplace = True)
In [53]: # Replace in Test set too
Test['application_underwriting_score'].fillna(99, inplace = True)
In [54]: # Now drop null values in Train data set
Train = Train.dropna()
In [55]: # In the test data set, replace missing values with 0
Test.fillna(0, inplace=True)
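Note that train and test get different treatments here (rows dropped vs. zero-filled), which can shift the two distributions apart. A common alternative, sketched on toy series (hypothetical values), is to derive one fill value from the training data and reuse it for both sets:

```python
import numpy as np
import pandas as pd

# Hypothetical late-payment counts with missing entries
train_counts = pd.Series([0.0, 1.0, np.nan, 2.0])
test_counts = pd.Series([np.nan, 0.0])

fill = train_counts.median()              # statistic comes from train only
train_filled = train_counts.fillna(fill)  # same value applied to both sets
test_filled = test_counts.fillna(fill)
```

Using a train-derived statistic keeps the test set from seeing a convention (like 0) that never occurs in training.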
In [56]: # Now check for null values in both test and train data sets
Test.isnull().sum()
Out[56]: id                                  0
         perc_premium_paid_by_cash_credit    0
         Age                                 0
         Income                              0
         Count_3-6_months_late               0
         Count_6-12_months_late              0
         Count_more_than_12_months_late      0
         application_underwriting_score      0
         no_of_premiums_paid                 0
         sourcing_channel                    0
         residence_area_type                 0
         dtype: int64
Out[57]: id                                  0
         perc_premium_paid_by_cash_credit    0
         Age                                 0
         Income                              0
         Count_3-6_months_late               0
         Count_6-12_months_late              0
         Count_more_than_12_months_late      0
         application_underwriting_score      0
         no_of_premiums_paid                 0
         sourcing_channel                    0
         residence_area_type                 0
         target                              0
         dtype: int64
In [61]: np.power(Train['Income'], 1/5).plot.hist()
Out[61]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc367d6d8>
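The fifth-root above compresses Income's long right tail; a log transform is a common alternative with a similar effect. A quick numeric sketch (the income values are illustrative stand-ins for the column's min, median, and max):

```python
import numpy as np

incomes = np.array([24030.0, 166560.0, 90262600.0])

fifth_root = np.power(incomes, 1 / 5)
log_scaled = np.log10(incomes)

# Both shrink the huge spread between smallest and largest to a modest ratio
ratio_raw = incomes[-1] / incomes[0]     # thousands of times larger
ratio_root = fifth_root[-1] / fifth_root[0]
```

Either way, the transformed distribution is far less dominated by the extreme incomes.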
In [63]: Train["residence_area_type"] = number.fit_transform(Train["residence_area_type"].astype('str'))
Test["residence_area_type"] = number.fit_transform(Test["residence_area_type"].astype("str"))
In [64]: Train.corr()
Out[64]:
                                        id  perc_premium_paid_by_cash_credit       Age    Income  Count_3-6_months_late  ...
id 1.000000 -0.004772 0.004306 -0.001816 -0.005660
perc_premium_paid_by_cash_credit -0.004772 1.000000 -0.255676 -0.031341 0.214470
Age 0.004306 -0.255676 1.000000 0.030214 -0.057878
Income -0.001816 -0.031341 0.030214 1.000000 -0.001403
Count_3-6_months_late -0.005660 0.214470 -0.057878 -0.001403 1.000000
Count_6-12_months_late -0.002125 0.214951 -0.072484 -0.017347 0.204228
Count_more_than_12_months_late 0.003424 0.168125 -0.059602 -0.012399 0.296085
application_underwriting_score -0.002084 -0.138657 0.043666 0.062699 -0.081463
sourcing_channel 0.001364 0.082878 -0.215420 0.059663 0.058662
residence_area_type 0.001803 -0.002013 0.000577 0.003470 0.001592
target -0.005365 -0.237210 0.093163 0.015911 -0.248900
In [72]: Test_1 = Test.drop(['Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
In [67]: y_train = Train['target']
In [68]: import sklearn
In [69]: from sklearn.tree import DecisionTreeClassifier
In [70]: model_1 = DecisionTreeClassifier()
In [71]: # Training the model on the train data set
model_1.fit(x_train,y_train)
Out[71]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
In [80]: # Create a new variable target in test set
Test["target"] = model_1.predict(Test_1)
In [75]: # Score of our model on the train data set
model_1.score(x_train,y_train)
Out[75]: 1.0
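A training score of 1.0 is expected for an unpruned decision tree: it memorizes the training data, so this number says nothing about generalization. A sketch of checking with cross-validation, on synthetic stand-in data (the real x_train/y_train are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

model = DecisionTreeClassifier(random_state=0)
train_acc = model.fit(X, y).score(X, y)             # memorized training data
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # honest held-out estimate
```

The cross-validated accuracy is the number to report; limiting `max_depth` or `min_samples_leaf` would be the usual next step to reduce the gap.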
In [81]: Test_2 = Test[["id", "target"]]
In [83]: Test_2.set_index('id').head()
Out[83]:
target
id
649 1
81136 1
70762 1
53935 1
15476 1