End to-end machine learning project for beginners
Problem Statement:
Your client is an insurance company, and they need your help in building a model to predict whether a policyholder (customer) will
pay the next premium on time or not.
By looking at the problem statement we can understand that this is a classification problem.
Hypothesis Generation
1. Clients with high income have a higher chance of paying the next premium.
2. Clients with a high default rate have a higher chance of not paying the next premium.
3. Clients with low income have a higher chance of not paying the next premium.
4. Clients with medium income have a higher chance of not paying if the premium cost is high.
5. Older clients have a higher chance of not paying the premium.
In [2]: # Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]: # Print the current working directory
import os
os.getcwd()
Data Extraction
In [6]: # Load the data sets
Test = pd.read_csv("test.csv")
Train = pd.read_csv("train.csv")
Exploratory Data Analysis
Steps in EDA:
1. Variable Identification
2. Univariate Analysis for Continuous Variables
3. Univariate Analysis for Categorical Variables
4. Bivariate Analysis for both Continuous and Categorical Variables
5. Treating Missing Values
6. Outlier Treatment
7. Variable Transformation
In [12]: # Variable Description
# id - Unique ID of the policy
# perc_premium_paid_by_cash_credit - Percentage of premium amount paid by cash or credit card
# age_in_days - Age of the policyholder in days
# Income - Monthly income of the policyholder
# Count_3-6_months_late - Number of premiums late by 3 to 6 months
# Count_6-12_months_late - Number of premiums late by 6 to 12 months
# Count_more_than_12_months_late - Number of premiums late by more than 12 months
# application_underwriting_score - No application with a score under 90 is insured
# no_of_premiums_paid - Total number of premiums paid on time so far
# sourcing_channel - Sourcing channel of the application
# residence_area_type - Residence area type (Urban/Rural)
In [14]: # Generally, int and float dtypes are continuous variables, but sometimes integer variables are categorical in nature
Train["Count_3-6_months_late"].value_counts()
In [18]: # The age_in_days variable is in days; let's transform it to years
Train['age_in_days'] = Train["age_in_days"] / 365
Note:
1. perc_premium_paid_by_cash_credit has no outliers.
2. The Age distribution is roughly normal; values above 90 are outliers.
3. application_underwriting_score is left-skewed; scores below 98 are outliers.
Univariate Analysis for Categorical Variables
Note
1. (Count_3-6_months_late, Count_6-12_months_late, Count_more_than_12_months_late) are 0 most of the time; people very rarely miss payments for 3-12 months.
2. no_of_premiums_paid: the most common count is 8 payments, and the frequency keeps decreasing from there.
3. sourcing_channel: 50% of the clients came from channel A.
4. residence_area_type: nearly 60% of clients are from urban areas.
5. Less than 10% of clients are defaulters.
In [41]: # Categorical - Continuous Bivariate Analysis
Train.groupby("Count_3-6_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have high income and a high default rate
# Some clients have low income and a high default rate
In [42]: Train.groupby("Count_6-12_months_late")['Income'].mean().plot.bar()
# No clear trend, but some clients have high income and a high default rate
# Some clients have low income and a high default rate
In [43]: Train.groupby("Count_more_than_12_months_late")['Income'].mean().plot.bar()
# This attribute shows a clearer trend: the higher the income, the higher the default rate
In [44]: Train.groupby("sourcing_channel")['Income'].mean().plot.bar()
# Sourcing channel E has high income
In [45]: Train.groupby("residence_area_type")['Income'].mean().plot.bar()
# Incomes of rural and urban clients are almost the same
In [46]: Train.groupby("target")['Income'].mean().plot.bar()
# Income less than 175k has a higher chance of default
In [47]: Train.groupby("target")['Age'].mean().plot.bar()
# Age less than 50 has a higher chance of default
In [48]: fig = plt.figure(figsize=(18, 7))
Train.groupby("no_of_premiums_paid")['Age'].mean().plot.bar()
# As age increases, the number of premiums paid increases
Categorical - Categorical Bivariate Analysis
In [49]: # Create 2-way tables
pd.crosstab(Train['sourcing_channel'], Train["target"])
# From this we can see that sourcing channel 'A' has low income and a high chance of default
# Percentage-wise, channel B has a higher chance of default overall
In [50]: pd.crosstab(Train['residence_area_type'], Train["target"])
# Rural clients have a higher chance of default
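The raw counts above make percentage comparisons awkward. A minimal sketch (not in the original notebook) normalizes each row so the claim about channel B can be read off directly:
In [ ]: # Sketch: row-normalized 2-way table; each row sums to 1
pd.crosstab(Train['sourcing_channel'], Train["target"], normalize='index')
The same normalize='index' argument applies to the residence_area_type table.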
Missing Values Treatment
In [51]: Train.isnull().sum()
In [57]: Train.isnull().sum()
In [58]: # Drop unwanted columns
Train = Train.drop(['no_of_premiums_paid'], axis=1)
In [59]: Test = Test.drop(['no_of_premiums_paid'], axis=1)
Outlier Treatment
In [60]: # Replace any age above 90 with the mean
import numpy as np
Train.loc[Train["Age"] > 90, 'Age'] = np.mean(Train["Age"])
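Mean replacement above a hard threshold is one option; a common alternative is IQR-based capping. A minimal sketch, assuming we want to winsorize Age into a hypothetical new column rather than overwrite it (this is not what the notebook does):
In [ ]: # Sketch: cap Age at the Tukey fences instead of replacing with the mean
q1, q3 = Train["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
Train["Age_capped"] = Train["Age"].clip(lower, upper)  # hypothetical column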
Data Transformation
In [62]: # We have to convert categorical variables to numbers; editing manually takes a lot of time, so we will use the LabelEncoder function
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
Train["sourcing_channel"] = number.fit_transform(Train["sourcing_channel"].astype('str'))
Test["sourcing_channel"] = number.fit_transform(Test["sourcing_channel"].astype("str"))
Model Building
In [65]: # Drop variables weakly correlated with the target in both the train and test data sets
x_train = Train.drop(['target', 'Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
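Since the competition Test set has no labels, it is worth holding out part of Train for validation before fitting anything. A minimal sketch (not in the original notebook), using Train['target'] directly:
In [ ]: # Sketch: carve out a stratified validation set from the training data
from sklearn.model_selection import train_test_split
x_tr, x_val, y_tr, y_val = train_test_split(
    x_train, Train['target'], test_size=0.2, random_state=42, stratify=Train['target'])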
Out[3]: 'C:\\Users\\dell'
In [4]: # Change the working directory
os.chdir('C:\\Users\\dell\\Downloads')
In [5]: os.getcwd()
Out[5]: 'C:\\Users\\dell\\Downloads'
In [7]: # Check whether the data loaded correctly
Test.head()
Out[7]:
   id     perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0  649    0.001                             27384        51150   0.0                    0.0                     ...
1  81136  0.124                             23735        285140  0.0                    0.0                     ...
2  70762  1.000                             17170        186030  0.0                    0.0                     ...
3  53935  0.198                             16068        123540  0.0                    0.0                     ...
4  15476  0.041                             10591        200020  1.0                    0.0                     ...
In [8]: Train.head()
Out[8]:
   id      perc_premium_paid_by_cash_credit  age_in_days  Income  Count_3-6_months_late  Count_6-12_months_late  ...
0  110936  0.429                             12058        355060  0.0                    0.0                     ...
1  41492   0.010                             21546        315150  0.0                    0.0                     ...
2  31300   0.917                             17531        84140   2.0                    3.0                     ...
3  19415   0.049                             15341        250510  0.0                    0.0                     ...
4  99379   0.052                             31400        198680  0.0                    0.0                     ...
In [9]: Test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34224 entries, 0 to 34223
Data columns (total 11 columns):
id                                  34224 non-null int64
perc_premium_paid_by_cash_credit    34224 non-null float64
age_in_days                         34224 non-null int64
Income                              34224 non-null int64
Count_3-6_months_late               34193 non-null float64
Count_6-12_months_late              34193 non-null float64
Count_more_than_12_months_late      34193 non-null float64
application_underwriting_score      32901 non-null float64
no_of_premiums_paid                 34224 non-null int64
sourcing_channel                    34224 non-null object
residence_area_type                 34224 non-null object
dtypes: float64(5), int64(4), object(2)
memory usage: 2.9+ MB
In [10]: Train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 13 columns):
id                                  79853 non-null int64
perc_premium_paid_by_cash_credit    79853 non-null float64
age_in_days                         79853 non-null int64
Income                              79853 non-null int64
Count_3-6_months_late               79756 non-null float64
Count_6-12_months_late              79756 non-null float64
Count_more_than_12_months_late      79756 non-null float64
application_underwriting_score      76879 non-null float64
no_of_premiums_paid                 79853 non-null int64
sourcing_channel                    79853 non-null object
residence_area_type                 79853 non-null object
premium                             79853 non-null int64
target                              79853 non-null int64
dtypes: float64(5), int64(6), object(2)
memory usage: 7.9+ MB
If you observe carefully, the Test data set has only 11 columns whereas the Train data set has 13. We will remove the premium
column from the Train data set.
In [11]: Train = Train.drop(['premium'], axis=1)
Variable Identification
In [13]: # Identify continuous and categorical variables
Train.dtypes
Out[13]: id                                  int64
perc_premium_paid_by_cash_credit    float64
age_in_days                         int64
Income                              int64
Count_3-6_months_late               float64
Count_6-12_months_late              float64
Count_more_than_12_months_late      float64
application_underwriting_score      float64
no_of_premiums_paid                 int64
sourcing_channel                    object
residence_area_type                 object
target                              int64
dtype: object
From the above output we notice that sourcing_channel and residence_area_type are categorical variables.
Out[14]: 0.0 66801
1.0 8826
2.0 2519
3.0 954
4.0 374
5.0 168
6.0 68
7.0 23
8.0 15
9.0 4
11.0 1
12.0 1
13.0 1
10.0 1
Name: Count_3-6_months_late, dtype: int64
In [15]: Train["Count_6-12_months_late"].value_counts()
Out[15]: 0.0 75831
1.0 2680
2.0 693
3.0 317
4.0 130
5.0 46
6.0 26
7.0 11
8.0 5
10.0 4
9.0 4
14.0 2
11.0 2
13.0 2
17.0 1
12.0 1
15.0 1
Name: Count_6-12_months_late, dtype: int64
In [16]: Train["Count_more_than_12_months_late"].value_counts()
Out[16]: 0.0 76038
1.0 2996
2.0 498
3.0 151
4.0 48
5.0 13
6.0 6
7.0 3
8.0 2
11.0 1
Name: Count_more_than_12_months_late, dtype: int64
In [17]: Train["no_of_premiums_paid"].value_counts()
Out[17]: 8 7184
9 7158
10 6873
7 6623
11 6395
6 5635
12 5407
13 4752
5 4215
14 3988
15 3264
4 2907
16 2678
17 2148
18 1799
3 1746
19 1355
20 1134
21 838
2 726
22 713
23 503
24 386
25 305
26 241
27 186
28 152
29 119
30 91
31 61
32 51
33 43
34 38
35 31
36 23
37 14
38 8
42 7
40 6
41 6
39 5
47 5
44 4
43 3
45 3
56 3
48 3
50 3
51 3
58 2
52 2
53 2
54 2
59 1
55 1
49 1
60 1
Name: no_of_premiums_paid, dtype: int64
In this section we noticed that the variables below are categorical in nature:
Count_3-6_months_late
Count_6-12_months_late
Count_more_than_12_months_late
application_underwriting_score
no_of_premiums_paid
sourcing_channel
residence_area_type
premium
target
In [19]: Test['age_in_days'] = Test["age_in_days"] / 365
In [20]: # Now rename the 'age_in_days' variable to 'Age'
Train = Train.rename(columns={"age_in_days": "Age"})
In [21]: Test = Test.rename(columns={"age_in_days": "Age"})
Univariate Analysis of Continuous Variables
In [22]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score"]].describe()
Out[22]:
       perc_premium_paid_by_cash_credit  Age           Income        application_underwriting_score
count  79853.000000                      79853.000000  7.985300e+04  76879.000000
mean   0.314288                          51.634786     2.088472e+05  99.067291
std    0.334915                          14.270463     4.965826e+05  0.739799
min    0.000000                          21.013699     2.403000e+04  91.900000
25%    0.034000                          41.024658     1.080100e+05  98.810000
50%    0.167000                          51.027397     1.665600e+05  99.210000
75%    0.538000                          62.016438     2.520900e+05  99.540000
max    1.000000                          103.019178    9.026260e+07  99.890000
In [23]: Train['perc_premium_paid_by_cash_credit'].plot.hist()
Out[23]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc24e2550>
In [24]: Train['perc_premium_paid_by_cash_credit'].plot.box()
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc220d780>
In [25]: Train['Age'].plot.hist()
Out[25]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc228bb00>
In [26]: Train['Age'].plot.box()
Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc230e278>
In [27]: Train['application_underwriting_score'].plot.hist()
Out[27]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc23bcba8>
In [28]: Train['application_underwriting_score'].plot.box()
Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2432c18>
In [29]: # Create frequency tables and bar plots for the categorical variables
Train['Count_3-6_months_late'].value_counts().plot.bar()
Out[29]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2490978>
In [30]: Train['Count_6-12_months_late'].value_counts().plot.bar()
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2a2d940>
In [31]: Train['Count_more_than_12_months_late'].value_counts().plot.bar()
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2acc320>
In [32]: Train['application_underwriting_score'].value_counts().plot.hist()
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2caab70>
In [33]: fig = plt.figure(figsize=(12, 7))
Train['no_of_premiums_paid'].value_counts().plot.bar()
Out[33]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2ca04e0>
In [34]: Train['sourcing_channel'].value_counts().plot.bar()
Out[34]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc38047f0>
In [35]: Train['residence_area_type'].value_counts().plot.bar()
Out[35]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc382f128>
In [36]: (Train['residence_area_type'].value_counts() / len(Train['residence_area_type'])).plot.bar()
Out[36]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc1ebee10>
In [37]: Train['target'].value_counts().plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc2dc1f60>
In [38]: (Train['target'].value_counts() / len(Train['target'])).plot.bar()
Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc36f9c18>
Continuous - Continuous Bivariate Analysis
In [39]: Train[["perc_premium_paid_by_cash_credit", "Age", "Income", "application_underwriting_score", "target"]].corr()
Out[39]:
                                  perc_premium_paid_by_cash_credit  Age        Income     application_underwriting_score  ...
perc_premium_paid_by_cash_credit  1.000000                          -0.259131  -0.031868  -0.142670                       ...
Age                               -0.259131                         1.000000   0.029308   0.049888                        ...
Income                            -0.031868                         0.029308   1.000000   0.085746                        ...
application_underwriting_score    -0.142670                         0.049888   0.085746   1.000000                        ...
target                            -0.240980                         0.095103   0.016541   0.068715                        ...
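A quick visual check of the same matrix can help. A minimal sketch using only matplotlib (not part of the original notebook):
In [ ]: # Sketch: render the correlation matrix as a heatmap
cols = ["perc_premium_paid_by_cash_credit", "Age", "Income",
        "application_underwriting_score", "target"]
corr = Train[cols].corr()
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(cols))); ax.set_xticklabels(cols, rotation=90)
ax.set_yticks(range(len(cols))); ax.set_yticklabels(cols)
fig.colorbar(im)
plt.show()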
In [40]: Train.head()
Out[40]:
   id      perc_premium_paid_by_cash_credit  Age        Income  Count_3-6_months_late  Count_6-12_months_late  ...
0  110936  0.429                             33.035616  355060  0.0                    0.0                     ...
1  41492   0.010                             59.030137  315150  0.0                    0.0                     ...
2  31300   0.917                             48.030137  84140   2.0                    3.0                     ...
3  19415   0.049                             42.030137  250510  0.0                    0.0                     ...
4  99379   0.052                             86.027397  198680  0.0                    0.0                     ...
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc37770b8>
Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc399f0b8>
Out[43]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3a41588>
Out[44]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3ac50b8>
Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3b20748>
Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3ac5c50>
Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3be6080>
Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc3c41a20>
Out[49]:
target                  0      1
sourcing_channel
A                    2349  40785
B                    1066  15446
C                     903  11136
D                     634   6925
E                      46    563
Out[50]:
target                   0      1
residence_area_type
Rural                 1998  29672
Urban                 3000  45183
Out[51]: id                         0
perc_premium_paid_by_cash_credit    0
Age                                 0
Income                              0
Count_3-6_months_late               97
Count_6-12_months_late              97
Count_more_than_12_months_late      97
application_underwriting_score      2974
no_of_premiums_paid                 0
sourcing_channel                    0
residence_area_type                 0
target                              0
dtype: int64
In [52]: # Fill missing application_underwriting_score with 99 (approximately its mean)
Train['application_underwriting_score'].fillna(99, inplace=True)
In [53]: # Do the same in the Test set
Test['application_underwriting_score'].fillna(99, inplace=True)
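Hard-coding 99 works here because the mean is 99.07, but computing the fill value from the training data is more robust. A minimal sketch of that alternative (not what the notebook does):
In [ ]: # Sketch: fill with the actual training mean instead of a hard-coded 99
score_mean = Train['application_underwriting_score'].mean()
Train['application_underwriting_score'].fillna(score_mean, inplace=True)
Test['application_underwriting_score'].fillna(score_mean, inplace=True)  # reuse the Train mean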
In [54]: # Now drop the remaining null rows in the Train data set
Train = Train.dropna()
In [55]: # In the Test data set, replace remaining nulls with 0
Test.fillna(0, inplace=True)
In [56]: # Now check for null values in both test and train data sets
Test.isnull().sum()
Out[56]: id                         0
perc_premium_paid_by_cash_credit    0
Age                                 0
Income                              0
Count_3-6_months_late               0
Count_6-12_months_late              0
Count_more_than_12_months_late      0
application_underwriting_score      0
no_of_premiums_paid                 0
sourcing_channel                    0
residence_area_type                 0
dtype: int64
Out[57]: id                         0
perc_premium_paid_by_cash_credit    0
Age                                 0
Income                              0
Count_3-6_months_late               0
Count_6-12_months_late              0
Count_more_than_12_months_late      0
application_underwriting_score      0
no_of_premiums_paid                 0
sourcing_channel                    0
residence_area_type                 0
target                              0
dtype: int64
In [61]: np.power(Train['Income'], 1/5).plot.hist()
Out[61]: <matplotlib.axes._subplots.AxesSubplot at 0x25fc367d6d8>
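Note the cell above only plots the transformed values; nothing is stored. If the transform were to feed the model (an assumption on my part; the notebook drops Income from x_train later anyway), you would assign it back:
In [ ]: # Sketch: persist the fifth-root transform in a new column
Train['Income_transformed'] = np.power(Train['Income'], 1/5)  # hypothetical column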
In [63]: Train["residence_area_type"] = number.fit_transform(Train["residence_area_type"].astype('str'))
Test["residence_area_type"] = number.fit_transform(Test["residence_area_type"].astype("str"))
In [64]: Train.corr()
Out[64]:
                                  id         perc_premium_paid_by_cash_credit  Age        Income     Count_3-6_months_late  ...
id                                1.000000   -0.004772                         0.004306   -0.001816  -0.005660              ...
perc_premium_paid_by_cash_credit  -0.004772  1.000000                          -0.255676  -0.031341  0.214470               ...
Age                               0.004306   -0.255676                         1.000000   0.030214   -0.057878              ...
Income                            -0.001816  -0.031341                         0.030214   1.000000   -0.001403              ...
Count_3-6_months_late             -0.005660  0.214470                          -0.057878  -0.001403  1.000000               ...
Count_6-12_months_late            -0.002125  0.214951                          -0.072484  -0.017347  0.204228               ...
Count_more_than_12_months_late    0.003424   0.168125                          -0.059602  -0.012399  0.296085               ...
application_underwriting_score    -0.002084  -0.138657                         0.043666   0.062699   -0.081463              ...
sourcing_channel                  0.001364   0.082878                          -0.215420  0.059663   0.058662               ...
residence_area_type               0.001803   -0.002013                         0.000577   0.003470   0.001592               ...
target                            -0.005365  -0.237210                         0.093163   0.015911   -0.248900              ...
In [72]: Test_1 = Test.drop(['Age', 'Income', 'application_underwriting_score', 'residence_area_type', 'sourcing_channel'], axis=1)
In [67]: y_train = Train['target']
In [68]: import sklearn
In [69]: from sklearn.tree import DecisionTreeClassifier
In [70]: model_1 = DecisionTreeClassifier()
In [71]: # Train the model on the training data
model_1.fit(x_train, y_train)
Out[71]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
In [80]: # Create a new target column in the test set
Test["target"] = model_1.predict(Test_1)
In [75]: # Score of the model on the train data set
model_1.score(x_train, y_train)
Out[75]: 1.0
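A training score of 1.0 from an unconstrained decision tree is a classic sign of overfitting rather than of a good model. A minimal sketch of a more honest estimate via cross-validation (not part of the original notebook; max_depth=5 is an arbitrary regularization choice):
In [ ]: # Sketch: 5-fold cross-validated accuracy on the training data
from sklearn.model_selection import cross_val_score
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), x_train, y_train, cv=5)
print(scores.mean())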
In [81]: Test_2 = Test[["id", "target"]]
In [83]: Test_2.set_index('id').head()
Out[83]:
target
id
649 1
81136 1
70762 1
53935 1
15476 1