Single-handedly worked on a data mining project to build a model using logistic regression to identify profitable segments for cross-selling personal loans. A data set of 30,000 records was mined to gain customer insights and 9 distinct variables were identified to target profitable segments. Overall accuracy of the model was 86.03%.
TABLE OF CONTENTS
Target segment identification for cross-selling personal loans
Business Objective
Exploratory data analysis
Data definitions
Data set characterizations
Descriptive Stats
Outlier Detection and Removal
Feature Engineering
Information Value test
Hypothesis Formulation and Validation
Test of Multicollinearity
Feature Selection
Building the statistical model
Creating the training and testing datasets
Independent Variable Transformation
Model Development: Logistic Regression
Model Performance Measures
Model Performance on Hold out Sample
Model Implementation / Deployment Strategy
TARGET SEGMENT IDENTIFICATION FOR CROSS-SELLING PERSONAL LOANS
Business Objective
To find profitable segments to target for cross-selling personal loans.
Exploratory data analysis
In this preliminary step of data analysis, we summarize the main
characteristics of the data set.
Data definitions
Below are the categorical variables with their descriptions.
Column Name         Description
CUST_ID             Customer ID - Unique ID
TARGET              Target Field - 1: Responder, 0: Non-Responder
GENDER              Gender
OCCUPATION          Occupation
AGE_BKT             Age Bucket
ACC_TYPE            Account Type - Saving / Current
FLG_HAS_CC          Has Credit Card - 1: Yes, 0: No
FLG_HAS_ANY_CHGS    Has any banking charges
FLG_HAS_NOMINEE     Has Nominee - 1: Yes, 0: No
FLG_HAS_OLD_LOAN    Has any earlier loan - 1: Yes, 0: No
ACC_OP_DATE         Account Open Date
Below are the continuous variables with their descriptions.
Column Name                 Description
AGE                         Age of the customer in years
BALANCE                     Average Monthly Balance
SCR                         Generic Marketing Score
HOLDING_PERIOD              Ability to hold money in the account (Range 0 - 31)
LEN_OF_RLTN_IN_MNTH         Length of Relationship in Months
NO_OF_L_CR_TXNS             No. of Credit Transactions
NO_OF_L_DR_TXNS             No. of Debit Transactions
TOT_NO_OF_L_TXNS            Total No. of Transactions
NO_OF_BR_CSH_WDL_DR_TXNS    No. of Branch Cash Withdrawal Transactions
NO_OF_ATM_DR_TXNS           No. of ATM Debit Transactions
NO_OF_NET_DR_TXNS           No. of Net Debit Transactions
NO_OF_MOB_DR_TXNS           No. of Mobile Banking Debit Transactions
NO_OF_CHQ_DR_TXNS           No. of Cheque Debit Transactions
AMT_ATM_DR                  Amount Withdrawn from ATM
AMT_BR_CSH_WDL_DR           Amount cash withdrawn from Branch
AMT_CHQ_DR                  Amount debited by Cheque Transactions
AMT_NET_DR                  Amount debited by Net Transactions
AMT_MOB_DR                  Amount debited by Mobile Banking Transactions
AMT_L_DR                    Total Amount Debited
AMT_OTH_BK_ATM_USG_CHGS     Amount charged by way of Other Bank ATM usage
AMT_MIN_BAL_NMC_CHGS        Amount charged by way of Minimum Balance not maintained
NO_OF_IW_CHQ_BNC_TXNS       Amount charged by way of Inward Cheque Bounce
NO_OF_OW_CHQ_BNC_TXNS       Amount charged by way of Outward Cheque Bounce
AVG_AMT_PER_ATM_TXN         Avg. Amount withdrawn per ATM Transaction
AVG_AMT_PER_CSH_WDL_TXN     Avg. Amount withdrawn per Cash Withdrawal Transaction
AVG_AMT_PER_CHQ_TXN         Avg. Amount debited per Cheque Transaction
AVG_AMT_PER_NET_TXN         Avg. Amount debited per Net Transaction
AVG_AMT_PER_MOB_TXN         Avg. Amount debited per Mobile Banking Transaction
random                      Random Number
Data set characterizations
The following insights can be gained from the dataset:
1. The dataset comprises 20000 observations and 40 characteristics, of which one is the
dependent variable and the remaining 39 are independent variables.
2. Among the independent variables, 10 are categorical and 29 are continuous.
No variable has null or missing values.
3. The target (dependent) variable is categorical and binary: 1 means the person
responded and 0 means the person did not. There are 2512 customers who responded
(12.56% of the 20000 observations).
4. The dataset contains a few extra independent variables that can be removed, so that only the
important ones are kept for further investigation.
a. AGE, being continuous, is better suited than AGE_BKT for
segmentation. Hence, the AGE_BKT variable can be removed.
b. The random variable provides no valuable information for segmentation and
hence should be removed.
Descriptive Stats
Below are the descriptive statistics of some variables that can reflect customers' responses
to the personal loan proposition.
        AGE       BALANCE    SCR       HOLDING_PERIOD  LEN_OF_RLTN_IN_MNTH  NO_OF_L_CR_TXNS
count   20000     20000      20000     20000           20000                20000
mean    38.41815  511362.2   440.1503  14.95565        125.2393             12.34805
median  38        231675.8   364       15              125                  10
min     21        0          100       1               29                   0
25%     30        64754.03   227       7               79                   6
50%     38        231675.8   364       15              125                  10
75%     46        653876.9   644       22              172                  14
max     55        8360431    999       31              221                  75

        NO_OF_L_DR_TXNS  AMT_L_DR   AVG_AMT_PER_ATM_TXN  AVG_AMT_PER_CSH_WDL_TXN
count   20000            20000      20000                20000
mean    6.6337           773716.97  7408.839731          242236.4788
median  5                695115     6000                 147095
min     0                0          0                    0
25%     2                237935.5   0                    1265.45
50%     5                695115     6000                 147095
75%     7                1078927    13500                385000
max     74               6514921    25000                999640

        AVG_AMT_PER_CHQ_TXN  AVG_AMT_PER_NET_TXN  AVG_AMT_PER_MOB_TXN
count   20000                20000                20000
mean    25092.47838          179059.0293          20303.92041
median  8645                 0                    0
min     0                    0                    0
25%     0                    0                    0
50%     8645                 0                    0
75%     28605                257699               0
max     537842.22            999854               199667

The BALANCE variable shows a large gap between its mean (511362.2) and median (231675.8),
and between its 3rd quartile (653876.9) and maximum (8360431). This indicates the presence of
outliers in this variable.
Holding periods range from a minimum of 1 to a maximum of 31 days, with an average of
about 15 days. Customers with a holding period of less than 15 days may therefore be
more inclined to consider a personal loan.
Customers with a large number of financial transactions may be one profitable segment for
personal loans.
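The summary rows above (count, mean, median, quartiles, min, max) can be reproduced with a few lines of code. The following is a minimal sketch in Python using only the standard library; the report itself used R, and the `summarize` helper and the `balances` values are hypothetical and chosen only to show how a single extreme value pulls the mean far above the median.

```python
import statistics

def summarize(values):
    """Return count, mean, median, min, quartiles and max for a list of numbers."""
    # method="inclusive" uses linear interpolation between order statistics.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Hypothetical balances: nine ordinary values and one extreme value.
balances = [100, 200, 300, 400, 500, 600, 700, 800, 900, 100000]
stats = summarize(balances)

# A mean far above the median is a simple hint that outliers are present.
skew_ratio = stats["mean"] / stats["median"]
print(stats)
print("mean / median =", skew_ratio)
```

A ratio well above 1, together with a maximum far beyond the 3rd quartile, is the same pattern the report describes for BALANCE.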
Outlier Detection and Removal
The boxplot of the Balance variable shows many outliers, and hence the variable
is treated using the capping method at the 99th percentile.
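Capping at the 99th percentile can be sketched as follows. This is an illustrative Python version, not the report's R code; the `percentile` and `cap_at_percentile` helpers are hypothetical names, the percentile uses linear interpolation (other definitions give slightly different cut-offs), and the sample values are made up.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between order statistics (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def cap_at_percentile(values, pct=99.0):
    """Replace every value above the chosen percentile with that percentile."""
    cap = percentile(values, pct)
    return [min(v, cap) for v in values], cap

# Hypothetical balances: 1..100 plus one extreme value.
balances = list(range(1, 101)) + [10_000]
bal_cap, cap = cap_at_percentile(balances)
print("cap:", cap, "max after capping:", max(bal_cap))
```

Capping keeps every record in the data set and only limits the influence of extreme values, which is why it is often preferred to deleting the outlying rows.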
Feature Engineering
1. Variables AMT_OTH_BK_ATM_USG_CHGS and AMT_MIN_BAL_NMC_CHGS can be
combined into one single variable, AMT_CHGS, as both variables highlight the same
weakness of the customer: not being able to keep enough cash in the account.
2. Variables NO_OF_IW_CHQ_BNC_TXNS and NO_OF_OW_CHQ_BNC_TXNS can be
combined into a single variable, CHQ_BNCS, as both variables highlight the customer's
inability to adhere to the bank's cheque transfer policy.
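The report does not state how the paired variables are combined. The sketch below assumes a simple sum, which is one plausible choice; the `add_combined_features` helper and the sample record are hypothetical, and the code is Python rather than the R used in the report.

```python
def add_combined_features(row):
    """Return a copy of a customer record with AMT_CHGS and CHQ_BNCS added.

    Assumption: each combined variable is the sum of its two source variables.
    """
    out = dict(row)
    out["AMT_CHGS"] = row["AMT_OTH_BK_ATM_USG_CHGS"] + row["AMT_MIN_BAL_NMC_CHGS"]
    out["CHQ_BNCS"] = row["NO_OF_IW_CHQ_BNC_TXNS"] + row["NO_OF_OW_CHQ_BNC_TXNS"]
    return out

# Hypothetical customer record.
customer = {
    "AMT_OTH_BK_ATM_USG_CHGS": 50,
    "AMT_MIN_BAL_NMC_CHGS": 170,
    "NO_OF_IW_CHQ_BNC_TXNS": 1,
    "NO_OF_OW_CHQ_BNC_TXNS": 2,
}
combined = add_combined_features(customer)
print(combined["AMT_CHGS"], combined["CHQ_BNCS"])
```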
Information Value test:
Information value (IV) is computed for all the above independent variables and the results
are as below:
The following insights can be drawn from the above summary plot:
1. ACC_OP_DATE should be removed as its information value is suspiciously
high.
2. Variables with average IVs have comparatively higher predictive power than
variables with weak and very weak predictive power.
3. Variables with very weak predictive power will not be used in future model building.
4. None of the IVs is significantly strong. The variables also show a high chance of
multicollinearity, which should be removed in further steps before building the model.
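For reference, information value is usually computed per bin as (share of responders minus share of non-responders) times the natural log of their ratio, summed over all bins. The sketch below is a minimal Python version for a variable that is already categorical or binned; the `information_value` helper, the smoothing constant and the toy data are assumptions made for illustration, and the report's own R computation is not shown in the source.

```python
import math

def information_value(categories, targets, smoothing=0.5):
    """Information value of a binned variable against a 0/1 target."""
    counts = {}
    for cat, y in zip(categories, targets):
        pair = counts.setdefault(cat, [0, 0])
        pair[y] += 1  # index 0: non-responders, index 1: responders
    total_non = sum(c[0] for c in counts.values())
    total_resp = sum(c[1] for c in counts.values())
    bins = len(counts)
    iv = 0.0
    for non, resp in counts.values():
        # Smoothing avoids log(0) when a bin has no responders or no non-responders.
        pct_non = (non + smoothing) / (total_non + smoothing * bins)
        pct_resp = (resp + smoothing) / (total_resp + smoothing * bins)
        woe = math.log(pct_resp / pct_non)  # weight of evidence of the bin
        iv += (pct_resp - pct_non) * woe
    return iv

# Toy data: two bins of 50 customers each, 20 responders in total.
bins_ab = ["A"] * 50 + ["B"] * 50
flat = [1] * 10 + [0] * 40 + [1] * 10 + [0] * 40    # same response rate in both bins
strong = [1] * 18 + [0] * 32 + [1] * 2 + [0] * 48   # responders concentrated in bin A
iv_flat = information_value(bins_ab, flat)
iv_strong = information_value(bins_ab, strong)
print(iv_flat, iv_strong)
```

A variable whose bins all have the same response rate gets an IV of zero, while a variable that separates responders from non-responders gets a larger value. An implausibly large IV, as reported for ACC_OP_DATE, is usually a reason to inspect the variable rather than to trust it.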
Hypothesis Formulation and Validation
The hypotheses formulated are:
H1N: There is a positive relationship between Age of respondents and positive response rate for cross-selling
personal loans.
H2N: There is a positive relationship between Occupation of respondents and positive response rate for
cross-selling personal loans.
H3N: There is a positive relationship between Balance in the account and positive response rate for
cross-selling personal loans.
H4N: There is a positive relationship between Holding period (ability to hold money in the account) and positive
response rate for cross-selling personal loans.
H5N: There is a positive relationship between the customer holding a credit card and positive response rate for
cross-selling personal loans.
H6N: There is a positive relationship between the customer bearing an old loan and positive response rate for
cross-selling personal loans.
H7N: There is a positive relationship between the customer's length of relationship (in months) and positive response
rate for cross-selling personal loans.
H8N: There is a positive relationship between penalty charges on the customer due to his/her inability to keep
the minimum balance or due to extra transactions from other banks' ATMs and positive response rate for cross-selling
personal loans.
H9N: There is a positive relationship between average number of transactions made by the customer through
mobile and positive response rate for cross-selling personal loans.
H10N: There is a positive relationship between SCR score and positive response rate for cross-selling personal
loans.
Test of Multicollinearity
1. A test of multicollinearity was conducted among the variables. Perfect multicollinearity
was found among some of the variables, and hence these variables were removed. The
output was produced using the "Alias" method.
The following inferences can be drawn from the above test:
1. There exists perfect collinearity between NO_OF_BR_CSH_WDL_DR_TXNS,
NO_OF_ATM_DR_TXNS, NO_OF_NET_DR_TXNS, NO_OF_CHQ_DR_TXNS,
NO_OF_MOB_DR_TXNS and NO_OF_L_DR_TXNS. Hence only
NO_OF_L_DR_TXNS should be retained.
2. There exists perfect collinearity between AMT_BR_CSH_WDL_DR, AMT_CHQ_DR,
AMT_NET_DR, AMT_MOB_DR, AMT_ATM_DR and AMT_L_DR. Hence only
AMT_L_DR should be retained. All other variables in these two groups will be removed.
3. Below is the output of the multicollinearity test conducted using the VIF function in R.
Variables with a VIF of more than 2.5 show significant multicollinearity.
Variable                   VIF(fit) - Multicollinearity
HOLDING_PERIOD             1.257859776
LEN_OF_RLTN_IN_MNTH        1.002948848
NO_OF_L_CR_TXNS            15490.28716
NO_OF_L_DR_TXNS            6182.098268
TOT_NO_OF_L_TXNS           33083.68875
FLG_HAS_CC                 1.005628463
AMT_L_DR                   6.401190918
FLG_HAS_ANY_CHGS           1.429527274
AVG_AMT_PER_ATM_TXN        1.188219408
AVG_AMT_PER_CSH_WDL_TXN    1.948792138
AVG_AMT_PER_CHQ_TXN        3.192796932
AVG_AMT_PER_NET_TXN        2.097151084
AVG_AMT_PER_MOB_TXN        1.450994394
FLG_HAS_NOMINEE            1.002334598
FLG_HAS_OLD_LOAN           1.002379004
AMT_CHGS                   1.492890736
CHQ_BNCS                   1.259316183
BAL_CAP                    1.01443018
SCR                        1.008704295
4. The above variables are then treated using the VIF method over successive passes
(V1 to V4). Variables with a VIF of more than 2.5 are removed and the VIFs of the
remaining variables are recomputed; NA marks a variable that has been removed.
Variables after Collinearity reduction    V1          V2          V3          V4
HOLDING_PERIOD                            1.25786     1.257162    1.254599    1.097884
LEN_OF_RLTN_IN_MNTH                       1.002949    1.002539    1.002507    1.002506
NO_OF_L_CR_TXNS                           15490.29    1.710236    1.709707    1.339096
NO_OF_L_DR_TXNS                           6182.098    3.474179    2.847908    1.339096
TOT_NO_OF_L_TXNS                          33083.69    NA          NA          NA
FLG_HAS_CC                                1.005628    1.003904    1.003877    1.003318
AMT_L_DR                                  6.401191    6.399572    NA          NA
FLG_HAS_ANY_CHGS                          1.429527    1.429423    1.429279    1.421228
AVG_AMT_PER_ATM_TXN                       1.188219    1.184399    1.184338    1.13089
AVG_AMT_PER_CSH_WDL_TXN                   1.948792    1.948328    1.032795    1.024738
AVG_AMT_PER_CHQ_TXN                       3.192797    3.192078    1.385721    1.212004
AVG_AMT_PER_NET_TXN                       2.097151    2.097056    1.076701    1.076015
AVG_AMT_PER_MOB_TXN                       1.450994    1.450993    1.324462    1.245637
FLG_HAS_NOMINEE                           1.002335    1.002298    1.002275    1.002073
FLG_HAS_OLD_LOAN                          1.002379    1.002276    1.002234    1.002189
AMT_CHGS                                  1.492891    1.492857    1.448165    1.194723
CHQ_BNCS                                  1.259316    1.259172    1.256689    1.247004
BAL_CAP                                   1.01443     1.014124    1.014079    1.013422
SCR                                       1.008704    1.008701    1.008672    1.008648
5. The variables below are retained after the multicollinearity test. All the other
variables are dropped.
Final Variables            VIF(fit) - Final output
HOLDING_PERIOD             1.097884
LEN_OF_RLTN_IN_MNTH        1.002506
NO_OF_L_CR_TXNS            1.339096
FLG_HAS_CC                 1.003318
FLG_HAS_ANY_CHGS           1.421228
AVG_AMT_PER_ATM_TXN        1.13089
AVG_AMT_PER_CSH_WDL_TXN    1.024738
AVG_AMT_PER_CHQ_TXN        1.212004
AVG_AMT_PER_NET_TXN        1.076015
AVG_AMT_PER_MOB_TXN        1.245637
FLG_HAS_NOMINEE            1.002073
FLG_HAS_OLD_LOAN           1.002189
AMT_CHGS                   1.194723
CHQ_BNCS                   1.247004
BAL_CAP                    1.013422
SCR                        1.008648
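The step-by-step VIF reduction can be sketched as a loop: compute the VIF of every variable, drop the one with the highest VIF if it exceeds 2.5, and repeat. The Python sketch below is illustrative only and is not the report's R code. It uses the fact that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix; the helper names and the four toy columns are hypothetical, and the small jitter in `total_txns` stands in for near-perfect (not exact) collinearity.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = sum(d * d for d in dev) ** 0.5
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(data):
    """VIF per variable: diagonal of the inverse correlation matrix."""
    names = list(data)
    inverse = invert(correlation_matrix([data[k] for k in names]))
    return {name: inverse[i][i] for i, name in enumerate(names)}

def prune_by_vif(data, threshold=2.5):
    """Repeatedly drop the variable with the highest VIF above the threshold."""
    data = dict(data)
    dropped = []
    while len(data) > 1:
        scores = vif(data)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        dropped.append(worst)
        del data[worst]
    return list(data), dropped

# Toy columns: total_txns is almost exactly credit_txns + debit_txns.
credit = [6, 6, 6, 6, 2, 2, 2, 2]
debit = [5, 5, 3, 3, 5, 5, 3, 3]
jitter = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
toy = {
    "credit_txns": credit,
    "debit_txns": debit,
    "total_txns": [c + d + j for c, d, j in zip(credit, debit, jitter)],
    "scr": [300, 100, 300, 100, 300, 100, 300, 100],
}
first_pass = vif(toy)
kept, dropped = prune_by_vif(toy)
print(first_pass)
print("kept:", kept, "dropped:", dropped)
```

Dropping one variable at a time matters because removing a single redundant column can bring the VIFs of its partners back to acceptable levels, as the V1 to V4 columns above show for NO_OF_L_CR_TXNS.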
Feature Selection
After removing multicollinearity, feature selection is conducted to calculate
the importance of each independent variable. This was done using the
"varImp" function.
Inferences:
Variables with a higher importance value (sorted in descending order) have higher
predictive power than variables with a low importance value. For example,
HOLDING_PERIOD has the highest impact in formulating the predictive
model.
Variable                   Importance
HOLDING_PERIOD             16.04418912
FLG_HAS_CC                 14.32595706
OCCUPATIONSELF-EMP         11.92868884
NO_OF_L_CR_TXNS            10.04763136
SCR                        9.926999009
BAL_CAP                    6.490046607
OCCUPATIONSAL              4.875804888
GENDERM                    4.022138465
AVG_AMT_PER_MOB_TXN        3.778146074
LEN_OF_RLTN_IN_MNTH        3.578676901
FLG_HAS_ANY_CHGS           3.38374503
AVG_AMT_PER_CSH_WDL_TXN    2.425247113
AVG_AMT_PER_ATM_TXN        2.369453574
AMT_CHGS                   2.302677648
AVG_AMT_PER_NET_TXN        2.206919209
AGE                        1.892477818
FLG_HAS_OLD_LOAN           1.797516819
FLG_HAS_NOMINEE            1.510326079
GENDERO                    1.456827527
NO_OF_L_DR_TXNS            0.831072302
AVG_AMT_PER_CHQ_TXN        0.790431134
ACC_TYPESA                 0.292790356
OCCUPATIONSENP             0.100847172
CHQ_BNCS                   0.100478857
Building the statistical model
Creating the training and testing datasets
1. The data is split into 3 parts, namely the Development, Validation and Holdout samples.
2. The target variable (binary) is distributed evenly across the samples using a random function.
3. The Development sample constitutes 50% of the total data set.
4. The Validation sample constitutes 30% of the total data set, whereas the Holdout sample
constitutes the remaining 20%.
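A 50/30/20 split that keeps the share of responders the same in every sample can be sketched as below. This is an illustrative Python version and not the report's R code. It assumes the split is stratified on TARGET, which is one reading of the "equally distributed" target; the `stratified_split` helper, the seed and the toy records are hypothetical.

```python
import random

def stratified_split(records, target_key="TARGET", fractions=(0.5, 0.3, 0.2), seed=42):
    """Split records into development, validation and holdout samples,
    shuffling within each target class so class shares stay the same."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[target_key], []).append(rec)
    dev, val, hold = [], [], []
    for group in by_class.values():
        group = group[:]          # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_dev = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        dev.extend(group[:n_dev])
        val.extend(group[n_dev:n_dev + n_val])
        hold.extend(group[n_dev + n_val:])   # the remainder, about 20%
    return dev, val, hold

# Toy data: 1000 customers, every tenth one a responder.
records = [{"CUST_ID": i, "TARGET": 1 if i % 10 == 0 else 0} for i in range(1000)]
dev, val, hold = stratified_split(records)
for name, part in (("development", dev), ("validation", val), ("holdout", hold)):
    responders = sum(r["TARGET"] for r in part)
    print(name, len(part), responders)
```

Fixing the random seed makes the split reproducible, so the same customers land in the same sample each time the analysis is rerun.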
Independent Variable Transformation
Age
When the relationship between the target variable and the independent
variable Age was examined on its own, the R-square of the linear trend was very low
(R² = 0.1172).
Since the response rate peaks in the age bracket 41-44 and the data beyond
the peak appears to mirror the trend before the peak, we transformed the
variable Age.
We used the following criterion for the transformation (ages of 43 and below are assumed to be left unchanged):
ifelse(NEW_LR_DF$LR_DF.AGE > 43, 43 - (NEW_LR_DF$LR_DF.AGE - 43), NEW_LR_DF$LR_DF.AGE)
After the transformation, Age shows an approximately linear relationship with the
response rate, with R² = 0.8068.
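The effect of the criterion is to fold ages above 43 back onto the rising side of the curve. The short Python sketch below illustrates this; the `mirror_age` helper and the sample ages are hypothetical, and leaving ages of 43 and below unchanged is an assumption.

```python
def mirror_age(age, peak=43):
    """Reflect ages above the peak around it: 44 -> 42, 45 -> 41, and so on."""
    return peak - (age - peak) if age > peak else age

ages = [25, 40, 43, 44, 50, 55]
transformed = [mirror_age(a) for a in ages]
print(transformed)  # [25, 40, 43, 42, 36, 31]
```

After the fold, a 50-year-old is treated like a 36-year-old, so a response rate that rises up to the peak and falls afterwards becomes a single increasing trend that a linear term can capture.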
[Figure: R Square Test Before Age Imputation. Customer count (cnt) and response probability (prob) by age bucket, from 21 to 25 up to 52 to 55, with linear trend line for prob: y = 0.0029x + 0.1087, R² = 0.1172]
Occupation:
We found that the occupations SALARIED (SAL) and SENP were insignificant.
We transformed the occupation field on the basis of the following criterion:
MYDATA.DEV$DV_OCC = ifelse(
MYDATA.DEV$LR_DF.OCCUPATION %in% c("SENP", "SAL", "PROF"),
"SENP-SAL-PROF", "SELF-EMP")
Logistic regression output after transformation:
[Figure: R Square Test After Age Imputation. Response probability by transformed age bucket, from 21 to 25 up to 41 to 43, with linear trend line: y = 0.0069x + 0.0861, R² = 0.8068]
Model Development: Logistic Regression.
Stage 1: Development sample.
Output: The model ran without problems on the development sample. All identified independent
variables are significant.
Model Performance Measures:
Goodness of Fit Test: Hosmer-Lemeshow test
With the p-value of the chi-square statistic above 0.05, the test shows no significant evidence of lack of fit.
Concordance Test
With concordance above 80%, the model separates responders from non-responders well.
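Concordance is computed over all (responder, non-responder) pairs: a pair is concordant when the responder received the higher predicted probability. The Python sketch below illustrates the calculation; the `concordance` helper, the predicted probabilities and the labels are made up, and the code is not the report's R implementation.

```python
def concordance(probabilities, targets):
    """Return the shares of concordant, discordant and tied pairs."""
    responders = [p for p, y in zip(probabilities, targets) if y == 1]
    non_responders = [p for p, y in zip(probabilities, targets) if y == 0]
    concordant = discordant = tied = 0
    for p1 in responders:
        for p0 in non_responders:
            if p1 > p0:
                concordant += 1      # responder scored higher: correct ordering
            elif p1 < p0:
                discordant += 1      # responder scored lower: wrong ordering
            else:
                tied += 1
    pairs = len(responders) * len(non_responders)
    return concordant / pairs, discordant / pairs, tied / pairs

# Hypothetical predicted probabilities and actual responses.
probs = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.35, 0.1]
actual = [1, 1, 1, 0, 0, 0, 0, 0]
conc, disc, ties = concordance(probs, actual)
print(round(conc, 4), round(disc, 4), round(ties, 4))
```

Here 13 of the 15 pairs are concordant, about 86.7%. The pairwise loop is quadratic in the number of customers, so on a data set of this size a sort-based or rank-based calculation would be preferable.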
Validation of Model
All the identified independent variables are significant when tested on the validation
sample. The Age and Occupation transformations were applied again at this stage and the
results were compared with those of the development sample to ensure strong model performance.
Model Performance on Hold out Sample
As a final check of the predictive power of the model built using
logistic regression, we test the model on the holdout sample. All the variables
found to be significant in the first two steps of the
validation remain significant.
With this, we are ready for the model deployment and its actual implementation.
Model Implementation / Deployment Strategy
1. Factors such as the age, occupation and SCR score of a customer significantly affect
the probability that the customer buys a personal loan. Unsupervised machine learning
techniques should be used for further customer segmentation to identify profitable segments.
2. As per the above results, customers transacting through mobile and ATM channels are
significantly more likely to buy personal loans. Hence, the bank should target customers
who are more likely to use mobile or ATM services rather than net banking or direct cash/cheque
withdrawal services for monetary transactions.
3. Customers holding credit cards are generally more prone to heavy transactions and hence
can be a profitable target for cross-selling personal loans. The bank should reach out to this
potential customer segment and promote its loan offers.
4. There exists a clear relationship between a customer's account balance and the conversion
rate for personal loans. Customers with a low average account balance should be
targeted. Also, customers who are frequently penalized for not maintaining their
minimum account balance, or for transacting from other banks' ATMs beyond permissible
limits, should be identified and targeted.
5. Similarly, the holding period of a customer (ability to hold money in the account) is an
important factor to focus on. Customers with low holding periods can turn out to be
a profitable segment for cross-selling personal loans. Such customers should be segmented
and pursued further.
6. Last but not least, the bank should also consider the length of the relationship a customer
has with the bank. Old customers, or new customers with frequent interactions with the bank,
are more likely to purchase a personal loan from the same bank. A follow-up should be
prioritized for such customers for further profits on cross-selling personal loans.
Raw dataset: PL_XSELL.csv
Metadata: PL_XSell_Metadata.xlsx