Thera Bank - Loan Purchase Modeling
This case concerns a bank with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio with a minimal budget. The department wants to build a model that will help identify the potential customers who have a higher probability of purchasing the loan, increasing the success ratio while reducing the cost of the campaign. The dataset has data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Our job is to build the best model which can classify the right customers who have a higher probability of purchasing the loan. We are expected to do the following:
EDA of the data available. Showcase the results using appropriate graphs
Apply appropriate clustering on the data and interpret the output.
Build appropriate models (CART & Random Forest) on the training data and apply them to the test data. Interpret all the model outputs and make the necessary modifications wherever applicable (such as pruning).
Check the performance of all the models you have built (on both train and test data). Use all the model performance measures you have learned so far, and share your remarks on which model performs the best.
This project requires us to understand which mode of transport employees prefer to commute to their office. The available data includes information about employees' mode of transport as well as personal and professional details such as age, salary, and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and identify which variables are significant predictors of this decision.
The following is expected in this assessment.
EDA
Perform an EDA on the data
Illustrate the insights based on EDA
Check for multicollinearity: plot a graph (e.g., a correlation plot) showing the multicollinearity and treat it.
Data Preparation
Prepare the data for analysis (SMOTE)
Modeling
Create multiple models and explore how each model performs using appropriate model performance metrics:
KNN
Naive Bayes
Logistic Regression
Apply both bagging and boosting modeling procedures to create two models and compare their accuracy with the best model from the step above. An illustrative sketch of the data preparation and modeling steps follows this list.
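An illustrative sketch of the preparation and modeling steps above, in R. The data frame name (commute), the target column (Transport, a factor with levels "Car"/"Other"), and the use of the DMwR package for SMOTE are assumptions, not details from the source:

library(DMwR)    # SMOTE (package now archived on CRAN; themis is a modern alternative)
library(class)   # knn
library(e1071)   # naiveBayes

set.seed(42)
idx   <- sample(nrow(commute), floor(0.7 * nrow(commute)))
train <- commute[idx, ]
test  <- commute[-idx, ]

# Balance the minority class ("Car") before modeling
train_bal <- SMOTE(Transport ~ ., data = train, perc.over = 200, perc.under = 200)

# Logistic regression
glm_fit  <- glm(Transport ~ ., data = train_bal, family = binomial)
glm_prob <- predict(glm_fit, newdata = test, type = "response")

# Naive Bayes
nb_pred <- predict(naiveBayes(Transport ~ ., data = train_bal), test)

# KNN on scaled numeric predictors (test set scaled with training parameters)
num  <- names(train_bal)[sapply(train_bal, is.numeric)]
tr_x <- scale(train_bal[, num])
te_x <- scale(test[, num], center = attr(tr_x, "scaled:center"),
              scale = attr(tr_x, "scaled:scale"))
knn_pred <- knn(tr_x, te_x, cl = train_bal$Transport, k = 5)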
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case of customer churn, working on a dataset of postpaid customers with a contract. The data has information about customer usage behavior, contract details, and payment details, and it also indicates which customers canceled their service. Based on this past data, we need to build a model that can predict whether a customer will cancel their service in the future or not.
The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum regression model to predict satisfaction.
Perform exploratory data analysis on the dataset, showcasing appropriate charts and graphs. Check for outliers and missing values.
Check for evidence of multicollinearity.
Perform simple linear regression for the dependent variable with every independent variable.
Perform PCA/factor analysis by extracting 4 factors. Interpret the output and name the factors.
Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables.
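An illustrative sketch of this workflow in R, assuming the CSV contains a Satisfaction column plus numeric predictor columns only (the column name is an assumption):

hair <- read.csv("Factor-Hair-Revised.csv")
X    <- subset(hair, select = -Satisfaction)   # numeric predictors

round(cor(X), 2)                     # evidence of multicollinearity

fa <- factanal(X, factors = 4, scores = "regression")
print(fa$loadings, cutoff = 0.4)     # inspect loadings to name the factors

fit <- lm(hair$Satisfaction ~ fa$scores)   # satisfaction on the four factor scores
summary(fit)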
We are required to create an India credit risk (default) model using the data provided, and to validate it, using the logistic regression framework to develop the credit default model.
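A minimal sketch of that framework in R; the file names and the Default column are placeholders, not details from the source:

train <- read.csv("raw-data.csv")        # assumed training file name
fit   <- glm(Default ~ ., data = train, family = binomial)
summary(fit)                             # inspect significant predictors

val  <- read.csv("validation_data.csv")  # assumed validation file name
prob <- predict(fit, newdata = val, type = "response")
table(predicted = prob > 0.5, actual = val$Default)   # assumed 0.5 cutoff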
The purpose of this project is to do a visual analysis of the given data and arrive at useful insights for the principal company. There is no clear set of instructions in such open-ended problems; the consultant is expected to do a lot of exploration first and formulate the problems themselves.
Target Audience
The target audience for this report is the top management and the CXOs of the insurance company whose data we are analyzing.
It could also provide insights to people from the same industry, as well as from industries that work in close coordination with insurance companies.
Tool Used
The tool used for this analysis is Tableau Desktop (Public edition) running on macOS.
Prediction of house price using multiple regression (vinovk)
- Constructed a mathematical model using Multiple Regression to estimate the Selling price of the house based on a set of predictor variables.
- SAS was used for Variable profiling, data transformations, data preparation, regression modeling, fitting data, model diagnostics, and outlier detection.
Data Science: Prediction analysis for houses in Ames, Iowa (Ashish Menkudale)
For the vastly diversified realty market, with property prices increasing exponentially, it becomes essential to study the factors which directly or indirectly affect a customer's decision to buy a house, and to predict the market trend. In general, for any purchase, a potential customer makes the decision based on value for money.
The problem statement was taken from the website Kaggle. We chose this specific problem because it provided us an opportunity to build a prediction model for real-life problems like the prediction of prices for houses in Ames, Iowa.
Automation of IT Ticket Automation using NLP and Deep Learning (Pranov Mishra)
Overview of the problem solved: IT leverages the Incident Management process to ensure business operations are never impacted. The assignment of incidents to appropriate IT groups is still a manual process in many IT organizations. Manual assignment of incidents is time-consuming and requires human effort. There may be mistakes due to human error, and resources are consumed ineffectively because of misrouting. Manual assignment increases response and resolution times, which results in deteriorating user satisfaction and poor customer service.
Solution: Multiple deep learning sequential models with GloVe embeddings were attempted and the results compared to arrive at the best model. The two best models are highlighted below through their results.
1. A bi-directional LSTM on the data set gave an accuracy of 71% and a precision of 71%.
2. The accuracy and precision were further improved to 73% and 76% respectively when an ensemble of 7 Bi-LSTMs was built.
An NLP-based deep learning model was built to solve the above problem; the code is at the link below.
https://github.com/Pranov1984/Application-of-NLP-in-Automated-Classification-of-ticket-routing
Data Trend Analysis by Assigning Polynomial Function For Given Data Set (IJCERT)
This paper explains the method of creating a polynomial equation from a given data set, which can serve as a representation of the data itself and can be queried with aggregations to find results. The approach uses the least-squares technique to construct a model of the data fitted to a polynomial. Differential calculus is then used on this equation to generate aggregated results that represent the original data set.
Data Science - Part III - EDA & Model Selection (Derek Kane)
This lecture introduces the concept of EDA, understanding, and working with data for machine learning and predictive analysis. The lecture is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, variable selection techniques including principal component analysis, and finally get into the concept of model selection.
This project aims to determine the housing prices of California properties for new sellers and also for buyers to estimate the profitability of the deal using various regression models.
Below are the details of the models implemented and their performance scores:
Linear Regression: RMSE- 68321.7051304
Decision Tree Regressor: RMSE- 70269.5738668
Random Forest Regressor: RMSE- 52909.1080535
Support Vector Regressor: RMSE- 110914.791356
Fine Tuning the Hyperparameters for Random Forest Regressor: RMSE- 49261.2835608
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... (ijcseit)
Scrutiny for prediction is the hallmark of modern statistics, where accuracy matters most. Matching algorithms with the right statistical implementation provides better outcomes in terms of accurate prediction on data sets. Prolific use of algorithms leads toward the simplification of mathematical models, which require less manual calculation. Prediction is the essence of data science and machine learning applications, imparting control over situations. Implementing any of these approaches requires proper feature extraction, which helps in proper model building and assists precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution using feature engineering techniques, that unravel the accuracy of different machine learning models.
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 (Rebecca Bilbro)
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed, and more effective.
Use of Linear Regression in Machine Learning for Ranking (ijsrd.com)
Machine learning is a growing field within AI. In this paper we discuss the use of a supervised learning algorithm, regression learning, for ranking. Regression learning is used as a prediction model: the values of the dependent variable are predicted by the regression model based on the values of the independent variables. If, after experience E, a program improves its performance P, the program is said to be doing regression learning. We chose linear regression for ranking and discuss approaches for building a rank regression model by selecting the best ranking parameters from domain knowledge and confirming their selection by performing regression analysis during model building. An example is explained and its results analyzed, and we discuss how the combined regression and ranking approach is better for enhancing the use of linear regression for ranking. We conclude and suggest future work on ranking and regression.
Machine learning session 6 (decision trees, random forest) (Abhimanyu Dwivedi)
Concepts include decision trees with examples; measures used for splitting in decision trees, such as the Gini index, entropy, and information gain; pros and cons; and validation. Also covers the basics of random forests with examples and uses.
Data Science - Part V - Decision Trees & Random Forests (Derek Kane)
This lecture provides an overview of decision tree machine learning algorithms and random forest ensemble techniques. The practical example includes diagnosing Type II diabetes and evaluating customer churn in the telecommunication industry.
Decision Trees for Classification: A Machine Learning Algorithm (Palin Analytics)
Decision Trees in Machine Learning - The decision tree method is a commonly used data mining method for establishing classification systems based on several covariates or for developing prediction algorithms for a target variable.
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES (Vikash Kumar)
Image classification using KNN, random forest, and SVM algorithms on glaucoma datasets, explaining the accuracy, sensitivity, and specificity of each algorithm.
The presentation covers the application of a machine learning approach to classification and regression for modelling the expected loss in P&C insurance.
An Efficient Clustering Method for Aggregation on Data Fragments (IJMER)
Clustering is an important step in the process of data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality cluster. Existing clustering aggregation algorithms are applied directly to a large number of data points and are inefficient when the number of data points is large. This project defines an efficient approach for clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase the efficiency of the proposed approach, clustering aggregation is performed directly on data fragments; under a comparison measure and normalized mutual information measures, enhanced clustering aggregation algorithms (agglomerative, furthest, and local search) are described that achieve minimal computational complexity while increasing accuracy.
AI professionals use top machine learning algorithms to automate models that analyze larger and more complex data than was possible with older machine learning algorithms.
Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ... (Infogain Publication)
The shopping mall domain is a dynamic and unpredictable environment. Traditional techniques such as fundamental and technical analysis can provide investors with some tools for managing their shops and predicting their business growth. However, these techniques cannot discover all the possible relations affecting business growth, so a different approach providing a deeper kind of analysis is needed. Data mining can be used extensively in shopping malls to help increase business growth, so there is a need to find a suitable algorithm for this kind of environment. We therefore study a few methods of pruning decision trees. Finally, we make use of the cost-based pruning method to obtain an objective evaluation of the tendency to over-prune or under-prune observed in each method.
This presentation discusses the following topics:
Types of Problems Solved Using Artificial Intelligence Algorithms
Problem categories
Classification Algorithms
Naive Bayes
Example: A person playing golf
Decision Tree
Random Forest
Logistic Regression
Support Vector Machine
K Nearest Neighbors
The owner of the restaurant wants us to use this data to come up with a set of recommendations that can help his café chain increase its revenues. He has not been able to launch a loyalty program and cannot provide a data set with customer-level information, but he can provide a point-of-sale (POS) data set for one of his chains.
The case needs methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling. Explore the gas dataset (Australian monthly gas production).
Read the data as a time series object in R. Plot the data.
Understand which components of the time series are present in this dataset.
Check the periodicity of the dataset.
Partition the dataset so that the data from 1994 onwards forms the test set.
Inspect the series visually as well as conduct an ADF test.
Write down the null and alternative hypotheses for the stationarity test. De-seasonalise the series if seasonality is present.
Develop an initial forecast for the next 20 periods and check it using the various accuracy metrics; after finalising the model, develop a final forecast for the 12 time periods. Use both manual ARIMA and auto.arima (show and explain all the steps; a sketch follows this list).
Report the accuracy of the model.
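A minimal sketch of this workflow in R using the forecast and tseries packages; the gas dataset ships with the forecast package, while the exact split point and model settings below are illustrative assumptions:

library(forecast)   # gas data, auto.arima, forecast, accuracy
library(tseries)    # adf.test

plot(gas)                        # Australian monthly gas production
frequency(gas)                   # 12, i.e. monthly periodicity
plot(decompose(gas))             # trend / seasonal / remainder components

gas_train <- window(gas, end = c(1993, 12))    # hold out 1994 onwards
gas_test  <- window(gas, start = c(1994, 1))

adf.test(gas_train)              # H0: unit root (non-stationary) vs H1: stationary

fit <- auto.arima(gas_train, seasonal = TRUE)
fc  <- forecast(fit, h = 20)     # initial 20-period forecast
accuracy(fc, gas_test)           # compare against the hold-out data
plot(fc)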
Problem 1
Cold Storage started its operations in Jan 2016. They are in the business of storing pasteurized fresh whole or skimmed milk, sweet cream, and flavored milk drinks. To ensure that there is no change of texture or body appearance and no separation of fats, the optimal temperature to be maintained is between 2 and 4 deg C.
In the first year of business they outsourced the plant maintenance work to a professional company with stiff penalty clauses. It was agreed that if it was statistically proven that the probability of the temperature going outside the 2-4 deg C range during the one-year contract was above 2.5% and less than 5%, then the penalty would be 10% of the AMC (annual maintenance contract) fee. If it exceeded 5%, then the penalty would be 25% of the AMC fee.
Problem 2
In Mar 2018, Cold Storage started getting complaints from their clients, who had in turn been getting complaints from end consumers that the dairy products were going sour and often smelled. On getting these complaints, the supervisor pulled out the temperature data for the last 35 days. As a safety measure, the supervisor had been vigilant in maintaining the temperature below 3.9 deg C.
Assume 3.9 deg C as the upper limit of the acceptable temperature range. At alpha = 0.1, do you feel that there is a need for corrective action in the Cold Storage plant, or is the problem on the procurement side, from where Cold Storage is getting the dairy products?
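A minimal sketch of the Problem 2 hypothesis test in R, assuming the 35 daily temperatures are in a vector named temps (the name is a placeholder): we test H0: mu <= 3.9 against H1: mu > 3.9 at alpha = 0.1.

# One-sided, one-sample t-test of the mean daily temperature
t.test(temps, mu = 3.9, alternative = "greater", conf.level = 0.90)
# If the p-value < 0.1, reject H0: corrective action is needed at the plant.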
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, typically operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Bank loan purchase modeling
1. Project – 3.
DATA MINING
Thera Bank - Loan Purchase Modeling
By,
Saleesh Satheeshchandran, PGP BABI 2019-20 (G6)
2. Objective of the project - Thera Bank - Loan Purchase Modeling
The objective of the project is to build the best model which can classify the right customers who have a higher probability of purchasing the loan.
The dataset has data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Assumptions
There are no particular assumptions.
Exploratory Data Analysis
1. The dimensions of the data show 5000 observations.
2. Some of the categorical variables are converted to factors.
3. Boxplots of the following variables were examined:
- Age – no outliers
- Experience – no outliers
- Income – numerous outliers
- Zipcode – 1 outlier
- Family Members – no outliers
7. Clustering
The following clustering models were applied:
1. K-Means
2. Hierarchical
The k-means algorithm is executed in the following steps:
• Partition the objects into k non-empty subsets.
• Identify the cluster centroids (mean points) of the current partition.
• Compute the distance from each point to every centroid and assign each point to the cluster whose centroid is nearest.
• After re-assigning the points, recompute the centroid of each new cluster.
• Repeat until the assignments no longer change.
• Three clusters are obtained by the k-means clustering.
• These are not very distinct clusters.
• In this project the k-means clustering has not been very effective.
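A minimal sketch of the k-means step in R; the data frame name (bank) and the choice of columns to scale are assumptions for illustration:

# Scale the numeric customer variables so no single one dominates the distance
scaled <- scale(bank[, c("Age", "Experience", "Income", "CCAvg", "Mortgage")])
set.seed(123)
km <- kmeans(scaled, centers = 3, nstart = 25)
table(km$cluster)                                     # cluster sizes
aggregate(as.data.frame(scaled),
          by = list(cluster = km$cluster), FUN = mean)  # cluster profiles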
Hierarchical clustering builds a cluster hierarchy that is commonly displayed as a tree diagram called a dendrogram. It begins with each object in a separate cluster. At each step, the two clusters that are most similar are joined into a single new cluster. Once fused, objects are never separated.
The horizontal axis of the dendrogram represents the clusters. The vertical scale represents the distance or dissimilarity. Each joining (fusion) of two clusters is represented on the diagram by the splitting of a vertical line into two vertical lines. The vertical position of the split, shown by a short bar, gives the distance (dissimilarity) between the two clusters.
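A minimal sketch of the hierarchical step in R, reusing the scaled matrix above; Euclidean distance and Ward linkage are assumptions for illustration:

d  <- dist(scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE)       # dendrogram
groups <- cutree(hc, k = 3)    # cut the tree into three clusters
table(groups)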
10. CART Model
The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning.
The resulting tree is composed of decision nodes, branches, and leaf nodes. The tree is drawn top-down, so the root is at the top and the leaves indicating the outcomes are at the bottom.
The tree stops growing when one of the following three criteria is met:
all leaf nodes are pure, with a single class;
no splitting method can usefully partition the training observations remaining in a leaf node;
the number of observations in a leaf node reaches the pre-specified minimum.
However, a full tree including all predictors tends to be very complex and difficult to interpret when we have a large data set with multiple predictors. Additionally, it is easy to see that a fully grown tree will overfit the training data and might lead to poor test set performance.
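A minimal sketch of growing the CART model in R with rpart; the data frame and target names, and the control values for the full tree, are assumptions for illustration:

library(rpart)
library(rpart.plot)

# Grow a full tree (cp = 0 disables cost-complexity stopping, so it overfits)
cart <- rpart(Personal.Loan ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 20, cp = 0, xval = 10))
rpart.plot(cart)   # the unpruned tree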
12. PRUNING
Unpruned Tree
Briefly, our goal here is to see if a smaller subtree can give us results comparable to the fully grown tree. If yes, we should go for the simpler tree, because it reduces the likelihood of overfitting.
One possible robust strategy for pruning the tree (or stopping it from growing) consists of avoiding splitting a partition if the split does not significantly improve the overall quality of the model.
Pruned Tree
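A minimal sketch of cost-complexity pruning with rpart, continuing the cart object above; selecting cp by minimum cross-validated error is one common rule, not necessarily the one used in the original deck:

printcp(cart)   # cross-validated error for each subtree size
best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
pruned  <- prune(cart, cp = best_cp)
rpart.plot(pruned)   # the pruned tree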
14. Random Forest
Random forest is a tree-based algorithm which involves building several decision trees, then combining their output to improve the generalization ability of the model. The method of combining trees is known as an ensemble method: ensembling combines weak learners (individual trees) to produce a strong learner.
A random forest model is created with default parameters, and the model is then fine-tuned by changing 'mtry'. We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry).
ntree: the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
mtry: the number of variables randomly sampled as candidates at each split. The default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
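A minimal sketch of the random forest fit and mtry tuning in R; ntree = 500 and the tuneRF settings are assumptions for illustration:

library(randomForest)
set.seed(123)
rf <- randomForest(Personal.Loan ~ ., data = train,
                   ntree = 500, importance = TRUE)   # default mtry = sqrt(p)

# Tune mtry against the out-of-bag error estimate
x <- train[, setdiff(names(train), "Personal.Loan")]
tuned <- tuneRF(x, train$Personal.Loan, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)
varImpPlot(rf)   # which variables drive the predictions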
15. Testing
After building the random forest on the training data, the next step is to apply the same model to the test data and check its validity, as sketched below.
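A minimal sketch of scoring the hold-out set (object names continue the assumptions above):

rf_pred <- predict(rf, newdata = test, type = "class")     # hard class labels
rf_prob <- predict(rf, newdata = test, type = "prob")[, 2] # P(loan accepted)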
16. Model Performance
The following performance measures are used to test the robustness of the model in the real world:
1. Confusion Matrix
2. ROC Curve, Gini Coefficient
3. Concordance-Discordance Ratio
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values, broken down by each class; a sketch follows.
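A minimal sketch using the caret package, assuming Personal.Loan is a factor and the positive class is labeled "1" (the label is an assumption):

library(caret)
confusionMatrix(rf_pred, test$Personal.Loan, positive = "1")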
ROC Curve
A ROC curve plots the performance of a binary classifier under various threshold
settings; this is measured by true positive rate and false positive rate. If your
classifier predicts “true” more often, it will have more true positives (good) but also
more false positives (bad). If your classifier is more conservative, predicting “true”
less often, it will have fewer false positives but fewer true positives as well. The ROC
curve is a graphical representation of this tradeoff.
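A minimal sketch using the ROCR package on the probabilities computed earlier; the Gini coefficient is derived from the AUC as Gini = 2 * AUC - 1:

library(ROCR)
pred <- prediction(rf_prob, test$Personal.Loan)
perf <- performance(pred, "tpr", "fpr")
plot(perf)                                      # the ROC curve
auc  <- performance(pred, "auc")@y.values[[1]]
gini <- 2 * auc - 1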
Concordance-Discordance ratio
Concordant and discordant pairs are used to describe the relationship between pairs of observations. To calculate the concordant and discordant pairs, the data are treated as ordinal, so ordinal data should be appropriate for your application. The numbers of concordant and discordant pairs are also used in calculating Kendall's tau, which measures the association between two ordinal variables.
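A minimal sketch of counting concordant and discordant pairs from the predicted probabilities; the brute-force pairing below is O(n1 * n2), fine for illustration but slow on large data:

# Compare every (event, non-event) pair of predicted probabilities
concordance <- function(prob, actual) {
  ones  <- prob[actual == 1]
  zeros <- prob[actual == 0]
  pairs <- expand.grid(one = ones, zero = zeros)
  c(concordant = mean(pairs$one > pairs$zero),
    discordant = mean(pairs$one < pairs$zero),
    tied       = mean(pairs$one == pairs$zero))
}
concordance(rf_prob, as.numeric(as.character(test$Personal.Loan)))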