Name: Kunal Kashyap
College: Indian Institute of Management Kashipur
Case:
Round 3: Grand Finale
Personal Loan Risk Assessment on
Two-Wheeler Loan Customer Base
Business Problem Snapshot
Business Problem
Approach taken
Objectives
• To identify the segment of customers who have a higher
tendency to default if they are offered a Personal Loan (PL)
• To leverage the existing Two-Wheeler Loan (TW) customer
base to cross-sell the Personal Loan product
• To develop a prediction model that classifies the customer
base into Risky (to be rejected) and Non-Risky (to be
considered for a PL offer) categories
Problem
Statement
Credit Process Flow
Analyzing data
Modelling
Cost-Benefit Analysis
Available data: Payment History of 1.2 Lakh Customers
Features: Live loans, Closed loans, Enquiries, Gender, Age, Interest rate, Tenure, EMI, MOB, First EMI Bounce, Total down payment, Total Loan amount, Two-Wheeler loans, Employment type, Number of times defaulted, Cost of Asset, Bounces with TVS Credit, Bounces in last 3 months
Prediction Model will help in the classification
Recommendation
& Deployment
Methodology Used | Research Insights
Key Highlight: Team Data Science Process (TDSP) methodology has been used for solving this case
Approach (TDSP methodology): Start -> Business Understanding -> Data Acquisition & Understanding -> Modelling -> Deployment -> End
Research Insights
• Small-ticket personal loans (STPL) are personal loans with a ticket size of less than Rs 50,000
• STPL market: 12,000 Cr as of Aug 2020 | Half of it is loans below Rs 5,000
• Target group (TG): young, low-income, digitally savvy customers with small-ticket, short-term credit needs and little or no credit history
• Demand driver: millennials and young borrowers in the age group 18-30 years
• 140% growth in FY 2019 | Driven by the STPL segment
• End-use: home renovation, wedding, higher education, travel costs, or a medical emergency
• Alternative data: digital footprint of customers, such as social media profile and mobile bill, and social scoring by psychometric analysis of digital footprints
Sources: Microsoft TDSP methodology | Paisabazaar | BCG report | Financial Express
Data Wrangling, Exploration & Cleaning
Data Source: The data consists of the past loan history of 119,529 customers; it has 30 features from various sources
Step 1: 119,486 customers remain after removing rows with incomplete data
Subsequent steps:
• Four features (V21, V22, V28, and V29) were removed due to missing values or very little data
• A new feature named 'Age' was created from V18, and V18 was removed
• Features V1, V11, V13, and V17 were not used for modelling
• One-hot encoding was applied to features V15 (Gender) and V16 (Employment)
• Transformation: data was normalized using the Min-Max method
• Random oversampling of the minority class and random undersampling of the majority class were performed because the dataset is imbalanced
• The dataset was split in equal proportion for training and testing
Key Highlight: An ensemble algorithm is to be used in future work to achieve higher accuracy and enhanced business opportunities
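The transformation steps on this slide can be illustrated with a minimal, standard-library sketch. It is not the author's code: the helper names, the toy values and the reference date used to derive Age are all assumptions made for illustration.

```python
from datetime import date


def min_max_normalize(values):
    """Scale a list of numbers to the [0, 1] range (Min-Max method)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def age_in_years(dob, as_of):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


# Toy records; the values are invented for illustration only.
emi = [1500, 2500, 3500]                 # stands in for V8 (EMI)
employment = ["SAL", "SELF", "SAL"]      # stands in for V16 (Employment)
dob = [date(1990, 6, 15), date(1985, 1, 1), date(2000, 12, 31)]  # V18
reference_date = date(2020, 12, 1)       # assumed cut-off date

emi_scaled = min_max_normalize(emi)
categories, employment_encoded = one_hot(employment)
ages = [age_in_years(d, reference_date) for d in dob]
```

In practice the minimum and maximum used for scaling would usually be taken from the training portion only and then reused for new customers; the slide does not say which was done here.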
Modeling Architecture (flow diagram)
• Data: Given dataset -> Loaded dataset -> Modified dataset (random oversampling of minority class, random undersampling of majority class)
• Split: Training set and Test set
• Model: Random Forest Model
• Classification: Good Customer (Non-default) or Bad Customer (Default)
• Evaluation metrics: Training set, Test set, Overall dataset
• Other models: Logistic regression, Deep Neural Network, SMOTE using KNN for minority class generation + Random Forest Model
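The resampling and splitting stage of the architecture could look roughly like the sketch below. It uses only the standard library; the function names, the target class sizes and the toy data are assumptions, since the deck does not state the resampling ratios, and it assumes resampling happens before the split.

```python
import random


def resample(rows, label_of, n_minority, n_majority, seed=0):
    """Randomly oversample class 1 (with replacement) and undersample
    class 0 (without replacement) to the requested sizes."""
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == 1]
    majority = [r for r in rows if label_of(r) == 0]
    oversampled = [rng.choice(minority) for _ in range(n_minority)]
    undersampled = rng.sample(majority, n_majority)
    combined = oversampled + undersampled
    rng.shuffle(combined)
    return combined


def split_in_half(rows):
    """Split rows into two equal parts (training, test)."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


# Toy data: 95 good customers (label 0) and 5 bad customers (label 1).
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

# Target sizes are illustrative assumptions, not the deck's values.
balanced = resample(rows, label_of=lambda r: r[1],
                    n_minority=20, n_majority=40)
train, test = split_in_half(balanced)
```

Because oversampling duplicates rows, copies of the same customer can end up in both halves when the split follows resampling; the deck does not say how this was handled.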
Architecture
Classification Model
Evaluation Metrics
Modelling: Random Forest
Feature | Description | Importance | Cumulative score
V27 | Number of times defaulted in last 12 months | 0.128 | 0.128
V26 | Number of times defaulted in last 6 months | 0.103 | 0.232
Age | Age of customers | 0.083 | 0.314
V25 | Number of times defaulted in last 3 months | 0.076 | 0.390
V7 | Total down payment of existing loan | 0.068 | 0.457
V8 | EMI of existing loan | 0.066 | 0.523
V6 | Cost of Asset (existing loan) | 0.064 | 0.588
V9 | Total Loan amount of existing loan | 0.061 | 0.649
V23 | Number of closed loans | 0.059 | 0.708
V14 | Rate of interest for existing loan | 0.054 | 0.761
V4 | MOB (Month of business with TVS Credit) | 0.051 | 0.813
Since our objective is to segregate the customers into two
categories, we use a Classification Model.
Random Forest Classifier
This method is an ensemble technique that classifies by
constructing a multitude of decision trees on the training set (we trained
the model with 1000 trees and reached 99.9% accuracy on the training set).
The table lists the top 11 variables by importance in
building the model.
From the Random Forest model, we identified the
parameters that contribute most to separating risky
and non-risky customers. The Importance column in the table
shows the significance of each parameter: the higher the value,
the higher the impact.
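The Cumulative score column is a running total of the Importance column. A small sketch, using the rounded importances from the table (the variable names are illustrative), shows how it can be recomputed:

```python
from itertools import accumulate

# (feature, importance) pairs copied from the table, already sorted
# from most to least important.
importances = [
    ("V27", 0.128), ("V26", 0.103), ("Age", 0.083), ("V25", 0.076),
    ("V7", 0.068), ("V8", 0.066), ("V6", 0.064), ("V9", 0.061),
    ("V23", 0.059), ("V14", 0.054), ("V4", 0.051),
]

# Running total of the importance scores.
cumulative = list(accumulate(score for _, score in importances))

for (name, score), running in zip(importances, cumulative):
    print(f"{name:>4}  {score:.3f}  {running:.3f}")
```

Some rows can differ from the table in the last digit, presumably because the table accumulated unrounded importances. The top 11 features together account for about 0.813 of the total importance.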
Classification Model
Output Snapshot
Note: Python code files and API files are attached on the Annexure slide
Evaluation Metrics: Confusion matrix provides a performance summary of the classifier
Evaluation metric on Training set
Accuracy | Sensitivity | Precision | Specificity | F1 Score | MCC
99.94% | 100% | 99.80% | 99.91% | 99.91% | 99.87%
True Negative (TN) | True Positive (TP) | False Positive (FP) | False Negative (FN)
34942 | 17618 | 31 | 0
Evaluation metric on Test set
Accuracy | Sensitivity | Precision | Specificity | F1 Score | MCC
98.75% | 99.82% | 96.53% | 98.22% | 98.15% | 97.24%
True Negative (TN) | True Positive (TP) | False Positive (FP) | False Negative (FN)
34524 | 17412 | 625 | 31
Training set:
• Accuracy: 99.94% of customers were correctly labelled by the Model
• Precision: of all the customers predicted to default on loan payment, 99.80% actually defaulted
• Sensitivity: the Model correctly identified 100% of the customers who defaulted on loan payment
• Specificity: 99.91% of non-defaulters were correctly labelled by the Model
Test set:
• Accuracy: 98.75% of customers were correctly labelled by the Model
• Precision: of all the customers predicted to default on loan payment, 96.53% actually defaulted
• Sensitivity: the Model correctly identified 99.82% of the customers who defaulted on loan payment
• Specificity: 98.22% of non-defaulters were correctly labelled by the Model
Notes: MCC – Matthews Correlation Coefficient
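All six metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions and the test-set counts reported on this slide; the function name is illustrative and this is not the author's code.

```python
import math


def classification_metrics(tn, tp, fp, fn):
    """Derive the slide's six metrics from confusion-matrix counts."""
    total = tn + tp + fp + fn
    sensitivity = tp / (tp + fn)   # share of defaulters caught
    precision = tp / (tp + fp)     # share of predicted defaulters that defaulted
    specificity = tn / (tn + fp)   # share of non-defaulters kept
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Test-set counts from the slide.
test_metrics = classification_metrics(tn=34524, tp=17412, fp=625, fn=31)
for name, value in test_metrics.items():
    print(f"{name}: {value:.2%}")
```

MCC is included alongside accuracy because it uses all four cells of the confusion matrix, which makes it more informative than accuracy when the classes are imbalanced.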
Business Metrics
Note:
V6: Cost of Asset (existing loan)
V7: Total down payment of existing loan
V8: EMI of existing loan
V10: Tenure of existing loan
Evaluation metric on Original full dataset
Accuracy | Sensitivity | Precision | Specificity | F1 Score | MCC
98.80% | 99.89% | 64.69% | 98.78% | 78.53% | 79.89%
True Negative (TN) | True Positive (TP) | False Positive (FP) | False Negative (FN)
115446 | 2611 | 1425 | 3
• Accuracy: 98.80% of customers were correctly labelled by the Model
• Precision: of all the customers predicted to default on loan payment, 64.69% actually defaulted
• Sensitivity: the Model correctly identified 99.89% of the customers who defaulted on loan payment
• Specificity: 98.78% of non-defaulters were correctly labelled by the Model
Business metrics
Particulars | Business Value
Avg. loan amount (V9) | 39322
No. of defaults (V30) | 2614
Total loss (without model) | 102787708
Avg. loan amount | 39322
No. of defaults (model_FN) | 3
Total loss with defaults_model | 117966
Opportunity loss (# customers)_FP | 1425
Value lost (V10*V8 + V7 - V6) | 258
Opportunity loss with model | 367650
Total loss with model | 485616
Loss saved with modelling | 102302092
Percentage of loss saved | 99.53%
Particulars | Without Model | With Proposed Model
Total Profit | 30152718 | 29785068
Total Loss | 102787708 | 485616
Net Profit | -72634990 | 29299452
With the proposed model, net profit moves from approx. -7 crore to +3 crore.
The RF model saves around 99.5% of the losses.
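The cost-benefit figures can be reproduced from the confusion-matrix counts and the two averages. The sketch below is an illustration with made-up variable names. It treats Total Profit without the model as (number of non-defaulters) x (value per customer), which is an inference that happens to match the slide's figure exactly, not a formula the slide states.

```python
# Inputs taken from the business metrics table.
avg_loan_amount = 39322          # Avg. loan amount (V9)
defaults_total = 2614            # No. of defaults (V30)
false_negatives = 3              # defaulters the model misses
false_positives = 1425           # good customers the model rejects
value_per_customer = 258         # V10*V8 + V7 - V6, averaged
non_defaulters = 115446 + 1425   # TN + FP on the full dataset

# Losses.
loss_without_model = avg_loan_amount * defaults_total
loss_from_missed_defaults = avg_loan_amount * false_negatives
opportunity_loss = false_positives * value_per_customer
loss_with_model = loss_from_missed_defaults + opportunity_loss
loss_saved = loss_without_model - loss_with_model
pct_loss_saved = loss_saved / loss_without_model

# Profits (inferred relationship, see the note above).
profit_without_model = non_defaulters * value_per_customer
profit_with_model = profit_without_model - opportunity_loss
net_without_model = profit_without_model - loss_without_model
net_with_model = profit_with_model - loss_with_model

print(f"Loss saved: {loss_saved} ({pct_loss_saved:.2%})")
print(f"Net profit: {net_without_model} -> {net_with_model}")
```

Keeping the inputs as named variables makes it easy to rerun the comparison with a different average loan amount or a different confusion matrix.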
Deployment
Recommendations
• It is recommended to use an
analytical model like the
proposed one to reduce losses for
this initiative
• Alternative data, i.e. the digital
footprint of customers such as
social media profile and social scoring
by psychometric analysis of
digital footprints, should be used
Call POST: The created API is called using POST. It displays an HTML page for entering the feature inputs; after execution, the console shows the output of the model.
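A deployment of this kind might be wired up as in the sketch below, which uses only Python's standard library. It is not the attached API code: the form fields, the handler and function names, the port, and especially `placeholder_predict` (a hypothetical stand-in rule, not the trained Random Forest) are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# Minimal input page; only two of the model's features are shown.
FORM_HTML = b"""<form method="post">
V27: <input name="V27"> Age: <input name="Age">
<input type="submit" value="Predict">
</form>"""


def placeholder_predict(features):
    """Hypothetical stand-in for the trained model (NOT the Random Forest)."""
    return "Risky" if features.get("V27", 0.0) > 0 else "Non-Risky"


def score_form(body):
    """Parse a URL-encoded form body and return the predicted label."""
    fields = parse_qs(body.decode("utf-8"))
    features = {name: float(values[0]) for name, values in fields.items()}
    return placeholder_predict(features)


class PredictHandler(BaseHTTPRequestHandler):
    def _send(self, payload, content_type):
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        # Serve the HTML page used to enter the feature values.
        self._send(FORM_HTML, "text/html; charset=utf-8")

    def do_POST(self):
        # Read the submitted form, score it, and log the result.
        length = int(self.headers.get("Content-Length", 0))
        label = score_form(self.rfile.read(length))
        print("Prediction:", label)  # shown on the server console
        self._send(label.encode("utf-8"), "text/plain; charset=utf-8")


def serve(port=8000):
    """Start the server on localhost; call this explicitly to run it."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` and opening `http://127.0.0.1:8000` would show the form; in a real deployment `placeholder_predict` would be replaced by a call to the saved model with the full feature set.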
THANK YOU
“It always seems impossible until it’s done.”
- Nelson Mandela
Annexures
Feature Feature Definition
V1 Customer's ID
V2 First EMI Bounce (0 : No, 1: Yes) (existing loan)
V3 Number of bounces in last 3 months Outside TVS Credit
V4 MOB (Month of business with TVS Credit)
V5 Number of bounces with TVS Credit
V6 Cost of Asset (existing loan)
V7 Total down payment of existing loan
V8 EMI of existing loan
V9 Total Loan amount of existing loan
V10 Tenure of existing loan
V11 Customer's Geographical Area Code
V12 Customer's TW Dealer's Code
V13 Customer's TW Model’s Code
V14 Rate of interest for existing loan
V15 Gender
V16 Employment type of customer (SAL: Salaried, SELF: Self-employed, HOUSEWIFE, PENS: Pensioner, STUDENT)
V17 Pin code
V18 Date of Birth
V19 Number of Live loans
V20 Number of Two-Wheeler loans
V21 Maximum sanction amount of Live Loans
V22 Number of new loans taken in last 3 months
V23 Number of closed loans
V24 Number of enquiries
V25 Number of times defaulted in last 3 months
V26 Number of times defaulted in last 6 months
V27 Number of times defaulted in last 12 months
V28 Maximum loan amount sanctioned for any Gold loan
V29 Maximum loan amount sanctioned for any personal loan
V30 Target variable ( 1: Bad Customer / 0 : Good Customer )
Assumptions:
• The complete EMI duration has been
considered, irrespective of the point at which a
customer defaults, due to lack
of information
• This conservative approach should be
offset by the depreciation of assets
• Avg. loan amount and avg. tenure are
considered for the calculation
Data Dictionary
Python Code
