- The goal of this project is to build a predictive model that helps determine default on the upcoming month's payment for existing credit cards.
This project was conducted on a research dataset related to customers' credit card default payments in Taiwan. The data were collected by the card-issuing banks in Taiwan.
- This dataset contains information on default payments, demographic factors, credit data, histories of payments, and bill statements of credit card clients from April 2005 to September 2005 (Lichman, 2013). The dataset contains 30,000 records and is available on Kaggle.
- This is a supervised classification problem: the intention is to predict whether a customer will default on the next payment based on the customer's historical and demographic information. Therefore, the response variable (or outcome variable) is "default payment."
- The ability to predict whether a customer is about to default would help banks and financial organizations take preemptive actions to mitigate this risk.
Predicting Credit Card Defaults with Bootstrap Forest Model
1. GROUP – 7
Dawei Ye
Hrushikesh Basavanahalli
Jobil Joseph
Ryan Curtis
Shu-Feng Tsao
Yijing Liang
Predicting Credit Card Defaults
OPIM 5640 – Predictive Modeling – Final Assignment
Data source: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA
2. Agenda
The Business Problem
Data: data source used
Modeling methodology adopted
Detailing major steps in the methodology
Models
WHY and HOW we chose this model
Conclusion
4. We took a dataset with 30,000 records of credit card borrowers, with details about their demographics, behavior, and payment patterns.
Dataset source: Kaggle.
The goal is to build a predictive model that predicts, with acceptable accuracy, whether a credit card user will default on the upcoming payment.
6. Modeling Methodology
Visualize data for initial diagnostics
Build baseline model (logistic regression model)
Data cleansing
Data pre-processing
Exploratory data analytics
Feature engineering / selection
Trying different models
Revise the model after each iteration, checking improvement in accuracy
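The baseline step above (a logistic regression scored by AUC) can be sketched with scikit-learn. This is a minimal sketch on synthetic data, assuming nothing about the project's actual JMP workflow or settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Synthetic stand-in for the predictors and the binary default response.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Baseline: plain logistic regression; later iterations revise the
# features and compare AUC on the ROC curve, as in the workflow above.
model = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
```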
7. Data Dictionary
The data contain a binary variable, default payment (Yes = 1, No = 0), as the response variable. There are 25 variables in the dataset in total, including the response variable.
ID: ID of each client.
LIMIT_BAL: Amount of the given credit (NT dollar); it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (years).
PAY_0 – PAY_6: History of past payment. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 – BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September 2005; BILL_AMT2 = amount of bill statement in August 2005; . . .; BILL_AMT6 = amount of bill statement in April 2005.
PAY_AMT1 – PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September 2005; PAY_AMT2 = amount paid in August 2005; . . .; PAY_AMT6 = amount paid in April 2005.
8. Data Overview for Baseline Model
Data summary: 30,000 records in total.
We used a 70:30 training/validation split using stratified random sampling.
Training data: 21,000 records; 4,666 positive cases (default = 1); 16,334 negative cases (non-default = 0).
Validation data: 9,000 records; 1,970 positive cases (default = 1); 7,030 negative cases (non-default = 0).
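The 70:30 stratified split described above can be sketched with scikit-learn. The toy frame and the `default_payment` column name below are assumptions standing in for the real 30,000-record dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the Kaggle dataset (column names assumed).
df = pd.DataFrame({
    "LIMIT_BAL": range(100),
    "default_payment": [1 if i % 5 == 0 else 0 for i in range(100)],
})

# 70:30 split, stratified on the response so both partitions keep the
# same default/non-default ratio as the full data.
train, valid = train_test_split(
    df, test_size=0.30, stratify=df["default_payment"], random_state=7
)
```

Stratifying on the response is what keeps the default rate identical in both partitions, as in the counts quoted on the slide.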
10. Data Cleansing & Preprocessing
Type changes (initial type numerical continuous, changed to character nominal):
1. SEX
2. EDUCATION
3. MARRIAGE
4. PAY_0
5. PAY_2
6. PAY_3
7. PAY_4
8. PAY_5
9. PAY_6
Variable transformations (standardized to scale):
10. BILL_AMT1 to BILL_AMT6
11. PAY_AMT1 to PAY_AMT6
12. LIMIT_BAL
13. AGE
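The two preprocessing moves above (recasting codes as nominal, standardizing continuous variables) can be sketched in pandas. The miniature frame and column subset here are assumptions for illustration:

```python
import pandas as pd

# Small stand-in frame; the real column set matches the data dictionary.
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
    "AGE": [24.0, 26.0, 34.0, 37.0],
})

# Recast coded variables (SEX, EDUCATION, MARRIAGE, PAY_*) as nominal.
for col in ["SEX"]:
    df[col] = df[col].astype("category")

# Standardize continuous variables to zero mean and unit variance.
for col in ["LIMIT_BAL", "AGE"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```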
11. Benchmark Check - Post Preprocessing
AUC on ROC curve   Benchmark   Pre-processing   Gain on benchmark
Training           72.27%      77.06%           4.79%
Validation         72.70%      77.68%           4.98%
12. Data Visualization / Exploration
Default by Education Level
[Bar chart: defaults by education level - Imputed, Grad School, Bachelors, High School, Other]
14. Data Visualization / Exploration
Default by Marriage Status
[Bar chart: defaults by marital status - Married, Single, Other, Imputed]
15. Data Visualization / Exploration
Scatter plot across bill amount and pay amount
There is HIGH correlation between bill amounts (value of the monthly bill) across the SIX months.
However, there is LOW correlation between payment patterns across the SIX months.
16. Feature Engineering
Three categories of variables [pay_*, bill_amt*, pay_amt*] show behavioral patterns across the SIX months.
To extract the aggregated pattern across the SIX months, we derived FOUR new variables from the above three categories of variables.
Field Name: Description
AMT_OWED: Running (cumulative) sum of bill amount minus payment amount for each individual.
AVG_6MTH_OWED: Mean value of AMT_OWED over the 6-month period.
MISSED_PAYMENTS: Maximum number of missed payments recorded for the individual.
BALANCE_TO_LIMIT_RATIO: Average 6-month balance divided by the individual's credit limit; anything <= 0.3 is considered good.
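The four derived variables can be sketched in pandas as below. The exact formulas are not spelled out on the slide, so the total-over-six-months reading of AMT_OWED and the divide-by-six reading of AVG_6MTH_OWED are assumptions:

```python
import pandas as pd

# Two stand-in customers with the six monthly bill, payment, and
# repayment-status columns from the data dictionary.
data = {"LIMIT_BAL": [50000, 200000]}
for i in range(1, 7):
    data[f"BILL_AMT{i}"] = [3000 * i, 10000]
    data[f"PAY_AMT{i}"] = [1000, 10000]
for col in ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]:
    data[col] = [2, -1]
df = pd.DataFrame(data)

bill = df[[f"BILL_AMT{i}" for i in range(1, 7)]]
pay = df[[f"PAY_AMT{i}" for i in range(1, 7)]]
status = df[["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]]

# AMT_OWED: cumulative bills minus cumulative payments per individual.
df["AMT_OWED"] = bill.sum(axis=1) - pay.sum(axis=1)
# AVG_6MTH_OWED: mean owed amount over the 6-month window (assumed /6).
df["AVG_6MTH_OWED"] = df["AMT_OWED"] / 6
# MISSED_PAYMENTS: worst delay recorded across the six statuses
# (negative "pay duly" codes clipped to zero).
df["MISSED_PAYMENTS"] = status.clip(lower=0).max(axis=1)
# BALANCE_TO_LIMIT_RATIO: average 6-month balance over the credit limit
# (<= 0.3 read as "good" on the slide).
df["BALANCE_TO_LIMIT_RATIO"] = bill.mean(axis=1) / df["LIMIT_BAL"]
```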
19. Data Visualization - Clustering
To understand the underlying structure of the data, we did a cluster analysis using a hierarchical method.
SIX different cluster groups were identified. The cluster value was added as a NEW variable to the data.
In total, FIVE new variables were added to the dataset: the FOUR derived variables, which were standardized, and one variable representing the SIX clusters, type-cast to a character nominal variable.
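The hierarchical clustering step can be sketched with SciPy's Ward linkage. The synthetic blobs below stand in for the standardized credit features; the six-cluster cut mirrors the slide, though the linkage method actually used is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)

# Synthetic stand-in: six well-separated groups in four dimensions,
# playing the role of the standardized numeric credit variables.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4)) for c in range(6)])

# Hierarchical (Ward) clustering, cut into six groups; the resulting
# label joins the dataset as a new nominal variable.
Z = linkage(X, method="ward")
cluster = fcluster(Z, t=6, criterion="maxclust")
```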
20. Revised Benchmark
AUC on ROC curve   Benchmark   Pre-processing   Feature engineering   Gain on benchmark
Training           72.27%      77.06%           77.15%                4.85%
Validation         72.70%      77.68%           77.78%                4.99%
21. Dimensionality Reduction - PCA
The top TEN principal components, adding up to 96.32% of the variance in the data, were considered instead of the numerical variables.
Revised Benchmark
AUC on ROC curve   Benchmark   Pre-processing   Feature engineering   PCA      Gain on benchmark
Training           72.27%      77.06%           77.15%                77.12%   4.85%
Validation         72.70%      77.68%           77.78%                77.69%   4.99%
Dimensionality reduction using the PCA method decreased the AUC value from the previous step. Therefore, we decided NOT to use the principal components in the modeling.
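Extracting the top ten principal components can be sketched with scikit-learn. The random matrix below is only a stand-in, so its explained-variance share will not match the 96.32% reported for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the standardized numeric predictors (e.g. 20 columns).
X = rng.normal(size=(500, 20))

# Keep the top ten components and check how much variance they carry;
# those scores would replace the numeric variables in the model.
pca = PCA(n_components=10)
scores = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
```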
22. Trying Different Models
The following models were considered for analysis:
Stepwise Regression
Bootstrap Forest
Neural Networks
Evaluation criteria:
Each model was evaluated based on:
AUC under the ROC curve
Lift ratio
Misclassification rate
Accuracy of positive cases
Lift ratio, misclassification rate, and accuracy of positive cases were calculated at a probability cutoff of 0.5.
However, in some business contexts we may have to focus on other evaluation metrics, such as minimum misclassification rate or maximum sensitivity, which may lead to a different model.
To illustrate this, we considered an additional evaluation criterion: the probability cutoff with the minimum misclassification rate on the validation set.
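The cutoff-based criteria above can be sketched as a small helper. Reading "accuracy of positive cases" as sensitivity and "lift ratio" as precision among predicted defaults over the base default rate is an interpretation; the slide does not state the formulas:

```python
import numpy as np

def cutoff_metrics(y_true, p_hat, cutoff=0.5):
    # Evaluate a classifier at a probability cutoff: misclassification
    # rate, sensitivity ("accuracy of positive cases"), and lift,
    # taken here as precision over the base default rate.
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_hat) >= cutoff).astype(int)
    misclass = float(np.mean(y_pred != y_true))
    sensitivity = float(y_pred[y_true == 1].mean())
    precision = float(y_true[y_pred == 1].mean())
    lift = precision / float(y_true.mean())
    return misclass, sensitivity, lift

# Tiny worked example: 6 customers, 3 defaults.
y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]
m, s, lift = cutoff_metrics(y, p, cutoff=0.5)
```

Sweeping `cutoff` over a grid and keeping the value with the lowest misclassification rate on the validation set reproduces the second criterion on the slide.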
26. Model Evaluation

At probability cutoff 0.5:
Model                            AUC under ROC   Lift ratio   Misclassification rate   Accuracy of positive cases   Cutoff
Stepwise Regression  Training    77.40%          3.06         18.02%                   67.93%                       0.5
Stepwise Regression  Validation  78.01%          3.08         17.78%                   67.39%                       0.5
Bootstrap Forest     Training    85.20%          3.15         17.44%                   69.99%                       0.5
Bootstrap Forest     Validation  78.80%          3.11         17.53%                   68.08%                       0.5
Neural Networks      Training    79.30%          3.11         17.53%                   69.17%                       0.5
Neural Networks      Validation  78.27%          3.00         18.03%                   65.62%                       0.5

At the probability cutoff with minimum misclassification rate:
Model                            Lift ratio   Misclassification rate   Accuracy of positive cases   Cutoff
Stepwise Regression  Training    3.02         17.95%                   67.18%                       0.4
Stepwise Regression  Validation  2.96         17.71%                   64.83%                       0.4
Bootstrap Forest     Training    3.02         17.01%                   67.15%                       0.42
Bootstrap Forest     Validation  3.16         17.56%                   69.16%                       0.42
Neural Networks      Training    3.10         17.57%                   68.80%                       0.49
Neural Networks      Validation  3.15         17.51%                   68.87%                       0.49

Models Comparison
Bootstrap Forest seems to fare better across most evaluation metrics. The final model gave a gain of 6.1% in AUC under the ROC curve over the initial baseline model benchmark.
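JMP's Bootstrap Forest is a bagged-decision-tree (random forest style) method, so an open-source stand-in can be sketched with scikit-learn's RandomForestClassifier. The synthetic data and hyperparameters below are assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic data with the signal concentrated in the first feature,
# loosely mimicking PAY_0's dominant role in the real model.
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + rng.normal(size=2000) > 1.0).astype(int)

# Bagged ensemble of decision trees, scored by AUC as on the slides.
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
auc = roc_auc_score(y, forest.predict_proba(X)[:, 1])

# feature_importances_ plays the role of JMP's column contributions.
top = int(np.argmax(forest.feature_importances_))
```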
27. Model Evaluation
Model Analysis - Column Contribution
Analyzing the column contributions in the model, PAY_0, the most recent repayment status, is the most influential factor in this model.
28. Model Evaluation
Model Analysis - Column Contribution
Whenever PAY_0 has the value 2 (payment delayed for two months), the chance of correctly identifying default cases is higher.
For any other value of PAY_0, the chances of incorrect predictions are higher.
29. Conclusion
The Bootstrap Forest model seems to work better for this problem and context. However, the same problem with a different context or criterion may lead to a different model.
Extending the utility of this model beyond this dataset to the wider credit card industry:
“The model, with sufficient refinement and learning, should be able to predict default trends in the industry and help regulators formulate policies and take preemptive actions in the interest of both USERS and BANKS.”
To predict if a borrower would default or NOT
The team's goal is to predict whether a borrower would default on his/her credit card due or NOT.
This would help banks decide on the RISK the bank is taking on while issuing a credit card.
- Default does NOT necessarily mean bad for the bank IF the borrower recovers and pays up all necessary fees!
- But it is very important for the bank to assess the RISK they are carrying while approving revolving credit for the borrower.
Analogy: go to a Data Doctor to resolve the modeling problem, to check whether the borrower is fit enough to pay or NOT. Like a doctor diagnosing a patient and improving his health, so that we are confident he would be able to jump.
The Data Doctor accepted the data, then:
Visualized the data for initial diagnostics
Data cleansing
Data pre-processing
Exploratory data analysis
Feature engineering / selection
Trying different models
Evaluation of the model
Publishing the model