- The goal of this project is to build a predictive model that helps determine default on the upcoming month's payment for existing credit cards.
This project was conducted on a research dataset related to customers' credit card default payments in Taiwan. The data were collected by the card-issuing banks in Taiwan.
- This dataset contains information on default payments, demographic factors, credit data, histories of payments, and bill statements of credit card clients from April 2005 to September 2005 (Lichman, 2013). The dataset contains 30,000 records and is available on Kaggle.
- This is a supervised classification problem: the intention is to predict whether a customer will default on the next payment based on the customer's historical and demographic information. Therefore, the response variable (or outcome variable) is "default payment."
- The ability to predict whether a customer is about to default would help banks and financial organizations take preemptive actions to mitigate this risk.
Predicting Credit Card Defaults with Bootstrap Forest Model
1. GROUP – 7
Dawei Ye
Hrushikesh Basavanahalli
Jobil Joseph
Ryan Curtis
Shu-Feng Tsao
Yijing Liang
Predicting Credit Card Defaults
OPIM 5640 – Predictive Modeling – Final Assignment
Data source: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA
2. Agenda
The Business Problem
Data: data source used
Modeling methodology adopted
Detailing major steps in the methodology
Models
WHY and HOW we chose this model
Conclusion
4. We took a dataset with 30,000 records of credit card borrowers, with details about their demographics, behavior, and payment patterns.
Dataset source: Kaggle.
The goal is to build a predictive model that predicts, with acceptable accuracy, whether a credit card user will default on the upcoming payment.
6. Modeling Methodology
Visualize data for initial diagnostics
Build baseline model (logistic regression model)
Data cleansing
Data pre-processing
Exploratory data analytics
Feature engineering / selection
Trying different models
Revise the model after each iteration, checking improvement in accuracy
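The baseline step above (a logistic regression scored by AUC) can be sketched with scikit-learn. This is a minimal sketch on synthetic data, assuming nothing about the project's actual JMP workflow or settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Synthetic stand-in for the predictors and the binary default response.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Baseline: plain logistic regression; later iterations revise the
# features and compare AUC on the ROC curve, as in the workflow above.
model = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
```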
7. Data Dictionary
The data contain a binary variable, default payment (Yes = 1, No = 0), as the response variable. There are 25 variables in the dataset in total, including the response variable.
ID: ID of each client.
LIMIT_BAL: Amount of the given credit (NT dollar); it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: Gender (1 = male; 2 = female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (years).
PAY_0 – PAY_6: History of past payment. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
BILL_AMT1 – BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September 2005; BILL_AMT2 = amount of bill statement in August 2005; . . .; BILL_AMT6 = amount of bill statement in April 2005.
PAY_AMT1 – PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September 2005; PAY_AMT2 = amount paid in August 2005; . . .; PAY_AMT6 = amount paid in April 2005.
8. Data Overview for Baseline Model
Data summary: 30,000 records in total.
We used a 70:30 training/validation split using stratified random sampling.
Training data: 21,000 records; 4,666 positive cases (default = 1); 16,334 negative cases (non-default = 0).
Validation data: 9,000 records; 1,970 positive cases (default = 1); 7,030 negative cases (non-default = 0).
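The 70:30 stratified split described above can be sketched with scikit-learn. The toy frame and the `default_payment` column name below are assumptions standing in for the real 30,000-record dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the Kaggle dataset (column names assumed).
df = pd.DataFrame({
    "LIMIT_BAL": range(100),
    "default_payment": [1 if i % 5 == 0 else 0 for i in range(100)],
})

# 70:30 split, stratified on the response so both partitions keep the
# same default/non-default ratio as the full data.
train, valid = train_test_split(
    df, test_size=0.30, stratify=df["default_payment"], random_state=7
)
```

Stratifying on the response is what keeps the default rate identical in both partitions, as in the counts quoted on the slide.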
10. Data Cleansing & Preprocessing
Type changes (initial type numerical continuous, changed to character nominal):
1. SEX
2. EDUCATION
3. MARRIAGE
4. PAY_0
5. PAY_2
6. PAY_3
7. PAY_4
8. PAY_5
9. PAY_6
Variable transformations (standardized to scale):
10. BILL_AMT1 to BILL_AMT6
11. PAY_AMT1 to PAY_AMT6
12. LIMIT_BAL
13. AGE
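The two preprocessing moves above (recasting codes as nominal, standardizing continuous variables) can be sketched in pandas. The miniature frame and column subset here are assumptions for illustration:

```python
import pandas as pd

# Small stand-in frame; the real column set matches the data dictionary.
df = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
    "AGE": [24.0, 26.0, 34.0, 37.0],
})

# Recast coded variables (SEX, EDUCATION, MARRIAGE, PAY_*) as nominal.
for col in ["SEX"]:
    df[col] = df[col].astype("category")

# Standardize continuous variables to zero mean and unit variance.
for col in ["LIMIT_BAL", "AGE"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```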
11. Benchmark Check - Post Preprocessing
AUC on ROC curve   Benchmark   Pre-processing   Gain on benchmark
Training           72.27%      77.06%           4.79%
Validation         72.70%      77.68%           4.98%
12. Data Visualization / Exploration
Default by Education Level
[Bar chart: defaults by education level - Imputed, Grad School, Bachelors, High School, Other]
14. Data Visualization / Exploration
Default by Marriage Status
[Bar chart: defaults by marital status - Married, Single, Other, Imputed]
15. Data Visualization / Exploration
Scatter plot across bill amount and pay amount
There is HIGH correlation between bill amounts (value of the monthly bill) across the SIX months.
However, there is LOW correlation between payment patterns across the SIX months.
16. Feature Engineering
Three categories of variables [pay_*, bill_amt*, pay_amt*] show behavioral patterns across the SIX months.
To extract the aggregated pattern across the SIX months, we derived FOUR new variables from the above three categories of variables.
Field Name: Description
AMT_OWED: Running (cumulative) sum of bill amount minus payment amount for each individual.
AVG_6MTH_OWED: Mean value of AMT_OWED over the 6-month period.
MISSED_PAYMENTS: Maximum number of missed payments recorded for the individual.
BALANCE_TO_LIMIT_RATIO: Average 6-month balance divided by the individual's credit limit; anything <= 0.3 is considered good.
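The four derived variables can be sketched in pandas as below. The exact formulas are not spelled out on the slide, so the total-over-six-months reading of AMT_OWED and the divide-by-six reading of AVG_6MTH_OWED are assumptions:

```python
import pandas as pd

# Two stand-in customers with the six monthly bill, payment, and
# repayment-status columns from the data dictionary.
data = {"LIMIT_BAL": [50000, 200000]}
for i in range(1, 7):
    data[f"BILL_AMT{i}"] = [3000 * i, 10000]
    data[f"PAY_AMT{i}"] = [1000, 10000]
for col in ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]:
    data[col] = [2, -1]
df = pd.DataFrame(data)

bill = df[[f"BILL_AMT{i}" for i in range(1, 7)]]
pay = df[[f"PAY_AMT{i}" for i in range(1, 7)]]
status = df[["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]]

# AMT_OWED: cumulative bills minus cumulative payments per individual.
df["AMT_OWED"] = bill.sum(axis=1) - pay.sum(axis=1)
# AVG_6MTH_OWED: mean owed amount over the 6-month window (assumed /6).
df["AVG_6MTH_OWED"] = df["AMT_OWED"] / 6
# MISSED_PAYMENTS: worst delay recorded across the six statuses
# (negative "pay duly" codes clipped to zero).
df["MISSED_PAYMENTS"] = status.clip(lower=0).max(axis=1)
# BALANCE_TO_LIMIT_RATIO: average 6-month balance over the credit limit
# (<= 0.3 read as "good" on the slide).
df["BALANCE_TO_LIMIT_RATIO"] = bill.mean(axis=1) / df["LIMIT_BAL"]
```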
19. Data Visualization - Clustering
To understand the underlying structure of the data, we did a cluster analysis using a hierarchical method.
SIX different cluster groups were identified. The cluster value was added as a NEW variable to the data.
In total, FIVE new variables were added to the dataset: the FOUR derived variables, which were standardized, and one variable representing the SIX clusters, type-cast to a character nominal variable.
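The hierarchical clustering step can be sketched with SciPy's Ward linkage. The synthetic blobs below stand in for the standardized credit features; the six-cluster cut mirrors the slide, though the linkage method actually used is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)

# Synthetic stand-in: six well-separated groups in four dimensions,
# playing the role of the standardized numeric credit variables.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4)) for c in range(6)])

# Hierarchical (Ward) clustering, cut into six groups; the resulting
# label joins the dataset as a new nominal variable.
Z = linkage(X, method="ward")
cluster = fcluster(Z, t=6, criterion="maxclust")
```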
20. Revised Benchmark
AUC on ROC curve   Benchmark   Pre-processing   Feature engineering   Gain on benchmark
Training           72.27%      77.06%           77.15%                4.85%
Validation         72.70%      77.68%           77.78%                4.99%
21. Dimensionality Reduction - PCA
The top TEN principal components, adding up to 96.32% of the variance in the data, were considered instead of the numerical variables.
Revised Benchmark
AUC on ROC curve   Benchmark   Pre-processing   Feature engineering   PCA      Gain on benchmark
Training           72.27%      77.06%           77.15%                77.12%   4.85%
Validation         72.70%      77.68%           77.78%                77.69%   4.99%
Dimensionality reduction using the PCA method decreased the AUC value from the previous step. Therefore, we decided NOT to use the principal components in the modeling.
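Extracting the top ten principal components can be sketched with scikit-learn. The random matrix below is only a stand-in, so its explained-variance share will not match the 96.32% reported for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the standardized numeric predictors (e.g. 20 columns).
X = rng.normal(size=(500, 20))

# Keep the top ten components and check how much variance they carry;
# those scores would replace the numeric variables in the model.
pca = PCA(n_components=10)
scores = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
```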
22. Trying Different Models
The following models were considered for analysis:
Stepwise Regression
Bootstrap Forest
Neural Networks
Evaluation criteria:
Each model was evaluated based on:
AUC under the ROC curve
Lift ratio
Misclassification rate
Accuracy of positive cases
Lift ratio, misclassification rate, and accuracy of positive cases were calculated at a probability cutoff of 0.5.
However, in some business contexts we may have to focus on other evaluation metrics, such as minimum misclassification rate or maximum sensitivity, which may lead to a different model.
To illustrate this, we considered an additional evaluation criterion: the probability cutoff with the minimum misclassification rate on the validation set.
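The cutoff-based criteria above can be sketched as a small helper. Reading "accuracy of positive cases" as sensitivity and "lift ratio" as precision among predicted defaults over the base default rate is an interpretation; the slide does not state the formulas:

```python
import numpy as np

def cutoff_metrics(y_true, p_hat, cutoff=0.5):
    # Evaluate a classifier at a probability cutoff: misclassification
    # rate, sensitivity ("accuracy of positive cases"), and lift,
    # taken here as precision over the base default rate.
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_hat) >= cutoff).astype(int)
    misclass = float(np.mean(y_pred != y_true))
    sensitivity = float(y_pred[y_true == 1].mean())
    precision = float(y_true[y_pred == 1].mean())
    lift = precision / float(y_true.mean())
    return misclass, sensitivity, lift

# Tiny worked example: 6 customers, 3 defaults.
y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]
m, s, lift = cutoff_metrics(y, p, cutoff=0.5)
```

Sweeping `cutoff` over a grid and keeping the value with the lowest misclassification rate on the validation set reproduces the second criterion on the slide.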
26. Model Evaluation

At probability cutoff 0.5:
Model                            AUC under ROC   Lift ratio   Misclassification rate   Accuracy of positive cases   Cutoff
Stepwise Regression  Training    77.40%          3.06         18.02%                   67.93%                       0.5
Stepwise Regression  Validation  78.01%          3.08         17.78%                   67.39%                       0.5
Bootstrap Forest     Training    85.20%          3.15         17.44%                   69.99%                       0.5
Bootstrap Forest     Validation  78.80%          3.11         17.53%                   68.08%                       0.5
Neural Networks      Training    79.30%          3.11         17.53%                   69.17%                       0.5
Neural Networks      Validation  78.27%          3.00         18.03%                   65.62%                       0.5

At the probability cutoff with minimum misclassification rate:
Model                            Lift ratio   Misclassification rate   Accuracy of positive cases   Cutoff
Stepwise Regression  Training    3.02         17.95%                   67.18%                       0.4
Stepwise Regression  Validation  2.96         17.71%                   64.83%                       0.4
Bootstrap Forest     Training    3.02         17.01%                   67.15%                       0.42
Bootstrap Forest     Validation  3.16         17.56%                   69.16%                       0.42
Neural Networks      Training    3.10         17.57%                   68.80%                       0.49
Neural Networks      Validation  3.15         17.51%                   68.87%                       0.49

Models Comparison
Bootstrap Forest seems to fare better across most evaluation metrics. The final model gave a gain of 6.1% in AUC under the ROC curve over the initial baseline model benchmark.
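JMP's Bootstrap Forest is a bagged-decision-tree (random forest style) method, so an open-source stand-in can be sketched with scikit-learn's RandomForestClassifier. The synthetic data and hyperparameters below are assumptions, not the project's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic data with the signal concentrated in the first feature,
# loosely mimicking PAY_0's dominant role in the real model.
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + rng.normal(size=2000) > 1.0).astype(int)

# Bagged ensemble of decision trees, scored by AUC as on the slides.
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
auc = roc_auc_score(y, forest.predict_proba(X)[:, 1])

# feature_importances_ plays the role of JMP's column contributions.
top = int(np.argmax(forest.feature_importances_))
```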
27. Model Evaluation
Model Analysis - Column Contribution
Analyzing the column contributions in the model, PAY_0, the most recent repayment status, is the most influential factor in this model.
28. Model Evaluation
Model Analysis - Column Contribution
Whenever PAY_0 has the value 2 (payment delayed for two months), the chance of correctly identifying default cases is higher.
For any other value of PAY_0, the chances of incorrect predictions are higher.
29. Conclusion
The Bootstrap Forest model seems to work better for this problem and context. However, the same problem with a different context or criterion may lead to a different model.
Extending the utility of this model beyond this dataset to the wider credit card industry:
“The model, with sufficient refinement and learning, should be able to predict default trends in the industry and help regulators formulate policies and take preemptive actions in the interest of both USERS and BANKS.”
To predict if a borrower would default or NOT
The team's goal is to predict whether a borrower would default on his/her credit card due or NOT.
This would help banks decide on the RISK the bank is taking on while issuing a credit card.
- Default does NOT necessarily mean bad for the bank IF the borrower recovers and pays up all necessary fees!
- But it is very important for the bank to assess the RISK they are carrying while approving revolving credit for the borrower.
Analogy: go to a Data Doctor to resolve the modeling problem, to check whether the borrower is fit enough to pay or NOT. Like a doctor diagnosing a patient and improving his health, so that we are confident he would be able to jump.
The Data Doctor accepted the data, then:
Visualized the data for initial diagnostics
Data cleansing
Data pre-processing
Exploratory data analysis
Feature engineering / selection
Trying different models
Evaluation of the model
Publishing the model