SlideShare a Scribd company logo
1 of 24
Credit Card Default Risk Analysis
A Case Study of 3 Classification Algorithms
Teresa Nan
June 2020
Project mentor: Tommy Blanchard
1
Problems to Resolve
Problem Statement
● ML applications focused on
credit score predicting.
● Relying on credit scores and
credit history.
● Miss valuable customers with no
credit history. I.e. immigrants.
● Regulatory constraints on
banking industry forbids some
ML algorithms.
2
Purpose of Project
● Conduct quantitative analysis
on credit default risk by
applying three interpretable
machine learning models
without utilizing credit score or
credit history.
Who Should Care?
Credit Card Companies Commercial Banks
* Image source: Google image 3
Approach Overview
Data Cleaning
Understand and Clean
● Find information on
undocumented
columns values
● Clean data to get it
ready for analysis
Data Exploration
Graphical and Statistical
● Exam data with
visualization
● Verify findings with
statistical tests
Predictive Modeling
Machine Learning
● Logistic Regression
● Random Forest
● XGBoost
4
Data Acquisition
Real Credit Card Usage Data
5
Why This Dataset?
● Real credit card data
● Comprehensive and complete
● 30,000 customers
● Usage of 6 months
● Age from 20-79
● Demographic factors
● No credit score or credit history
*Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Dataset
● Default Payments of Credit
Card Clients in Taiwan from
2005
● Source: Public dataset from
Kaggle.
● Original Source: UCI Machine
Learning Repository*
Part 1
Exploratory Data
Analysis
What demographic factors
impact payment default risk?
6
Gender Variable
30% of males and
26% of females
have payment
default.
7
Education Variable
Higher education
level, lower default
risk.
8
“Others” only consists 1.56% of total customers even if they appear to have the least default.
Age Variable
30-50:
Lowest risk
< 30 or >50:
Risk increases
9
Marital Status Variable
No significant
correlations of
default risk and
marital status
10
Credit Limit Variable
Higher credit limits,
lower default risk.
11
EDA Summary
● Demographic factors that impact default risk are:
○ Education: Higher education is associated with lower default risk.
○ Age: Customers aged 30-50 have the lowest default risk.
○ Sex: Females have lower default risk than males in this dataset.
○ Credit limit: Higher credit limit is associated with lower default risk.
12
Part 2
Predictive Modeling
What precision and recall
scores can the models
achieve?
13
Modeling Overview
14
Define Problem: Supervised learning / binary classification
78% non-default vs. 22% default
Imbalanced Classes:
Tools Used: Scikit learn library and imblearn
Models Applied: Logistic Regression / Random Forest / XGBoost
Modeling Steps
Data Preprocessing
● Feature selection
● Feature engineering
● Train-test data splitting
(70%/30%)
● Training data rescaling
● SMOTE oversampling
Fitting and Tuning
● Start with default model
parameters
● Hyperparameters tuning
● Measure ROC_AUC on
training data
Model Evaluation
● Models testing
● Precision_Recall score
● Compare with sklearn
dummy classifier
● Compare within the 3
models
15
Correct Imbalanced Classes
16
Models AUC Without SMOTE AUC With SMOTE
Logistic Regression 0.726 0.797
Random Forest 0.764 0.916
XGBoost 0.762 0.899
● Fit every model without and with SMOTE oversampling for comparison.
● Training AUC scores improved significantly with SMOTE.
Hyperparameters Tuning
17
● K-Fold Cross Validation to get average performance on the folds.
● Randomized Search on Logistic Regression since C has large search space.
● Grid Search on Random Forest on limited parameters combinations.
● Randomized Search on XGBoost because multiple hyperparameters to tune.
Model Comparisons
18
Models Precision Recall F1 Score Conclusion
Dummy Model 0.217 0.500 0.303 Benchmark
Logistic Regression 0.384 0.566 0.457 Best recall
Random Forest 0.513 0.514 0.514 Best F1
XGBoost 0.444 0.505 0.474
● Compare the models to Scikit-learn’s dummy classifier.
● All models performed better than dummy model.
Model Comparisons
19
● Compare within 3 models.
● Random Forest (black line) has the best precision_recall score.
★ Recall: how many 1s are
being identified?
★ Precision: Among all the
1s that are flagged, how
many are truly 1s?
★ Precision and recall
trade-off: high recall will
cause low precision
Terminology:
Model Usage - Recommendation
20
● I.e. recall = 0.8. Threshold can be adjusted to reach higher recall.
Feature Importances
21
Best model Random Forest
feature importances plot.
★ PAY_1: most recent
month’s payment
status.
★ PAY_2: the month prior
to current month’s
payment status.
★ BILL_AMT1: most
recent month’s bill
amount.
★ LIMIT_BAL: credit limit
Limitations & Future Work
22
Limitations
● Best model Random Forest can only
detect 51% of default.
● Model can only be served as an aid in
decision making instead of replacing
human decision.
● Used only 30,000 records and not
from US consumers.
Future Work
● Models are not exhaustive. Other
models could perform better.
● Get more computational resources to
tune XGBoost parameters.
● Acquire US customer data and more
useful features.I.e.customer income.
Conclusions
23
● Recent 2 payment status and credit limit are the strongest default
predictors.
● Dormant customers can also have default risk.
● Random Forest has the best precision and recall balance.
● Higher recall can be achieved if low precision is acceptable.
● Model can be served as an aid to human decision.
● Suggest output probabilities rather than predictions.
● Model can be improved with more data and computational resources.
Thank you!
Teresa Nan
Email: fenglin.nan@gmail.com
LinkedIn: https://www.linkedin.com/in/teresa-n-39287042/
Github: https://github.com/teresanan/capstone_project_credit_card_default/tree/master
Project report: https://github.com/teresanan/capstone_project_credit_card_default/blob/master/Final_Report.pdf
24

More Related Content

Similar to Credit Card Default Risk

Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality ReductionSaad Elbeleidy
 
Ledger Alchemy 255 Data mining.pdf
Ledger Alchemy 255 Data mining.pdfLedger Alchemy 255 Data mining.pdf
Ledger Alchemy 255 Data mining.pdfpatiladiti752
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDatabricks
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Vatsal N Shah
 
How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversionsSudeep Shukla
 
Combining Linear and Non Linear Modeling Techniques
Combining Linear and Non Linear Modeling Techniques Combining Linear and Non Linear Modeling Techniques
Combining Linear and Non Linear Modeling Techniques Salford Systems
 
Zeotap: Data Modeling in Druid for Non temporal and Nested Data
Zeotap: Data Modeling in Druid for Non temporal and Nested DataZeotap: Data Modeling in Druid for Non temporal and Nested Data
Zeotap: Data Modeling in Druid for Non temporal and Nested DataImply
 
Jay Budzik, Ai4 Finance, Aug 21, 2019
Jay Budzik, Ai4 Finance, Aug 21, 2019Jay Budzik, Ai4 Finance, Aug 21, 2019
Jay Budzik, Ai4 Finance, Aug 21, 2019Bruce Upbin
 
Better Living Through Analytics - Louis Cialdella Product School
Better Living Through Analytics - Louis Cialdella Product SchoolBetter Living Through Analytics - Louis Cialdella Product School
Better Living Through Analytics - Louis Cialdella Product SchoolLouis Cialdella
 
NPTL - Machine Learning by Madhur Jatiya.pdf
NPTL - Machine Learning by Madhur Jatiya.pdfNPTL - Machine Learning by Madhur Jatiya.pdf
NPTL - Machine Learning by Madhur Jatiya.pdfMr. Moms
 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsBigML, Inc
 
achine Learning and Model Risk
achine Learning and Model Riskachine Learning and Model Risk
achine Learning and Model RiskQuantUniversity
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptopRising Media, Inc.
 
Credit Card Marketing Classification Trees Fr.docx
 Credit Card Marketing Classification Trees Fr.docx Credit Card Marketing Classification Trees Fr.docx
Credit Card Marketing Classification Trees Fr.docxShiraPrater50
 
Modeling for the Non-Statistician
Modeling for the Non-StatisticianModeling for the Non-Statistician
Modeling for the Non-StatisticianAndrew Curtis
 
When the AIs failures send us back to our own societal biases
When the AIs failures send us back to our own societal biasesWhen the AIs failures send us back to our own societal biases
When the AIs failures send us back to our own societal biasesClément DUFFAU
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - ReportAkanksha Gohil
 

Similar to Credit Card Default Risk (20)

Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Ledger Alchemy 255 Data mining.pdf
Ledger Alchemy 255 Data mining.pdfLedger Alchemy 255 Data mining.pdf
Ledger Alchemy 255 Data mining.pdf
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle Grove
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
Credit iconip
Credit iconipCredit iconip
Credit iconip
 
How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversions
 
Combining Linear and Non Linear Modeling Techniques
Combining Linear and Non Linear Modeling Techniques Combining Linear and Non Linear Modeling Techniques
Combining Linear and Non Linear Modeling Techniques
 
Zeotap: Data Modeling in Druid for Non temporal and Nested Data
Zeotap: Data Modeling in Druid for Non temporal and Nested DataZeotap: Data Modeling in Druid for Non temporal and Nested Data
Zeotap: Data Modeling in Druid for Non temporal and Nested Data
 
Jay Budzik, Ai4 Finance, Aug 21, 2019
Jay Budzik, Ai4 Finance, Aug 21, 2019Jay Budzik, Ai4 Finance, Aug 21, 2019
Jay Budzik, Ai4 Finance, Aug 21, 2019
 
Better Living Through Analytics - Louis Cialdella Product School
Better Living Through Analytics - Louis Cialdella Product SchoolBetter Living Through Analytics - Louis Cialdella Product School
Better Living Through Analytics - Louis Cialdella Product School
 
NPTL - Machine Learning by Madhur Jatiya.pdf
NPTL - Machine Learning by Madhur Jatiya.pdfNPTL - Machine Learning by Madhur Jatiya.pdf
NPTL - Machine Learning by Madhur Jatiya.pdf
 
Data science guide
Data science guideData science guide
Data science guide
 
VSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 SessionsVSSML17 Review. Summary Day 1 Sessions
VSSML17 Review. Summary Day 1 Sessions
 
achine Learning and Model Risk
achine Learning and Model Riskachine Learning and Model Risk
achine Learning and Model Risk
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
Credit Card Marketing Classification Trees Fr.docx
 Credit Card Marketing Classification Trees Fr.docx Credit Card Marketing Classification Trees Fr.docx
Credit Card Marketing Classification Trees Fr.docx
 
Modeling for the Non-Statistician
Modeling for the Non-StatisticianModeling for the Non-Statistician
Modeling for the Non-Statistician
 
When the AIs failures send us back to our own societal biases
When the AIs failures send us back to our own societal biasesWhen the AIs failures send us back to our own societal biases
When the AIs failures send us back to our own societal biases
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 

Recently uploaded

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 

Credit Card Default Risk

  • 1. Credit Card Default Risk Analysis A Case Study of 3 Classification Algorithms Teresa Nan June 2020 Project mentor: Tommy Blanchard 1
  • 2. Problems to Resolve Problem Statement ● ML applications focused on credit score predicting. ● Relying on credit scores and credit history. ● Miss valuable customers with no credit history. I.e. immigrants. ● Regulatory constraints on banking industry forbids some ML algorithms. 2 Purpose of Project ● Conduct quantitative analysis on credit default risk by applying three interpretable machine learning models without utilizing credit score or credit history.
  • 3. Who Should Care? Credit Card Companies Commercial Banks * Image source: Google image 3
  • 4. Approach Overview Data Cleaning Understand and Clean ● Find information on undocumented columns values ● Clean data to get it ready for analysis Data Exploration Graphical and Statistical ● Exam data with visualization ● Verify findings with statistical tests Predictive Modeling Machine Learning ● Logistic Regression ● Random Forest ● XGBoost 4
  • 5. Data Acquisition Real Credit Card Usage Data 5 Why This Dataset? ● Real credit card data ● Comprehensive and complete ● 30,000 customers ● Usage of 6 months ● Age from 20-79 ● Demographic factors ● No credit score or credit history *Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Dataset ● Default Payments of Credit Card Clients in Taiwan from 2005 ● Source: Public dataset from Kaggle. ● Original Source: UCI Machine Learning Repository*
  • 6. Part 1 Exploratory Data Analysis What demographic factors impact payment default risk? 6
  • 7. Gender Variable 30% of males and 26% of females have payment default. 7
  • 8. Education Variable Higher education level, lower default risk. 8 “Others” only consists 1.56% of total customers even if they appear to have the least default.
  • 9. Age Variable 30-50: Lowest risk < 30 or >50: Risk increases 9
  • 10. Marital Status Variable No significant correlations of default risk and marital status 10
  • 11. Credit Limit Variable Higher credit limits, lower default risk. 11
  • 12. EDA Summary ● Demographic factors that impact default risk are: ○ Education: Higher education is associated with lower default risk. ○ Age: Customers aged 30-50 have the lowest default risk. ○ Sex: Females have lower default risk than males in this dataset. ○ Credit limit: Higher credit limit is associated with lower default risk. 12
  • 13. Part 2 Predictive Modeling What precision and recall scores can the models achieve? 13
  • 14. Modeling Overview 14 Define Problem: Supervised learning / binary classification 78% non-default vs. 22% default Imbalanced Classes: Tools Used: Scikit learn library and imblearn Models Applied: Logistic Regression / Random Forest / XGBoost
  • 15. Modeling Steps Data Preprocessing ● Feature selection ● Feature engineering ● Train-test data splitting (70%/30%) ● Training data rescaling ● SMOTE oversampling Fitting and Tuning ● Start with default model parameters ● Hyperparameters tuning ● Measure ROC_AUC on training data Model Evaluation ● Models testing ● Precision_Recall score ● Compare with sklearn dummy classifier ● Compare within the 3 models 15
  • 16. Correct Imbalanced Classes 16 Models AUC Without SMOTE AUC With SMOTE Logistic Regression 0.726 0.797 Random Forest 0.764 0.916 XGBoost 0.762 0.899 ● Fit every model without and with SMOTE oversampling for comparison. ● Training AUC scores improved significantly with SMOTE.
  • 17. Hyperparameters Tuning 17 ● K-Fold Cross Validation to get average performance on the folds. ● Randomized Search on Logistic Regression since C has large search space. ● Grid Search on Random Forest on limited parameters combinations. ● Randomized Search on XGBoost because multiple hyperparameters to tune.
  • 18. Model Comparisons 18 Models Precision Recall F1 Score Conclusion Dummy Model 0.217 0.500 0.303 Benchmark Logistic Regression 0.384 0.566 0.457 Best recall Random Forest 0.513 0.514 0.514 Best F1 XGBoost 0.444 0.505 0.474 ● Compare the models to Scikit-learn’s dummy classifier. ● All models performed better than dummy model.
  • 19. Model Comparisons 19 ● Compare within 3 models. ● Random Forest (black line) has the best precision_recall score. ★ Recall: how many 1s are being identified? ★ Precision: Among all the 1s that are flagged, how many are truly 1s? ★ Precision and recall trade-off: high recall will cause low precision Terminology:
  • 20. Model Usage - Recommendation 20 ● I.e. recall = 0.8. Threshold can be adjusted to reach higher recall.
  • 21. Feature Importances 21 Best model Random Forest feature importances plot. ★ PAY_1: most recent month’s payment status. ★ PAY_2: the month prior to current month’s payment status. ★ BILL_AMT1: most recent month’s bill amount. ★ LIMIT_BAL: credit limit
  • 22. Limitations & Future Work 22 Limitations ● Best model Random Forest can only detect 51% of default. ● Model can only be served as an aid in decision making instead of replacing human decision. ● Used only 30,000 records and not from US consumers. Future Work ● Models are not exhaustive. Other models could perform better. ● Get more computational resources to tune XGBoost parameters. ● Acquire US customer data and more useful features.I.e.customer income.
  • 23. Conclusions 23 ● Recent 2 payment status and credit limit are the strongest default predictors. ● Dormant customers can also have default risk. ● Random Forest has the best precision and recall balance. ● Higher recall can be achieved if low precision is acceptable. ● Model can be served as an aid to human decision. ● Suggest output probabilities rather than predictions. ● Model can be improved with more data and computational resources.
  • 24. Thank you! Teresa Nan Email: fenglin.nan@gmail.com LinkedIn: https://www.linkedin.com/in/teresa-n-39287042/ Github: https://github.com/teresanan/capstone_project_credit_card_default/tree/master Project report: https://github.com/teresanan/capstone_project_credit_card_default/blob/master/Final_Report.pdf 24