This is a project I worked on as the capstone for my Master's in Business Analytics program at the University of Cincinnati. In this project, I performed an end-to-end data mining exercise, including data cleaning, distribution analysis, exploratory data analysis, and model building, to identify and predict credit card defaults using customers' past payment data and general profiles. In building the machine learning models, I fit and compared the performance of multiple models and algorithms, including Logistic Regression, PCA, Classification Tree, AdaBoost Classifier, ANN, and LDA.
Predicting Credit Card Defaults using Machine Learning Algorithms
1. 1
MS-CAPSTONE
(BANA 6064)
CARL H. LINDNER COLLEGE OF BUSINESS
SUMMER 2016
PREDICTING CREDIT CARD DEFAULTS
Understanding the concept of default, why it happens and the components used to predict the
default of credit card holders
Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in
Business Analytics
TO:
Prof. Yichen Qin (1st Reader)
Prof. Peng Wang (2nd Reader)
BY:
Sagar Vinaykumar Tupkar
tupkarsr@mail.uc.edu
M08773948
2. 2
ABSTRACT
Credit card defaults pose a major problem to all the major financial service providers today, as they have to invest heavily in collection strategies whose outcomes are uncertain. Analysts in the financial industry have achieved great success in devising methods to predict the default of credit card holders based on various factors. This study uses the previous 6 months of a customer's data to predict whether the customer will default in the next month, applying various statistical and data mining techniques and building different models for the task. The exploratory data analysis is also important for checking the distributions and patterns in customer behavior that eventually lead to default. Out of the four models built, Logistic Regression after Principal Component Analysis and the Adaptive Boosting Classifier performed best, predicting defaults with around 83% accuracy while minimizing the penalty to the company. This study also produced a list of important variables that affect the model and should be considered when predicting defaults. Even though the accuracy of the predictions is good, further research and more powerful techniques can potentially enhance the results and bring a revolution to the credit card industry.
3. 3
Contents
ABSTRACT – 2
1. INTRODUCTION – 4
1.1. Credit-Card Default Definition – 4
1.2. Background and Current Situation of Credit Card Defaults – 4
2. OBJECTIVE OF THE STUDY – 4
3. DATA – 5
4. EXPLORATORY DATA ANALYSIS – 9
4.1. Gender based Distribution – 10
4.2. Education based Distribution – 10
4.3. Age based distribution – 11
4.4. Marital Status based Distribution – 11
4.5. Credit-Line based distribution – 12
4.6. Distribution of Payment statistics in October 2015 – 13
4.7. Distribution of Payment statistics in November 2015 – 13
4.8. Distribution of Payment statistics in December 2015 – 14
4.9. Distribution of Payment statistics in January 2016 – 15
4.10. Distribution of Payment statistics in February 2016 – 15
4.11. Distribution of Payment statistics in March 2016 – 16
5. MODEL PREPARATION – 17
5.1. Logistic Regression Model – 17
5.2. Classification Tree – 22
5.3. Artificial Neural Network – 26
5.4. Linear Discriminant Analysis – 26
6. MODEL COMPARISON – 27
7. CONCLUSION – 28
8. REFERENCES – 29
4. 4
1. INTRODUCTION
1.1. Credit-Card Default Definition –
When a customer applies for and receives a credit card, it becomes a major responsibility for the customer as well as for the credit card issuing company. The credit card company evaluates the customer's creditworthiness and grants a line of credit that it believes the customer can handle responsibly. While most people use their card to make purchases and then diligently pay off what they charge, some people, for one reason or another, do not keep up with their payments and eventually go into credit card default.

Credit card default is the term for what happens when a credit card user charges purchases to the card and then does not pay the bill. It can occur when one payment is more than 30 days past due, which may raise the cardholder's interest rate. Most of the time, the term default is used informally when the credit card payment is more than 60 days past due. A default has a negative impact on the credit report and will most likely lead to higher interest rates on future borrowing.
1.2. Background and Current Situation of Credit Card Defaults –
The U.S. economy is growing at just 2.5% a year, but credit card lending is rising more than twice as fast: 5% over year-earlier levels each month since last fall, accelerating to 6% in March and April 2016, according to Federal Reserve data. That is the fastest card debt has grown since card lending fell in the 2009 recession, and since Americans aren't earning that much more, won't delinquencies, charge-offs and bankruptcies be rising in another year or two?

As a matter of fact, for the 9 dominant U.S. credit card banks, which control 70% of the Visa-MasterCard-American Express-Discover-Chinapay market in the U.S., average charge-offs in early 2016 were 3.13% of annualized average loans, "down from a peak of 9.9% in 2009."

In recent years, credit card issuers have been facing a cash and credit card debt crisis, having over-issued cash and credit cards to unqualified applicants in order to increase their market share. At the same time, many cardholders, irrespective of their repayment ability, have overused credit cards for consumption and accumulated heavy credit and cash-card debts. The crisis is an omen for a blow to consumer finance confidence, and it is a big challenge for both banks and cardholders.
2. OBJECTIVE OF THE STUDY –
This project is an attempt to identify credit card customers who are more likely to default in the coming month. Many credit card issuing companies are working on predictive models that would help them anticipate the payment status of a customer ahead of time using the customer's credit score, credit history, payment history and other factors.
5. 5
Plenty of statistical models to predict delinquency already exist in the financial industry today; however, as the famous quote goes, "All models are wrong, but some are useful," so our attempt to build another predictive model makes its own small contribution to the research.

This project is aimed at using a customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past 6 months, to predict the probability that the particular customer will default next month. Several statistical and data mining techniques are used to build a binary predictive model.

If credit card issuing companies can effectively predict the imminent default of customers beforehand, it will help them pursue the targeted customers and take calculated steps to avoid the default and to limit future losses efficiently.
3. DATA
As mentioned earlier, this project uses a customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past 6 months, to predict the probability that the particular customer will default next month. The data was provided by one of the well-known credit card issuing banks of the USA and contains proprietary information about the customers (e.g. account numbers, which have been masked). The data in no way directly reveals the identity of any individual or provides information that could be decrypted to connect to an individual.

In this project, the plan is to predict the probability of credit card holders defaulting in the next month using payment data from October 2015 to March 2016. Among the total 30,000 observations, 6,636 (22.12%) are cardholders with a default payment. To model the binary response variable, default payment in April 2016 (Yes = 1, No = 0), the following 23 variables are used as explanatory variables:
1. X1: Amount of the given credit (USD): it includes both the individual consumer credit and his/her
family (supplementary) credit.
Including this variable in the study is important as the credit line of a customer is a good indicator
of the financial credit score of the customer. Using this variable will help the model predict
defaults more effectively.
2. X2: Gender (1 = male; 2 = female)
It might be useful to see whether the gender of the customer is in any way related to his/her
probability of default. The distribution of defaults based on gender will be an interesting chart to
look at.
3. X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
It might be useful to see whether the education level of the customer is in any way related to
his/her probability of default. The distribution of defaults based on education level will be an
interesting chart to look at.
6. 6
4. X4: Marital status (1 = married; 2 = single; 3 = others)
It might be useful to see whether the marital status of the customer is in any way related to his/her
probability of default. The distribution of defaults based on marital status will be an interesting
chart to look at.
5. X5: Age (year)
It might be useful to see whether the age of the customer is in any way related to his/her
probability of default. The distribution of defaults based on age buckets will be an interesting
chart to look at.
6. X6 - X11: History of past payment. Customers’ past monthly payment records (from October 2015
to March, 2016) were tracked and used in the dataset as follows:
X6 = the repayment status in March, 2016;
X7 = the repayment status in February, 2016;
. . .;
X11 = the repayment status in October, 2015.
The measurement scale for the repayment status is:
-2 = Minimum due payment scheduled for 60 days
-1 = Minimum due payment scheduled for 30 days
0 = pay duly;
1 = payment delay for one month;
2 = payment delay for two months;
. . .;
8 = payment delay for eight months and above;
This information is crucial as it directly provides the payment status of the customer for the past 6 months. Using these variables will help train the model to predict defaults efficiently.
7. X12-X17: Amount of bill statement (USD)
X12 = amount of bill statement in March, 2016;
X13 = amount of bill statement in February, 2016;
. . .;
X17 = amount of bill statement in October, 2015.
7. 7
Actual bill statements of the customers for the past 6 months would give a quantitative estimate
for the amount spent by the customer using the credit card.
8. X18-X23: Amount of previous payment (USD)
X18 = amount paid in March, 2016;
X19 = amount paid in February, 2016;
. . .;
X23 = amount paid in October, 2015.
The amounts (in USD) paid by the customer in the past 6 months indicate the repayment ability of the customer, and the payment pattern can be used to train the model efficiently.
The variable names, examples, data types and descriptions are provided below in Figure 1.
Column name | Variable Name | Example | Data Type | Description
ID | ID | 23 | Integer | Masked account number of the customer
Y | default payment next month | 1 | Binary | 1 = customer defaults next month, 0 = otherwise
X1 | LIMIT_BAL | 2170 | Continuous numeric | Credit line of the customer
X2 | SEX | 2 | Factor | Gender of the customer (1 = male; 2 = female)
X3 | EDUCATION | 2 | Factor | Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
X4 | MARRIAGE | 2 | Factor | Marital status (1 = married; 2 = single; 3 = others)
X5 | AGE | 26 | Integer | Age (years)
X6 | PAY_1 | 2 | Factor | Repayment status in March 2016
X7 | PAY_2 | 0 | Factor | Repayment status in February 2016
X8 | PAY_3 | 0 | Factor | Repayment status in January 2016
X9 | PAY_4 | 2 | Factor | Repayment status in December 2015
X10 | PAY_5 | 2 | Factor | Repayment status in November 2015
X11 | PAY_6 | 2 | Factor | Repayment status in October 2015
X12 | BILL_AMT1 | 1273.70 | Continuous numeric | Amount of bill statement in March 2016 (USD)
X13 | BILL_AMT2 | 1315.80 | Continuous numeric | Amount of bill statement in February 2016 (USD)
X14 | BILL_AMT3 | 1395.62 | Continuous numeric | Amount of bill statement in January 2016 (USD)
X15 | BILL_AMT4 | 1364.19 | Continuous numeric | Amount of bill statement in December 2015 (USD)
X16 | BILL_AMT5 | 1454.06 | Continuous numeric | Amount of bill statement in November 2015 (USD)
X17 | BILL_AMT6 | 1426.37 | Continuous numeric | Amount of bill statement in October 2015 (USD)
X18 | PAY_AMT1 | 62.22 | Continuous numeric | Amount paid in March 2016 (USD)
X19 | PAY_AMT2 | 111.04 | Continuous numeric | Amount paid in February 2016 (USD)
X20 | PAY_AMT3 | 0.00 | Continuous numeric | Amount paid in January 2016 (USD)
X21 | PAY_AMT4 | 111.63 | Continuous numeric | Amount paid in December 2015 (USD)
X22 | PAY_AMT5 | 0.00 | Continuous numeric | Amount paid in November 2015 (USD)
X23 | PAY_AMT6 | 56.42 | Continuous numeric | Amount paid in October 2015 (USD)
Figure 1: Data Dictionary
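As a minimal sketch of how a dataset with this dictionary might be loaded and typed, the snippet below uses Python with pandas. The file name and exact column labels are assumptions for illustration; the bank's data is proprietary and the report does not state the tooling used.

```python
import pandas as pd

# Hypothetical file name; the proprietary data is not distributed with the report.
df = pd.read_csv("credit_default.csv")

# Cast the nominal variables to categorical types, per the data dictionary.
factor_cols = ["SEX", "EDUCATION", "MARRIAGE",
               "PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
df[factor_cols] = df[factor_cols].astype("category")

# Response: 1 = default on the April 2016 payment, 0 = no default.
y = df["default payment next month"]
X = df.drop(columns=["ID", "default payment next month"])
```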
8. 8
To give a feel for the data, Figure 2 shows a snapshot of a subset of observations in the dataset, Figure 3 provides information such as the data type and levels of each variable, and Figure 4 provides a statistical summary of the data –
Figure 3: Description of variables in the dataset
Figure 2: Snapshot of the top 15 observations of the dataset
9. 9
Figure 4: Summary of the variables in the dataset
4. EXPLORATORY DATA ANALYSIS
From the description of the data above, it can be concluded that the data does not have any null values for any of the variables. We start with an initial exploratory data analysis, looking at the distribution of different variables for Y=0 and Y=1, so that the behavior of default and non-default customers can be analyzed.

The values of each variable were aggregated and the totals were plotted against the number of customers; frequency charts were thus prepared for the class variables and insights were drawn from the visualizations. The results are presented separately for Y=1 and Y=0, i.e. default and non-default customers (red being default and green being non-default), so that the analysis becomes easier.
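As a minimal illustration of this kind of frequency chart, the sketch below tabulates default counts within one class variable and draws a stacked bar chart (red for default, green for non-default). It reuses the `df` frame from the loading sketch above, and matplotlib is an assumption rather than the report's actual tooling.

```python
import matplotlib.pyplot as plt

# Default rate within each gender group (1 = male, 2 = female).
print(df.groupby("SEX", observed=True)["default payment next month"].mean())

# Stacked frequency chart: non-default (green) vs default (red), as in the figures.
counts = pd.crosstab(df["SEX"], df["default payment next month"])
counts.plot(kind="bar", stacked=True, color=["green", "red"])
plt.ylabel("Number of customers")
plt.show()
```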
10. 10
4.1. Gender based Distribution:
The bar chart was plotted for distribution of customers based on their gender and is shown in Figure 5.
The result is separated by default and non-default customers (red being default and green being non-
default).
It can be observed that of the ~12k male credit card holders, 24.17% defaulted, whereas of the ~18k female credit card holders, 20.78% defaulted. Although there are more female customers than male customers in total, the percentage of defaulters is higher among male customers than among female customers.
4.2. Education based Distribution:
The bar chart was plotted for distribution of customers based on their education and is shown in Figure
6. The result is separated by default and non-default customers (red being default and green being non-
default).
Figure 6: Distribution by Education
Figure 5: Distribution by Gender
11. 11
It can be observed that most of the credit card holders (~14k) are university educated, followed by graduate school and high school educated customers. Although the default percentage does not vary dramatically with education, it is worth noticing that 25.16% of the high school educated customers defaulted, while the rate falls to 23.73% for university and 19.23% for graduate school customers.
4.3. Age based distribution:
The customers were categorized into bins and bar chart plotted for distribution of customers based on
their age is shown in Figure 7. The result is separated by default and non-default customers (red being
default and green being non-default).
Figure 7: Distribution by Age
It can be observed that credit card holders aged less than 25 years have the highest default proportion (~27.20%), followed by the 45-60 age group (25.13%). While customers in the 25-35 age group are the most numerous, their default proportion (~20.30%) is comparatively low.
4.4. Marital Status based Distribution:
The bar chart was plotted for distribution of customers based on their marital status and is shown in Figure
8. The result is separated by default and non-default customers (red being default and green being non-
default).
12. 12
Figure 8: Distribution by Marital Status
It can be observed that married credit card holders have a higher default proportion (~23.47%) than single customers (~20.93%). Although married customers are fewer in number than single customers, the default proportion is higher among married credit card holders.
4.5. Credit-Line based distribution:
The customers were categorized into bins and bar chart plotted for distribution of customers based on
their credit line is shown in Figure 9. The result is separated by default and non-default customers (red
being default and green being non-default).
Figure 9: Distribution by Credit Line
13. 13
It can be observed that customers whose credit line is between $1000-$5000 are the most numerous and have the second-highest default proportion (24.46%), after customers with a credit line between $500-$1000, whose default proportion is alarmingly high (35.30%).
4.6. Distribution of Payment statistics in October 2015
For the month of October 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 10. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 11.
Figure 10: Repayment status distribution in Oct'15 Figure 11:Bill and Paid amount distribution in Oct'15
It can be observed that in October 2015, while most of the customers paid duly, almost 50% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~9% of their bill statement in October 2015, as opposed to ~14% paid by non-default customers.
4.7. Distribution of Payment statistics in November 2015
For the month of November 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 12. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 13.
14. 14
Figure 12: Repayment status distribution in Nov'15 Figure 13: Bill and Paid amount distribution in Nov'15
It can be observed that in November 2015, while most of the customers paid duly, almost 54% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~8% of their bill statement in November 2015, as opposed to ~13% paid by non-default customers.
4.8. Distribution of Payment statistics in December 2015
For the month of December 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 14. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 15.
Figure 14: Repayment status distribution in Dec'15 Figure 15: Bill and Paid amount distribution in Dec'15
15. 15
It can be observed that in December 2015, while most of the customers paid duly, almost 52% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7.5% of their bill statement in December 2015, as opposed to ~12% paid by non-default customers.
4.9. Distribution of Payment statistics in January 2016
For the month of January 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 16. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 17.
Figure 16: Repayment status distribution in Jan'16 Figure 17: Bill and Paid amount distribution in Jan'16
It can be observed that in January 2016, while most of the customers paid duly, almost 52% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7.5% of their bill statement in January 2016, as opposed to ~12% paid by non-default customers.
4.10. Distribution of Payment statistics in February 2016
For the month of February 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 18. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 19.
16. 16
Figure 18: Repayment status distribution in Feb'16 Figure 19: Bill and Paid amount distribution in Feb'16
It can be observed that in February 2016, while most of the customers paid duly, almost 56% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7% of their bill statement in February 2016, as opposed to ~13% paid by non-default customers.
4.11. Distribution of Payment statistics in March 2016
For the month of March 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 20. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 21.
Figure 20: Repayment status distribution in Mar'16 Figure 21: Bill and Paid amount distribution in Mar'16
17. 17
It can be observed that in March 2016, while most of the customers paid duly, almost 70% of the customers who had a payment delay of 2 months, and 34% of the customers who had a payment delay of 1 month, went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7% of their bill statement in March 2016, as opposed to ~12% paid by non-default customers.
5. MODEL PREPARATION
The aim of this exercise is to build a model, using the variables explained in the earlier section, to predict which credit card holders will default next month. The data used to train the model is the past 6 months of financial, delinquency and payment history. To build the models, the data was divided into training (80%) and testing (20%) subsets. Multiple classifiers were trained on the training dataset, which contained 24,000 observations, and compared on various model performance metrics. We will go through each model separately and discuss the scope, performance and pros and cons of every classifier method.
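A minimal sketch of such a split, continuing the earlier snippets. Stratifying on the response (so both subsets keep the ~22% default rate) is an assumption, as the report does not say how the split was drawn.

```python
from sklearn.model_selection import train_test_split

# 80/20 split: 24,000 training and 6,000 test observations out of 30,000.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```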
5.1. Logistic Regression Model
Logistic regression can be considered a generalization of the linear regression model to a binary response; the binary response variable violates the normality assumptions of ordinary regression models. A logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it can produce a simple probabilistic formula for classification. The weakness is that logistic regression cannot properly deal with non-linear and interaction effects of the explanatory variables.

A logistic regression model was fit on the training dataset using all the variables, and a summary of the model is shown in Table 1 below –
Table 1: Logistic full model summary
Logistic Regression | AIC | Null Deviance | Residual Deviance
Model Summary | 20979 | 25314 | 20815
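A binomial GLM reports exactly these quantities (AIC, null deviance, residual deviance). The sketch below, illustrative rather than the report's actual code, fits the full model with statsmodels after dummy-coding the nominal variables.

```python
import statsmodels.api as sm

# Dummy-code the factor variables, then fit a logistic (binomial) GLM.
X_train_d = pd.get_dummies(X_train, drop_first=True).astype(float)
logit_full = sm.GLM(y_train, sm.add_constant(X_train_d),
                    family=sm.families.Binomial()).fit()
print(logit_full.aic, logit_full.null_deviance, logit_full.deviance)
```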
It was observed that some of the dummy variables created from the class/nominal variables in the dataset were not significant in the full logistic regression model above, and hence a stepwise variable selection method was performed.
5.1.1. Stepwise Variable Selection Method
By performing stepwise variable selection, it was observed that some of the variables were omitted
because they were insignificant in the full model. Finally, the new model was –
Y ~ X1 + X2 + X3 + X4 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X18 + X19 + X20 + X22
18. 18
It can be seen that the variables X5, X14, X15, X16, X17, X21 and X23 were omitted from the model. The
summary of the new model can be seen in Table 2 below –
Table 2: Logistic stepwise model summary
Logistic Stepwise Regression AIC Null Deviance Residual Deviance
Model Summary 20970 25314 20820
The omitted variables, viz. age and the bill statement and payment amounts of a few months, are important from a business knowledge perspective, and it is highly recommended to keep them in the model that is used. Moreover, even after removing these variables, the AIC of the model has not decreased significantly.
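For reference, a bare-bones backward stepwise search by AIC might look like the sketch below. This is an illustrative reimplementation (the report does not name its stepwise routine), reusing the statsmodels setup from the previous sketch.

```python
# Greedy backward elimination: drop the variable whose removal lowers AIC most,
# and stop when no single removal improves AIC. Refits many GLMs, so it is slow.
def backward_step_aic(y, X):
    def fit_aic(cols):
        return sm.GLM(y, sm.add_constant(X[cols]),
                      family=sm.families.Binomial()).fit().aic
    cols = list(X.columns)
    best = fit_aic(cols)
    while len(cols) > 1:
        trials = {c: fit_aic([k for k in cols if k != c]) for c in cols}
        drop, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best:
            break
        cols.remove(drop)
        best = aic
    return cols

selected = backward_step_aic(y_train, X_train_d)
```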
5.1.2. LASSO variable selection –
In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also written Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to avoid overfitting and to enhance the prediction accuracy and interpretability of the statistical model it produces. In this project, 5-fold cross-validation was used to select the tuning parameter (lambda) that enters the LASSO optimization problem.

The entire dataset was used for the 5-fold cross-validation and LASSO variable selection, and the binomial deviance was plotted against different values of the tuning parameter lambda. The optimal value of lambda is indicated by the vertical line in the plot shown below and is around 0.004.
Figure 22: Binomial Deviance plot to choose the tuning parameter-lambda
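As an aside, a comparable selection can be sketched with scikit-learn's cross-validated L1-penalised logistic regression. This tooling is an assumption; note also that scikit-learn parameterises the penalty by C, which corresponds inversely to lambda (roughly lambda is 1/(C * n_samples)).

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# 5-fold CV over a grid of penalty strengths, with an L1 penalty for sparsity.
X_std = StandardScaler().fit_transform(X_train_d)
lasso_cv = LogisticRegressionCV(Cs=20, cv=5, penalty="l1",
                                solver="saga", max_iter=5000)
lasso_cv.fit(X_std, y_train)

# Variables whose coefficients survive the penalty are the "selected" ones.
kept = X_train_d.columns[(lasso_cv.coef_ != 0).ravel()]
```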
Using this lambda, variable selection was then performed with the LASSO method. The null deviance of the resulting model, however, was 31704.85; because of this higher value compared to the full logistic and stepwise models, this model could not be used.
19. 19
After comparing the three different versions of the logistic model, it was concluded that the full logistic model has better parameter values than the other two. Hence, the full logistic regression model, rather than the reduced model or the LASSO fit, is used for further analysis.

To check the model's in-sample and out-sample performance, the response variable was predicted using a cut-off probability of 0.2, the traditional value used by the company for default predictions. The ROC curves for the in-sample and out-sample predictions of the full logistic model are shown in Figure 23 and Figure 24 respectively.
Figure 23: ROC Curve for in-sample Logistic model predictions Figure 24: ROC Curve for out-sample Logistic model predictions
The in-sample and out-sample performance of the full logistic model is given in Table 3 below –
Table 3: Logistic full model in and out sample performance metrics
Model performance metric | AUC | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.7719 | 0.2038 | 0.3849 | 0.2437
Out-sample | 0.7717 | 0.2016 | 0.3838 | 0.2425
The full logistic regression model predicts defaults with a 0.2437 error rate on the training dataset
and a 0.2425 error rate on the test dataset. The AUC for both the in-sample and out-sample ROC
curves is around 0.77. It can be concluded that logistic regression fits the data well and shows
considerable prediction power.
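For reference, the metrics in Table 3 can be reproduced from any fitted probability model at the 0.2 cut-off; a minimal sketch, assuming a fitted classifier clf with predict_proba and hold-out arrays X_test, y_test:

    # Confusion-matrix metrics at the company's 0.2 cut-off (sketch;
    # assumes a fitted model `clf` and hold-out arrays X_test, y_test).
    from sklearn.metrics import confusion_matrix, roc_auc_score

    prob = clf.predict_proba(X_test)[:, 1]      # P(default next month)
    pred = (prob > 0.2).astype(int)             # 0.2 cut-off, not 0.5

    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print("AUC                :", roc_auc_score(y_test, prob))
    print("False Positive Rate:", fp / (fp + tn))   # good clients flagged
    print("False Negative Rate:", fn / (fn + tp))   # defaulters missed
    print("Misclassification  :", (fp + fn) / len(y_test))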
5.1.3. Principal Component Analysis
Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables,
called "principal components", from a large set of data. The goal of principal components analysis is to
explain the maximum amount of variance with the fewest number of principal components. Principal
components analysis is commonly used in the social sciences, market research, and other industries that
use large data sets.
Principal components analysis is commonly used as one step in a series of analyses and can be used to
reduce the number of variables and avoid multicollinearity. Its other main advantage is that once the
patterns in the data are found, the data can be compressed by reducing the number of dimensions
without much loss of information.
To avoid the effect of multicollinearity on the predictions, the variables were standardized and
dimensionality reduction was applied to the dataset using principal component analysis (PCA). The
variance observed in the direction of each component was plotted and is shown in Figure 25 below –
Figure 25: Variance distribution for Principal Components
The principal component analysis produced 15 principal components that could explain the data almost
as efficiently as the original variables did. However, the first few principal components account for
a major share of the total variance in the dataset. To choose the number of principal components, the
‘Elbow Method’ was used; the corresponding line plot is shown in Figure 26 below –
Figure 26: Elbow curve for PCA to decide the number of PCs
It can be observed that beyond the 3rd or 4th principal component, the additional variance explained
is not significant. Hence only 4 principal components, which together account for more than 80% of
the total variance, were kept for further analysis of the dataset.
The main purpose of applying PCA to the dataset was to reduce the effect of multicollinearity and
decrease the number of dimensions so that model performance might improve. Hence, a logistic
regression was run on the reduced-dimension dataset and predictions were made with it. It was
observed that after applying PCA, the logistic model trained more efficiently and its predictive
power also improved.
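A minimal sketch of this step, assuming arrays X_train, y_train and X_test (4 components per the elbow curve; the exact pre-processing in the report may differ):

    # Standardize -> PCA (4 components, ~80% of variance) -> logistic fit.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    pca_logit = make_pipeline(
        StandardScaler(),                  # PCA is scale-sensitive
        PCA(n_components=4),               # 4 PCs chosen by the elbow method
        LogisticRegression(max_iter=1000),
    ).fit(X_train, y_train)

    prob = pca_logit.predict_proba(X_test)[:, 1]
    # cumulative variance explained by the retained components:
    # pca_logit.named_steps["pca"].explained_variance_ratio_.cumsum()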
The results of the confusion matrix are given in the Table 4 below –
Table 4: Logistic model after PCA_ in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.1341                0.2862                0.1589
Out-sample   0.1653                0.2948                0.1732
It can be concluded that PCA reduced both the dimensionality and the effect of multicollinearity on
model performance, and hence the misclassification rate is lower than that of the ordinary logistic
model with all dimensions. Although performance on the hold-out sample is not as good as on the
training sample, the difference is not significant.
5.2. Classification Tree
In a classification tree structure, each internal node denotes a test on an attribute, each branch represents
an outcome of the test, and leaf nodes represent classes. The top-most node in a tree is the root node.
CTs are applied when the response variable is qualitative or quantitative discrete. Classification trees
perform a classification of the observations on the basis of all explanatory variables and supervised by the
presence of the response variable. The segmentation process is typically carried out using only one
explanatory variable at a time. CTs are based on minimizing impurity, a measure of the variability of
the response values of the observations. CTs can yield simple classification rules and can handle the
nonlinear and interactive effects of explanatory variables. However, their sequential nature and
algorithmic complexity make them dependent on the observed data, and even a small change might alter
the structure of the tree. It is difficult to take a tree structure designed for one context and
generalize it to other contexts.
A classification tree model was fit on the training dataset and the results were analyzed. The classification
tree is shown in Figure 27 below –
Figure 27: Classification Tree model diagram
5.2.1. Complexity Parameter tuning and pruning –
The tree obtained with the default complexity parameter “cp” = 0.01 has 5 nodes, as shown in
Figure 27. However, the complexity parameter needs to be tuned according to how the error changes as
each node is added. A plot of the change in relative error is shown in Figure 28, which gives the
optimal value of “cp” and hence the size of the tree.
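The report tunes rpart's “cp” in R; an analogous sketch using scikit-learn's cost-complexity pruning, assuming arrays X_train and y_train (ccp_alpha plays the role of “cp” here):

    # Cost-complexity pruning analogous to tuning rpart's "cp" (sketch;
    # assumes arrays X_train, y_train).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X_train, y_train, cv=5).mean()
              for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]    # analogue of "cp"
    pruned = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                    random_state=0).fit(X_train, y_train)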
Figure 28: Relative error vs Complexity Parameter
As the value of “cp” increases, the complexity of the tree decreases. The relative error increases
once the size of the tree exceeds 3 (“cp” = 0.05), so it is not beneficial to grow the tree beyond
that size. The tree shown in Figure 27 was pruned using a “cp” value of 0.05, based on the
observation from Figure 28, and the final tree is shown in Figure 29 below –
Figure 29: Final Classification Tree after pruning
The ROC curves for in-sample and out-sample predictions of Classification Tree model are shown in
Figure 30 and Figure 31 respectively.
Figure 30: ROC Curve for in-sample Classification Tree predictions
Figure 31: ROC Curve for out-sample Classification Tree predictions
The in-sample and out-sample performance of the Classification Tree model is given in Table 5 below –
Table 5: Classification Tree model in and out sample performance metrics
Sample       AUC      False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.7284   0.2693                0.3288                0.2824
Out-sample   0.7304   0.2713                0.3378                0.2862
The Classification Tree model predicts defaults with a 0.2824 error rate on the training dataset and
a 0.2862 error rate on the test dataset. The AUC is 0.7284 for the in-sample predictions and around
0.7304 for the out-sample predictions. It can be concluded that the Classification Tree fits the data
well and shows considerable prediction power.
5.2.2. Adaptive Boosting (AdaBoost)
Boosting is a method that makes maximum use of a classifier by improving its accuracy. The classifier
method is used as a subroutine to build an extremely accurate classifier in the training set. Boosting
applies the classification system repeatedly on the training data, but in each step the learning attention is
focused on different examples of this set using adaptive weights. Once the process has finished, the single
classifiers obtained are combined into a final, highly accurate classifier in the training set. The final
classifier therefore usually achieves a high degree of accuracy in the test set, as various authors have
shown both theoretically and empirically. Out of the several versions of boosting algorithms, the best
known for binary classification problems is AdaBoost.
It is worth highlighting that the boosting function allows quantifying the relative importance of the
predictor variables. Understanding a small individual tree can be easy; however, it is much harder to
interpret the hundreds or thousands of trees used in the boosting ensemble. Being able to quantify
the contribution of the predictor variables to the discrimination is therefore a really important
advantage. The measure of importance takes into account the gain in the Gini index given by a
variable in a tree and the weight of that tree in the boosting ensemble.
The AdaBoost technique was applied to the dataset in this project; after one hundred iterations with
adaptive weights, it output the importance of each variable in determining the binary outcome. The
result is shown in Figure 32 below.
Figure 32: Relative importance of each variable in the classification task
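The report appears to use an R boosting implementation (e.g. adabag); an analogous sketch with scikit-learn's AdaBoostClassifier, assuming arrays X_train and y_train, with the base-learner depth chosen only for illustration:

    # 100 rounds of AdaBoost over shallow trees, then variable importances
    # (sketch; assumes arrays X_train, y_train).
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2),  # illustrative depth;
        n_estimators=100,                               # "estimator" is named
        random_state=0,                                 # base_estimator in
    ).fit(X_train, y_train)                             # scikit-learn < 1.2

    names = [f"X{i+1}" for i in range(X_train.shape[1])]
    top5 = sorted(zip(names, ada.feature_importances_),
                  key=lambda t: -t[1])[:5]
    print("most important predictors:", top5)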
It can be observed that the boosting algorithm gave the maximum importance to variable X6, the
repayment status in March 2016. This is in concordance with the earlier plain tree structure and also
makes sense, as a credit card holder's status next month would depend heavily on the previous month's
repayment status. However, not all results of the AdaBoost technique could be explained theoretically.
For the final AdaBoost tree model, predictions were made for the training as well as the testing
sample, and the results are delineated in Table 6 below –
Table 6: AdaBoosting Classification tree model_ in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.0842                0.4342                0.1893
Out-sample   0.1607                0.2891                0.1678
It can be clearly observed that model performance improved considerably over the plain classification
tree when the AdaBoost technique was applied to the same dataset. As explained earlier, because of the
adaptive reweighting this method uses, its performance improves on the testing dataset, and concordant
effects are observed in the results shown in Table 6. Even though the False Positive Rate increased
for the testing dataset, the more important metric, the False Negative Rate, decreased significantly,
leading to an overall decrease in the error rate.
5.3. Artificial Neural Network
Artificial neural networks use non-linear mathematical equations to successively develop meaningful
relationships between input and output variables through a learning process. We applied back
propagation networks to classify data. A back propagation neural network uses a feed-forward topology
and supervised learning. The structure of back propagation networks is typically composed of an input
layer, one or more hidden layers, and an output layer, each consisting of several neurons. ANNs can easily
handle the non-linear and interactive effects of explanatory variables. The major drawback of ANNs is
that they do not yield a simple probabilistic classification formula.
An ANN black-box model was fit to the training dataset and converged after 500 iterations. As this is
a black-box model, its internal details cannot be shown here. However, the in-sample and out-sample
performance of the ANN model is given in Table 7 below –
Table 7: ANN model in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.2980                0.5191                0.3467
Out-sample   0.3499                0.4402                0.3702
The ANN model performs poorly on this dataset, especially on the hold-out sample, with a 0.37 error
rate.
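For reference, a back-propagation network of the kind described can be sketched with scikit-learn's MLPClassifier, assuming arrays X_train, y_train and X_test; the single hidden layer of 10 neurons is purely an assumption, as the report does not state the architecture:

    # Feed-forward network trained by back-propagation, capped at 500
    # iterations (sketch; hidden-layer size is an assumption).
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    ann = make_pipeline(
        StandardScaler(),                        # ANNs need scaled inputs
        MLPClassifier(hidden_layer_sizes=(10,),  # hypothetical architecture
                      max_iter=500, random_state=0),
    ).fit(X_train, y_train)
    prob = ann.predict_proba(X_test)[:, 1]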
5.4. Linear Discriminant Analysis
Discriminant analysis, also known as Fisher’s rule, is another technique applied to the binary result of
response variable. DA is an alternative to logistic regression and is based on the assumptions that, for
each given class of response variable, the explanatory variables are distributed as a multivariate normal
distribution with a common variance–covariance matrix. The objective of Fisher’s rule is to maximize the
distance between different groups and to minimize the distance within each group. The pros and cons of
DA are similar to those of LR.
Hence, assuming the underlying explanatory variables are normally distributed, a discriminant
analysis model was applied to the training dataset. To check the model's in-sample and out-sample
performance, the response variable was predicted using a cut-off probability of 0.2, the value
traditionally used by the company for default predictions. The ROC curves for in-sample and
out-sample predictions of the LDA model are shown in Figure 33 and Figure 34 respectively.
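A minimal sketch of this step, assuming arrays X_train, y_train and X_test:

    # LDA fit scored at the same 0.2 cut-off used for the logistic model.
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    prob = lda.predict_proba(X_test)[:, 1]   # posterior P(default)
    pred = (prob > 0.2).astype(int)          # company's traditional cut-off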
Figure 33: ROC Curve for in-sample LDA model predictions
Figure 34: ROC Curve for out-sample LDA model predictions
The in-sample and out-sample performance of the LDA model is given in Table 8 below –
Table 8: LDA model in and out sample performance metrics
Sample       AUC      False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.7723   0.1383                0.4549                0.2080
Out-sample   0.7697   0.1326                0.4461                0.2030
The linear discriminant analysis model predicts defaults with a 0.2080 error rate on the training
dataset and a 0.2030 error rate on the test dataset. The AUC for both the in-sample and out-sample
ROC curves is around 0.77. It can be concluded that LDA fits the data well and shows considerable
prediction power.
6. MODEL COMPARISON
We have built various models on the training dataset and checked their performance on both the
training and the testing dataset in predicting defaults. For defaults, False Negatives hurt the
business more than False Positives, so fewer False Negatives are desired in the predictions. To
compare the performance of all the models, a cost function was introduced with five times as much
penalty for a False Negative as for a False Positive. The comparison and model performance summary is
given in Table 9 below –
Table 9: Comparison of in and out sample metrics of all models
Model                     Sample       AUC      FP Rate   FN Rate   Error Rate   Cost
1. Logistic Regression    In-sample    0.7719   0.2038    0.3849    0.2437       0.58308
                          Out-sample   0.7717   0.2016    0.3838    0.2425       0.58726
1.1 Logistic after PCA    In-sample    NA       0.1341    0.2862    0.1589       0.4164
                          Out-sample   NA       0.1653    0.2948    0.1732       0.4327
2. Classification Tree    In-sample    0.7284   0.2693    0.3288    0.2824       0.57225
                          Out-sample   0.7304   0.2713    0.3378    0.2862       0.58959
2.1 AdaBoost Classifier   In-sample    NA       0.0842    0.4342    0.1893       0.5196
                          Out-sample   NA       0.1607    0.2891    0.1678       0.4273
3. ANN                    In-sample    NA       0.2980    0.5191    0.3467       0.80442
                          Out-sample   NA       0.3499    0.4402    0.3702       0.76563
4. LDA                    In-sample    0.7723   0.1383    0.4549    0.2080       0.66475
                          Out-sample   0.7697   0.1326    0.4461    0.2030       0.60377
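The exact formula behind the Cost column is not printed in the report; one plausible per-observation form, consistent with the five-to-one penalty described above, is cost = (FP + 5 × FN) / n, sketched below:

    # Assumed asymmetric cost: each False Negative counts five times as
    # much as a False Positive, averaged over all n observations.
    from sklearn.metrics import confusion_matrix

    def asymmetric_cost(y_true, y_pred, fn_weight=5.0):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return (fp + fn_weight * fn) / len(y_true)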
It can be observed that although some methods (e.g. LDA) have a lower misclassification rate, their
high False Negative rate combined with the asymmetric cost function of the business problem makes
their overall cost high. The ANN model performs poorly on both the in-sample and out-sample datasets,
with a very high cost value. The logit model and the Classification Tree have almost the same cost
function value. However, after applying PCA to the dataset and using the reduced-dimension data for
logistic regression, the results are much better than those of the ordinary logit regression.
Similarly, AdaBoosting improves the performance of the Classification Tree through its adaptive
boosting technique. The error rate of logistic regression with PCA is comparable to that of
AdaBoosting, but its False Negative rate is lower on the training dataset, making it the better model.
According to the business requirement, either the Logistic Model (after PCA) or the AdaBoost
Classifier could be used to predict the credit card holders who will default next month.
7. CONCLUSION
This exercise has enabled us to predict which customers are likely to default next month using the
past 6 months' data for each credit card holder, viz. their payment and default history. Various
classifiers were built with the problem statement in mind, and they were compared on the basis of
their False Positive and False Negative rates. It was observed that the Logistic Regression model
(after PCA) and the AdaBoost Classifier perform best amongst all the models and hence could be
accepted.
It can also be said that our model might perform even better if we incorporated a few more variables
for which data is not readily available; for example, credit score and income would let the model
train on richer information than what we have now. Having said that, considering only the past 6
months' data for predicting defaults in the immediate future is not the best way to solve the problem;
many data mining and machine learning techniques now flourishing in the financial industry could be
applied to predict credit card customer defaults.