This document presents a study comparing several machine learning models for personal credit scoring: logistic regression, multilayer perceptron, support vector machine, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C). The models were tested on datasets from a Portuguese bank. HLVQ-C achieved the highest accuracy and was the most useful model according to a proposed measure that considers earnings from denying bad credits and losses from denying good credits. While other models had higher error rates for good credits, HLVQ-C balanced accuracy and usefulness the best, making it suitable for commercial credit scoring applications.
A Review on Credit Card Default Modelling using Data ScienceYogeshIJTSRD
In the last few years, credit cards have become one of the major consumer lending products in the U.S. as well as several other developed nations of the world, representing roughly 30% of total consumer lending (USD 3.6 tn in 2016). Credit cards issued by banks hold the majority of the market share with approximately 70% of the total outstanding balance. Banks' credit card charge-offs have stabilized after the financial crisis at around 3% of the total outstanding balance. However, there are still differences in credit card charge-off levels between competitors. Harsh Nautiyal | Ayush Jyala | Dishank Bhandari "A Review on Credit Card Default Modelling using Data Science" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Special Issue | International Conference on Advances in Engineering, Science and Technology - 2021, May 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42461.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/42461/a-review-on-credit-card-default-modelling-using-data-science/harsh-nautiyal
Instance Selection and Optimization of Neural NetworksITIIIndustries
Credit scoring is an important tool in financial institutions, which can be used in credit granting decision. Credit applications are marked by credit scoring models and those with high marks will be treated as “good”, while those with low marks will be regarded as “bad”. As data mining technique develops, automatic credit scoring systems are warmly welcomed for their high efficiency and objective judgments. Many machine learning algorithms have been applied in training credit scoring models, and ANN is one of them with good performance. This paper presents a higher accuracy credit scoring model based on MLP neural networks trained with back propagation algorithm. Our work focuses on enhancing credit scoring models in three aspects: optimize data distribution in datasets using a new method called Average Random Choosing; compare effects of training-validation-test instances numbers; and find the most suitable number of hidden units. Another contribution of this paper is summarizing the tendency of scoring accuracy of models when the number of hidden units increases. The experiment results show that our methods can achieve high credit scoring accuracy with imbalanced datasets. Thus, credit granting decision can be made by data mining methods using MLP neural networks.
Default Probability Prediction using Artificial Neural Networks in R ProgrammingVineet Ojha
The objective of the project is to analyze the ability of the Artificial Neural Network Model
developed to forecast the credit risk profile of retails banking loan consumers and credit card
customers.
From a theoretical point of view, this project introduces a literature review on the detailed
working and the application of Artificial Neural Networks for credit risk management.
Practically, the aim of this project is to present a model for estimating the Probability of Default
using an Artificial Neural Network, in order to benefit from non-linear models.
Predicting credit defaulters is a perilous task for financial institutions such as banks. Identifying a non-payer
before granting a loan is a significant and conflict-ridden task for the banker. Classification techniques
are a good choice for predictive analysis such as determining whether an applicant is a genuine
customer or a cheat. Determining the best classifier is a risky assignment for any practitioner such as a
banker. This allows computer science researchers to carry out research by evaluating
different classifiers and finding the best classifier for such predictive problems. This research
work investigates the performance of the LADTree and REPTree classifiers for credit risk prediction
and compares their fitness through various measures. The German credit dataset has been used
to predict credit risk with the help of an open source machine learning tool.
Machine Learning Project - Default credit card clients Vatsal N Shah
- The model we built here uses all available factors in the customer data to predict who will be defaulters and non‐defaulters next month.
- The goal is to find whether the clients are able to pay their next month's credit amount.
- Identify some potential customers for the bank who can settle their credit balance.
- To determine if their customers could make the credit card payments on‐time.
- Default is the failure to pay interest or principal on a loan or credit card payment.
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Melissa Moody
Researchers Navin Kasa, Andrew Dahbura, and Charishma Ravoori undertook a capstone project—part of the UVA Data Science Institute Master of Science in Data Science program—that addresses credit card fraud detection through a semi-supervised approach, in which clusters of account profiles are created and used for modeling classifiers.
Billions of dollars of loss are caused every year by fraudulent credit card transactions. The design of efficient fraud detection algorithms is key for reducing these losses, and more and more algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is however particularly challenging due to the non-stationary distribution of the data, the highly unbalanced classes distributions and the availability of few transactions labeled by fraud investigators. At the same time public data are scarcely available for confidentiality issues, leaving unanswered many questions about what is the best strategy. In this thesis we aim to provide some answers by focusing on crucial issues such as: i) why and how under sampling is useful in the presence of class imbalance (i.e. frauds are a small percentage of the transactions), ii) how to deal with unbalanced and evolving data streams (non-stationarity due to fraud evolution and change of spending behavior), iii) how to assess performances in a way which is relevant for detection and iv) how to use feedbacks provided by investigators on the fraud alerts generated. Finally, we design and assess a prototype of a Fraud Detection System able to meet real-world working conditions and that is able to integrate investigators’ feedback to generate accurate alerts.
• Forecasted the Expected Credit Loss, over the lifetime of the mortgage. Built Loan-level PD Model using Markov Chain Transition Matrix and logistic regression with six transition states and validated them using backtesting.
The use of genetic algorithm, clustering and feature selection techniques in ...IJMIT JOURNAL
Decision tree modelling, as one of data mining techniques, is used for credit scoring of bank customers.
The main problem is the construction of decision trees that could classify customers optimally. This study
presents a new hybrid mining approach in the design of an effective and appropriate credit scoring model.
It is based on genetic algorithm for credit scoring of bank customers in order to offer credit facilities to
each class of customers. Genetic algorithm can help banks in credit scoring of customers by selecting
appropriate features and building optimum decision trees. The new proposed hybrid classification model is
established based on a combination of clustering, feature selection, decision trees, and genetic algorithm
techniques. We used clustering and feature selection techniques to pre-process the input samples to
construct the decision trees in the credit scoring model. The proposed hybrid model chooses and combines
the best decision trees based on the optimality criteria. It constructs the final decision tree for credit
scoring of customers. Using one credit dataset, results confirm that the classification accuracy of the
proposed hybrid classification model is higher than that of almost all the classification models that have been
compared in this paper. Furthermore, the number of leaves and the size of the constructed decision tree
(i.e. complexity) are smaller, compared with other decision tree models. In this work, one financial dataset was
chosen for experiments, namely the Bank Mellat credit dataset.
Credit Card Fraudulent Transaction Detection Research PaperGarvit Burad
Credit Card Fraudulent Transaction Detection Research Paper using Machine Learning technologies like Logistic Regression, Random Forest, Feature Engineering and various techniques to deal with a highly skewed dataset
We propose an algorithm for training Multi Layer Perceptrons for classification problems, which we named Hidden Layer Learning Vector Quantization (H-LVQ). It consists of applying Learning Vector Quantization to the last hidden layer of a MLP, and it gave very successful results on problems containing a large number of correlated inputs. It was applied with excellent results to the classification of Rutherford
backscattering spectra and to a benchmark problem of image recognition. It may also be used for efficient feature extraction.
Synthetic feature generation to improve accuracy in prediction of credit limitsBOHRInternationalJou1
Financial institutions use various data mining algorithms to determine the credit limits for individuals using features
like age, education, employment, gender, income, and marital status. But, there is still a question of accurate
predictability, that is, how accurate can an institution be in predicting risk and granting credit levels. If an institution
grants too low of a credit limit/loan for an individual, then the institution may lose business to competitors, but
if the institution grants too high of a credit limit/loan, then the institution may lose money if that individual does
not repay the credit/loan. The novelty of this work is that it shows how to improve the accuracy in predicting
credit limits/loan amounts using synthetic feature generation. By creating secondary groupings and including
both the original binning and the synthetic bins, the classification accuracy and other statistical measures like
precision and ROC improved substantially. Hence, our research showed that without synthetic feature generation,
the classification rates were low, and the use of synthetic features greatly improved the classification accuracy and
other statistical measures.
# Project 03 - Data Mining on Financial Data
In this project, various models like Logistic Regression, KNN, SVM, and Random Forest have been applied to three finance-related datasets in order to discover the insights from the datasets. The methods will be applied to the datasets in R studio and corresponding outputs will be shown in this paper. Then the applied methods will be compared with each other to identify how well the method is performing on each of the datasets and then finally the better method will be chosen for each of the datasets. The methods were explained in detail along with their advantages with the help of the information from the relevant papers. Every method which has been applied to the datasets has been confirmed whether it follows the data mining methodologies. Data mining methodologies like CRISP-DM, KDD and SEMMA will also be explained in detail via this paper. The process flow of the data mining methodologies will be explained and also will be made sure that whether the process flow has been followed while applying each method on the datasets. Data mining is now considered as a major factor in the risk management process of financial institutions. Even though various data mining tools are existing in the market, this paper allows readers to understand how the algorithm works on the dataset and how to justify whether an algorithm’s prediction.
Credit risks are calculated based on the borrowers’ overall ability to repay. Our objective was to use optimization in order to create a tool that approves or rejects loans to borrowers. We also used optimization to establish how much interest rate/credit will be extended to borrowers who were approved for a loan.
In this paper I compare a conventional classification regression method with the state of the art machine learning technique XGBoost. This results in a major performance gain in terms of classification and expected loss.
Predicting online user behaviour using deep learning algorithmsArmando Vieira
We propose a robust classifier to predict buying intentions based on user behaviour within a large e-commerce website. In this work we compare traditional machine learning techniques with the most advanced deep learning approaches. We show that both Deep Belief Networks and Stacked Denoising auto-Encoders achieved a substantial improvement by extracting features from high dimensional data during the pre-train phase. They prove also to be more convenient to deal with severe class imbalance.
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
Presentation at University of Lisbon on Machine Learning and big data.
Deep learning algorithms and applications to credit risk analysis, churn detection and recommendation algorithms
Improving Personal Credit Scoring with HLVQ-C
A. S. Vieira (1), João Duarte (1), B. Ribeiro (2) and J. C. Neves (3)
(1) ISEP, Rua de S. Tomé, 4200 Porto, Portugal
(2) Department of Informatics Engineering, University of Coimbra, P-3030-290 Coimbra, Portugal
(3) ISEG - School of Economics, Rua Miguel Lupi 20, 1249-078 Lisboa, Portugal
{asv@isep.ipp.pt, bribeiro@dei.uc.pt, jcneves@iseg.utl.pt}
Abstract. In this paper we study personal credit scoring using several machine
learning algorithms: Multilayer Perceptron, Logistic Regression, Support
Vector Machines, AdaBoostM1 and Hidden Layer Learning Vector
Quantization. The scoring models were tested on a large dataset from a
Portuguese bank. Results are benchmarked against traditional methods under
consideration for commercial applications. A measure of the usefulness of a
scoring model is presented and we show that HLVQ-C is the most accurate
model.
Keywords: Credit Scoring, Neural Networks, Classification, Hidden Layer
Learning Vector Quantization.
1 Introduction
Quantitative credit scoring models have been developed for the credit granting
decision in order to classify applications as ‘good’ or ‘bad’, the latter being loosely
defined as a group with a high likelihood of defaulting on the financial obligation.
It is very important to have accurate models to identify bad performers. Even a
small fraction increase in credit scoring accuracy is important. Linear discriminant
analysis still is the model traditionally used for credit scoring. However, with the
growth of the credit industry and the large loan portfolios under management, more
accurate credit scoring models are being actively investigated [1]. This effort is
mainly oriented towards nonparametric statistical methods, classification trees, and
neural network technology for credit scoring applications [1-5].
The purpose of this work is to investigate the accuracy of several machine learning
models for credit scoring applications and to benchmark their performance against
the models currently under investigation.
The credit industry has experienced a rapid growth with significant increases in
instalment credit, single-family mortgages, auto-financing, and credit card debt.
Credit scoring models, i.e., ratings of the client's ability to repay loans, are widely used
by the financial industry to improve cashflow and credit collections. The advantages
of credit scoring include reducing the cost of credit analysis, enabling faster credit
decisions, closer monitoring of existing accounts, and prioritizing collections [4].
Personal credit scoring is used by banks for approval of home loans, to set credit
limits on credit cards and for other personal expenses. However, with the growth in
financial services there have been mounting losses from delinquent loans. For
instance, the recent crises in the financial system triggered by sub-prime mortgages
have caused losses of several billion dollars.
In response, many organizations in the credit industry are developing new models
to support the personal credit decision. The objective of these new credit scoring
models is increasing accuracy, which means more creditworthy applicants are granted
credit thereby increasing profits; non-creditworthy applicants are denied credit thus
decreasing losses.
The main research focuses on two areas: prediction of firm insolvency and
prediction of individual credit risk. However, due to the proprietary nature of credit
scoring, there is little research reporting the performance of commercial credit scoring
applications.
Salchenberger et al. investigate the use of a multilayer perceptron neural network
to predict the financial health of savings and loans [6]. The authors compare a
multilayer perceptron neural network with a logistic regression model for a data set of
3429 S&L's from January 1986 to December 1987. They find that the neural network
model performs as well as or better than the logistic regression model for each data
set examined.
The use of decision trees and multilayer perceptron neural networks for personal
credit scoring was studied by several authors. West tested several neural network
architectures on two personal credit datasets, German and Australian. Results
indicate that the multilayer perceptron neural network and the decision tree model both
have a comparable level of accuracy while being only marginally superior to traditional
parametric methods [7].
Jensen [5] develops a multilayer perceptron neural network for credit scoring with
three outcomes: obligation charged off (11.2%), obligation delinquent (9.6%), and
obligation paid off. Jensen reports a correct classification result of 76 - 80% with a
false positive rate (bad credit risk classified as good credit) of 16% and a false
negative rate (good credit risk classified as bad credit) of 4%. Jensen concludes that
the neural network has potential for credit scoring applications, but its results were
obtained on only 50 examples.
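The error definitions quoted above can be made concrete with a small sketch. It assumes that the reported rates are shares of all classified cases, which is consistent with the upper-end figures summing to 100% (80% + 16% + 4%). The function name and the toy labels are hypothetical and are not Jensen's data; the toy set is constructed only to reproduce those percentages on 50 examples.

```python
def error_shares(actual, predicted):
    """Return (accuracy, false positive share, false negative share).

    Labels: 1 = bad credit risk, 0 = good credit risk.
    Following the text, a false positive is a bad risk classified as good,
    and a false negative is a good risk classified as bad.
    All three values are fractions of the total number of cases (assumption).
    """
    n = len(actual)
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    false_pos = sum(a == 1 and p == 0 for a, p in pairs)
    false_neg = sum(a == 0 and p == 1 for a, p in pairs)
    return correct / n, false_pos / n, false_neg / n


# Hypothetical toy labels for 50 cases: 8 bad-as-good, 2 good-as-bad, 40 correct.
actual = [1] * 8 + [0] * 2 + [1] * 10 + [0] * 30
predicted = [0] * 8 + [1] * 2 + [1] * 10 + [0] * 30
accuracy, fp_share, fn_share = error_shares(actual, predicted)
```

With only 50 cases, each misclassified example shifts a share by two percentage points, which illustrates why results on such a small sample should be read with caution.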
The research available on predicting financial distress, whether conducted at the
firm or individual level suggests that recent non-parametric models show potential yet
lack an overwhelming advantage over classical statistical techniques. Recently we
have successfully applied new data mining models like Hidden Layer Learning
Vector Quantization (HLVQ-C) [8] and Support Vector Machines (SVM) [9] for
bankruptcy prediction where they clearly outperformed linear methods. However, the
major drawback for using these models is that they are difficult to understand and the
decisions cannot be explicitly discriminated.
This paper is organized as follows. Section 2 discusses the dataset used, the pre-
processing of the data and feature selection. Section 3 presents the models and the
usefulness measure. In Section 4 the results are discussed and finally Section 5
presents the conclusions.
2 Dataset
The database contains about 400 000 entries of customers who have applied to the bank for
personal credit. The value requested ranges from 5 to 40 kEuros and the
payment period varies between 12 and 72 months.
Table 1 presents the definitions of the eighteen attributes used by the bank. Eight
of these attributes are categorical (1, 2, 3, 4, 5, 8, 9 and 10) and the remaining are
continuous. Most of the entries in the database have missing values for several
attributes. To create a useful training set we select only entries without missing
values.
The database also contains the number of days that each client is in default to the
bank concerning the payment of the monthly mortgage; in most cases this number is
zero. We consider a client to have bad credit when this number is greater than 30 days.
We found 953 examples in the database within this category. To create a balanced
dataset, an equal number of randomly selected non-default examples was added,
reaching a total of 1906 training cases. We call this dataset 1.
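The balancing step described above can be sketched as random undersampling of the majority class. This is a minimal illustration, not the procedure actually used on the bank's data: the function name, the record layout (`days_in_default`) and the toy sample are assumptions made for the example.

```python
import random

def balance_by_undersampling(records, is_default, seed=0):
    """Keep every default and an equal number of randomly chosen non-defaults.

    Assumes there are at least as many non-defaults as defaults.
    """
    defaults = [r for r in records if is_default(r)]
    non_defaults = [r for r in records if not is_default(r)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(non_defaults, len(defaults))
    balanced = defaults + sampled
    rng.shuffle(balanced)  # avoid having all defaults first
    return balanced

# Hypothetical toy data: 20 of 200 clients are more than 30 days late.
clients = [{"id": i, "days_in_default": 45 if i < 20 else 0} for i in range(200)]
dataset_1 = balance_by_undersampling(clients, lambda c: c["days_in_default"] > 30)
```

With the 953 defaults reported above, the same call would return 1906 records, half of them defaults.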
We also created a second dataset where the definition of bad credit was set to 45
days of delay. This dataset is therefore more unbalanced, containing 18% defaults
and 82% non-defaults. We call this dataset 2.
Table 1: Attributes used for credit scoring. The selected attributes (see Section 2.1)
are marked with an asterisk.
 #    Designation                       #    Designation
 1*   Professional activity            10    Nationality
 2    Previous professional activity   11*   Debt capacity
 3*   Zip code                         12    Annual costs
 4*   Zip code – first two digits      13    Total income
 5    Marital status                   14    Other income
 6    Age                              15    Effort ratio
 7    Number of dependents             16    Future effort ratio
 8    Have home phone                  17*   Number of instalments
 9    Residential type                 18*   Loan solicited
2.1 Feature selection
Several feature selection algorithms were used to exclude useless attributes and
reduce the complexity of the classifier. The presence of many categorical attributes
makes feature selection difficult, so three methods were used to test the consistency
of the selection: SVM Attribute Evaluation, Chi-squared and GainRatio.
Each method selected a slightly different set of attributes. We chose the following set
of six attributes with the highest consensus among all rankers: 1, 3, 4, 11, 17 and 18.
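As an illustration of one of the rankers named above, the sketch below computes the Pearson chi-squared statistic between a categorical attribute and the default label and ranks attributes by it. It is a plain re-implementation for exposition, not the tool used in the study, and the toy attribute values are hypothetical.

```python
from collections import Counter

def chi_squared(values, labels):
    """Pearson chi-squared statistic between a categorical attribute and the class."""
    n = len(values)
    value_counts = Counter(values)
    label_counts = Counter(labels)
    observed = Counter(zip(values, labels))
    stat = 0.0
    for v, nv in value_counts.items():
        for c, nc in label_counts.items():
            expected = nv * nc / n  # count expected if attribute and class were independent
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# Hypothetical toy sample: 1 = default, 0 = non-default.
default = [1, 1, 1, 0, 0, 0, 0, 1]
activity = ["a", "a", "a", "a", "b", "b", "b", "b"]
home_phone = ["y", "n", "y", "n", "y", "n", "y", "n"]

scores = {
    "activity": chi_squared(activity, default),
    "home_phone": chi_squared(home_phone, default),
}
ranking = sorted(scores, key=scores.get, reverse=True)  # most informative first
```

A higher statistic indicates a stronger association with the class. Comparing the rankings produced by several such scores is one way to look for the consensus mentioned above.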
3. Models used
The data was analysed with five machine learning algorithms: logistic regression
(Logistic), Multilayer Perceptron (MLP), Support Vector Machine (SVM), AdaBoostM1
and Hidden Layer Learning Vector Quantization (HLVQ-C).
For MLP, we used a neural network with a single hidden layer of 4 neurons. The
learning rate was set to 0.3 and the momentum to 0.2. For SVM, we used the
LibSVM library [12] with a radial basis function kernel, the cost parameter
C = 1 and the shrinking heuristic. For the AdaBoostM1 algorithm we used a Decision
Stump as weak learner and set the number of iterations to 100. No resampling was
used. The HLVQ-C algorithm implementation is described elsewhere [8].
3.1 Usefulness of a classifier
Accuracy is a good indicator, but not the only criterion, for choosing the appropriate
classifier. We introduce a measure of the usefulness of a classifier, defined by:
$$\eta = E - L,$$
where $E$ is the earnings obtained by the use of the classifier and $L$ the losses incurred
due to the inevitable misclassifications.
Earnings, from the bank's point of view, result from refusing credit to defaulting
clients, and can be expressed as:
$$E = N V x (1 - e_I),$$
where $N$ is the number of loan applicants, $V$ the average value of a loan, $e_I$ the type
I error and $x$ the typical percentage of defaults in the real sample. For simplicity we
are assuming a Loss Given Default (LGD) of 100%.
Losses result from excluding clients that were incorrectly classified as defaults. In
a simplified way they can be calculated as:
$$L = N V m (1 - x) e_{II},$$
where $m$ is the average margin typically obtained by the bank on a loan and $e_{II}$ the
type II error. The net gain in using a classifier is:
$$\eta = N V \left[ x (1 - e_I) - (1 - x) e_{II} m \right].$$
To have $\eta > 0$ we need:
$$\frac{x}{1 - x} > m G,$$
where
$$G = \frac{e_{II}}{1 - e_I}$$
is a measure of the efficiency of the classifier. This quantity
should be the lowest possible. Assuming $x$ small and $e_I = 0.5$, we should have
$$x > 2 m e_{II}.$$
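The definitions of this section translate directly into a few functions. The sketch below is only a numerical illustration of the formulas: the function and parameter names are chosen for the example, and the input values (1000 applicants, an average loan of 10 kEuros, 10% defaults, error rates of 0.5 and 0.1, margin 0.5) are hypothetical, not taken from the bank's data.

```python
def earnings(n, v, x, e1):
    """E: value saved by refusing the defaults that the classifier detects."""
    return n * v * x * (1.0 - e1)

def losses(n, v, x, e2, m):
    """L: margin lost on good clients wrongly classified as defaults."""
    return n * v * m * (1.0 - x) * e2

def usefulness(n, v, x, e1, e2, m):
    """Net gain eta = E - L."""
    return earnings(n, v, x, e1) - losses(n, v, x, e2, m)

def efficiency(e1, e2):
    """G = type II error / (1 - type I error); lower is better."""
    return e2 / (1.0 - e1)

# Hypothetical inputs, for illustration only (v in kEuros).
n, v, x, e1, e2, m = 1000, 10.0, 0.10, 0.5, 0.1, 0.5
eta = usefulness(n, v, x, e1, e2, m)               # about 500 - 450 = 50 kEuros
is_useful = x / (1.0 - x) > m * efficiency(e1, e2)  # same condition as eta > 0
```

Here the classifier is only marginally useful: the odds of default, 0.10 / 0.90 ≈ 0.111, barely exceed m G = 0.5 × 0.2 = 0.1.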
4. Results
In Table 2 we compare the efficiency of the classifiers on the two datasets using 10-fold
cross validation. For dataset 1, most classifiers achieve good accuracy in detecting
defaults, but at the cost of large type II errors. Since real data is highly unbalanced,
most cases being non-defaults, this means that a large share of clients, more than
half for some classifiers, would be rejected. SVM is the most balanced classifier,
while HLVQ-C achieved the highest accuracy on both datasets.
Since dataset 2 is more unbalanced and its default definition is stricter, the type
II error decreased considerably while the type I error increased. More importantly, the
usefulness of the classifier, measured by G, improved substantially. HLVQ-C is again the
best performer on both accuracy and usefulness, and AdaBoostM1 is the second best.
Logistic is the worst performer.
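For reference, the error types can be computed from predictions as sketched below. The convention is an assumption inferred from the usefulness formulas of Section 3.1: the type I error is taken as the fraction of actual defaults classified as non-defaults, and the type II error as the fraction of non-defaults classified as defaults. The labels are a hypothetical toy sample.

```python
def error_rates(y_true, y_pred):
    """Return (accuracy, type I error, type II error); labels: 1 = default, 0 = non-default."""
    n_default = sum(1 for t in y_true if t == 1)
    n_good = len(y_true) - n_default
    missed_defaults = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    rejected_good = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true), missed_defaults / n_default, rejected_good / n_good

# Hypothetical toy predictions: 4 defaults and 6 non-defaults.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
accuracy, type_1, type_2 = error_rates(y_true, y_pred)
```

In this toy case one default in four is missed and two good clients in six are rejected.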
Following our definition, for the classifier to be useful the dataset must contain at
least about 6% defaults for the best model (HLVQ-C), and as much as 11% for
the Logistic case (setting m = 0.5).
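These thresholds can be checked from the condition x / (1 - x) > m G of Section 3.1. Solving it for x gives x > m G / (1 + m G). The sketch below uses the dataset 2 values of G from Table 2 (11.9% for HLVQ-C and 21.2% for Logistic); the function name is chosen for the example.

```python
def break_even_default_rate(g, m):
    """Smallest default fraction x satisfying x / (1 - x) = m * g."""
    mg = m * g
    return mg / (1.0 + mg)

m = 0.5  # margin assumed in the text
x_hlvq = break_even_default_rate(0.119, m)      # about 0.056
x_logistic = break_even_default_rate(0.212, m)  # about 0.096
```

The exact solution gives roughly 5.6% and 9.6%. The small-x approximation x ≈ m G gives 5.95% and 10.6%, which is presumably how the rounded figures of about 6% and 11% were obtained.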
To increase the usefulness, i.e. to lower G, the type II error should decrease without
deteriorating the type I error. This can be done either by using a more unbalanced
dataset or by applying different weights to each class. The exact proportion of
instances of each class in the dataset can be adjusted in order to minimize G.
Table 2. Accuracy, error types and usefulness of the different models on the two datasets
considered. All values are percentages.
Dataset     Classifier    Accuracy   Type I   Type II   G
Dataset 1   Logistic      66.3       27.3     40.1      54.8
Dataset 1   MLP           67.5        8.1     57.1      61.1
Dataset 1   SVM           64.9       35.6     34.6      52.3
Dataset 1   AdaBoostM1    69.0       12.6     49.4      55.7
Dataset 1   HLVQ-C        72.6        5.3     49.5      52.3
Dataset 2   Logistic      81.2       48.2     11.0      21.2
Dataset 2   MLP           82.3       57.4      9.1      20.1
Dataset 2   SVM           83.3       38.1     12.4      19.3
Dataset 2   AdaBoostM1    84.1       45.7      8.0      14.7
Dataset 2   HLVQ-C        86.5       48.3      6.2      11.9
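As a consistency check, G can be recomputed from the type I and type II errors in Table 2 using its definition in Section 3.1. The sketch below copies the table values. The recomputed figures agree with the reported column to within about 1.5 points; the remaining differences may come from rounding or from averaging over folds, which the text does not specify.

```python
# (dataset, classifier, type I %, type II %, reported G %), copied from Table 2
TABLE_2 = [
    (1, "Logistic", 27.3, 40.1, 54.8),
    (1, "MLP", 8.1, 57.1, 61.1),
    (1, "SVM", 35.6, 34.6, 52.3),
    (1, "AdaBoostM1", 12.6, 49.4, 55.7),
    (1, "HLVQ-C", 5.3, 49.5, 52.3),
    (2, "Logistic", 48.2, 11.0, 21.2),
    (2, "MLP", 57.4, 9.1, 20.1),
    (2, "SVM", 38.1, 12.4, 19.3),
    (2, "AdaBoostM1", 45.7, 8.0, 14.7),
    (2, "HLVQ-C", 48.3, 6.2, 11.9),
]

def efficiency_percent(type1_pct, type2_pct):
    """G = e_II / (1 - e_I), with inputs and result expressed in percent."""
    return type2_pct / (1.0 - type1_pct / 100.0)

recomputed = {(d, name): efficiency_percent(t1, t2) for d, name, t1, t2, _ in TABLE_2}
max_gap = max(abs(recomputed[(d, name)] - g) for d, name, _, _, g in TABLE_2)
```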
5 Conclusions
In this work we compared the efficiency of several machine learning algorithms for
credit scoring. Feature selection was used to reduce the complexity and eliminate
useless attributes. From the initial set of 18 features only 6 were selected.
While MLP only slightly improves on the accuracy of logistic regression, other methods
show considerable gains: AdaBoostM1 improves accuracy by about 3 percentage points
and HLVQ-C by more than 5.
The price to be paid for the accurate detection of defaults is a high rate of false
positives (good clients classified as defaults). To circumvent this situation, an
unbalanced dataset was used with a stricter definition of default. A measure of the
usefulness of the classifier was introduced, and we showed that it improves
considerably on this second dataset.
References
1. Brill J. The importance of credit scoring models in improving cashflow and collections.
Business Credit 7, 1 (1998)
2. Tam KY, Kiang MY. Managerial applications of neural networks: the case of bank failure
predictions. Management Science 47, 926 (1992).
3. Davis RH, Edelman DB, Gammerman AJ. Machine learning algorithms for credit-card
applications. IMA Journal of Mathematics Applied in Business and Industry 51, 43 (1992).
4. Desai VS, Crook JN, Overstreet GA. A comparison of neural networks and linear scoring
models in the credit union environment. European Journal of Operational Research 37, 24
(1996).
5. Jensen HL. Using neural networks for credit scoring. Managerial Finance 26, 18 (1992).
6. Salchenberger LM, Cinar EM, Lash NA. Neural networks: a new tool for predicting thrift
failures. Decision Sciences 23, 899 (1992).
7. West D. Neural network credit scoring models. Computers & Operations Research 27, 1131
(2000).
8. Vieira AS, Neves JC. Improving bankruptcy prediction with Hidden Layer Learning
Vector Quantization. Eur. Account. Rev. 15 (2), 253-275 (2006).
9. Ribeiro B, Vieira A, Neves JC. Sparse Bayesian models: bankruptcy-predictors of
choice? Int. Joint Conf. on Neural Networks, Vancouver, Canada, 3377-3381 (2006).