This document proposes a new approach, based on support vector machines (SVM) and stratified sampling, to reducing the computational cost of credit scoring models. The approach pairs an SVM with F-score feature selection and trains on a sample of the dataset rather than the whole dataset. The method is shown to achieve accuracy competitive with other methods while significantly reducing computational time, by optimizing the feature set and reducing the dataset size.
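To make the proposed pipeline concrete, here is a minimal Python sketch of its three ingredients: F-score feature ranking (following the common Chen and Lin definition), stratified sampling, and an SVM classifier. The synthetic dataset, the top-5 feature cut-off, and the 20% sample fraction are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def f_score(X, y):
    """Chen-Lin F-score of each feature for a binary target y in {0, 1}."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / (den + 1e-12)          # guard against zero variance

# Toy data standing in for a credit dataset (assumption, not the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 0).astype(int)

# 1) Keep only the top-k features by F-score.
top = np.argsort(f_score(X, y))[::-1][:5]

# 2) Train on a stratified 20% sample instead of the full dataset.
X_s, _, y_s, _ = train_test_split(X[:, top], y, train_size=0.2,
                                  stratify=y, random_state=0)

# 3) Fit the SVM on the reduced sample; far fewer rows and columns.
clf = SVC(kernel="rbf").fit(X_s, y_s)
print("training accuracy:", clf.score(X_s, y_s))
```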
A Review on Credit Card Default Modelling using Data Science (YogeshIJTSRD)
In the last few years, credit cards have become one of the major consumer lending products in the U.S. and several other developed nations, representing roughly 30% of total consumer lending (USD 3.6 tn in 2016). Credit cards issued by banks hold the majority of the market share, with approximately 70% of the total outstanding balance. Banks' credit card charge-offs have stabilized after the financial crisis at around 3% of the total outstanding balance. However, there are still differences in charge-off levels between competitors. Harsh Nautiyal | Ayush Jyala | Dishank Bhandari, "A Review on Credit Card Default Modelling using Data Science", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Special Issue | International Conference on Advances in Engineering, Science and Technology - 2021, May 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42461.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/42461/a-review-on-credit-card-default-modelling-using-data-science/harsh-nautiyal
The Use of Genetic Algorithm, Clustering and Feature Selection Techniques in ... (IJMIT JOURNAL)
Decision tree modelling, as one of the data mining techniques, is used for credit scoring of bank customers. The main problem is the construction of decision trees that classify customers optimally. This study presents a new hybrid mining approach to the design of an effective and appropriate credit scoring model. It is based on a genetic algorithm for credit scoring of bank customers, in order to offer credit facilities to each class of customers. A genetic algorithm can help banks score customers by selecting appropriate features and building optimal decision trees. The proposed hybrid classification model combines clustering, feature selection, decision tree, and genetic algorithm techniques. Clustering and feature selection are used to pre-process the input samples before constructing the decision trees in the credit scoring model. The hybrid model selects and combines the best decision trees according to optimality criteria, and from them constructs the final decision tree for credit scoring of customers. On one credit dataset, results confirm that the classification accuracy of the proposed hybrid model is higher than that of almost all the classification models compared in this paper. Furthermore, the number of leaves and the size of the constructed decision tree (i.e., its complexity) are smaller than in other decision tree models. In this work, one financial dataset was chosen for the experiments: the Bank Mellat credit dataset.
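As a rough illustration of the hybrid idea, not the paper's exact procedure, the sketch below uses a tiny genetic algorithm to evolve feature subsets, scoring each subset (its fitness) by the cross-validated accuracy of a decision tree. The population size, mutation rate, and dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=1000, n_features=15, n_informative=4,
                           random_state=1)

def fitness(mask):
    """Cross-validated accuracy of a small tree on the selected features."""
    if not mask.any():
        return 0.0
    tree = DecisionTreeClassifier(max_depth=4, random_state=1)
    return cross_val_score(tree, X[:, mask], y, cv=3).mean()

# Initial population of random feature masks (True = feature selected).
pop = rng.random((20, X.shape[1])) < 0.5
for gen in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the best half
    kids = []
    for _ in range(10):                                # uniform crossover
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        child = np.where(rng.random(X.shape[1]) < 0.5, a, b)
        child ^= rng.random(X.shape[1]) < 0.05         # bit-flip mutation
        kids.append(child)
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best), "accuracy:", fitness(best))
```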
Default Probability Prediction using Artificial Neural Networks in R Programming (Vineet Ojha)
The objective of the project is to analyze the ability of the developed Artificial Neural Network model to forecast the credit risk profile of retail banking loan consumers and credit card customers. From a theoretical point of view, the project provides a literature review on the detailed workings and application of Artificial Neural Networks in credit risk management. Practically, the aim is to present a model for estimating the Probability of Default using an Artificial Neural Network, so as to reap the benefits of non-linear models.
A standing objective of every financial organization is to retain existing customers and attract new prospective customers for the long term. In manual banking, the economic behaviour of a customer and the nature of the organization are captured through a prescribed form called Know Your Customer (KYC). Depositor customers in some sectors (jewellery/gold, arms, money exchange, etc.) carry high risk; customers in other sectors (transport operators, auto dealers, religious organizations) carry medium risk; and the remaining sectors (retail, corporate, service, farming, etc.) carry low risk. At present, counterparty credit risk can be broadly categorized under quantitative and qualitative factors. Although many customer retention and customer attrition systems exist in banks, these methods lack a clear and defined approach for disbursing loans to the business sector. In this paper, we use records of business customers of a retail commercial bank covering the rural and urban areas of Tangail city, Bangladesh, to analyse the major transactional determinants of customers and to build a predictive model of prospective sectors for retail banking. To achieve this, a data mining approach is adopted: a pruned decision tree classification technique is used to develop the model, and its performance is finally tested against Weka results. The paper thereby builds a model to predict prospective business sectors in retail banking.
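For readers unfamiliar with tree pruning, the fragment below shows one standard variant, cost-complexity pruning, using scikit-learn rather than Weka's J48; the synthetic dataset and the mid-range alpha are illustrative assumptions, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Grow a full tree, then inspect the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=2).cost_complexity_pruning_path(X_tr, y_tr)

# Refit with a mid-range alpha: larger alpha means heavier pruning, smaller tree.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=2).fit(X_tr, y_tr)
print("leaves:", pruned.get_n_leaves(), "test accuracy:", pruned.score(X_te, y_te))
```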
Decision Support System using Decision Tree and Neural Networks (Alexander Decker)
The document describes a decision support system that combines decision tree and neural network algorithms. The system was developed to handle both loan-granting decisions and clinical decisions for eye disease diagnosis. Decision trees first segment customers (or patients) into clusters, and the resulting rules are fed into a neural network to better predict the risk of loan default or the presence of eye disease. The system was analyzed and designed using object-oriented methods and implemented in the C programming language with a MATLAB engine. It achieved an 88% success rate in evaluations.
Extended PSO Algorithm for Improvement Problems of the k-Means Clustering Algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. The purpose of clustering is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from instances in other clusters. In this paper we focus on partitional k-means clustering, which, owing to its ease of implementation and high speed on large data sets, remains very popular among clustering algorithms thirty years after its development. To address the problem of k-means becoming trapped in local optima, we propose an extended PSO algorithm named ECPSO. The new algorithm is able to escape local optima and, with high probability, produce the problem's optimal answer. The experimental results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: clustering accuracy and clustering quality.
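The following is a minimal sketch of the general idea behind such hybrids: a particle swarm searches over whole sets of centroid positions to escape the local optima that plain k-means falls into. The swarm parameters and data are illustrative assumptions, and ECPSO itself differs in its details.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ((0, 0), (4, 4), (0, 4))])
k, n_particles, iters = 3, 12, 60

def sse(centroids):
    """Within-cluster sum of squared distances (the k-means objective)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(1).sum()

# Each particle is one full set of k centroids.
pos = rng.uniform(X.min(0), X.max(0), size=(n_particles, k, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([sse(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    vals = np.array([sse(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best SSE found by the swarm:", round(sse(gbest), 2))
```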
IRJET - Prediction of Credit Risks in Lending Bank Loans (IRJET Journal)
This document discusses machine learning models for predicting credit risk in bank loan applications. It begins with an introduction to credit risk assessment and types of loans. Then, it describes how machine learning can be used to more accurately evaluate borrowers' ability to repay loans based on important variables like borrower characteristics, loan details, and repayment status. The document proposes using artificial neural network and support vector machine models to classify borrowers as good or bad credit risks based on these variables. It evaluates the accuracy of support vector machines and boosted decision tree models for the task of credit risk prediction.
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING (mlaij)
Nowadays, there are many risks related to bank loans, both for the bank and for those who receive the loans. Analysing the risk in bank loans requires understanding what risk means. In addition, the number of transactions in the banking sector is growing rapidly, and the huge data volumes now available represent customers' behaviour while the risks around lending increase. Data mining is one of the most motivating and vital areas of research, aimed at extracting information from the tremendous amounts of accumulated data. This paper presents a new model for classifying loan risk in the banking sector using data mining. The model was built using data from the banking sector to predict the status of loans. Three algorithms were used to build the proposed model: J48, BayesNet, and NaiveBayes. The model was implemented and tested using the Weka application. The results are discussed, and a full comparison between the algorithms is conducted. J48 was selected as the best algorithm based on accuracy.
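A hedged sketch of this kind of algorithm comparison, using scikit-learn stand-ins rather than Weka: an entropy-based decision tree approximates J48 (a C4.5 implementation) and Gaussian Naive Bayes approximates NaiveBayes; BayesNet has no direct scikit-learn analogue and is omitted. The dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy loan-status data (assumption; the paper uses a real banking dataset).
X, y = make_classification(n_samples=3000, n_features=12, random_state=4)

models = {
    "C4.5-style tree (J48 analogue)": DecisionTreeClassifier(criterion="entropy"),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()   # 10-fold CV, as in Weka
    print(f"{name}: {acc:.3f}")
```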
Applying Convolutional-GRU for Term Deposit Likelihood Prediction (VandanaSharma356)
Banks normally offer two kinds of deposit accounts: demand deposits, such as current/savings accounts, and term deposits, such as fixed or recurring deposits. Term deposits can help maximize profit from both the bank's and the customer's perspective. This paper focuses on the likelihood of customers subscribing to a term deposit; bank campaign efforts and analysis of customer details can influence subscription chances. The paper presents an automated system that predicts term deposit investment possibilities in advance. It proposes a deep learning based hybrid predictive model that stacks convolutional layers and recurrent neural network (RNN) layers, with the Gated Recurrent Unit (GRU) employed as the RNN. The proposed model is compared with benchmark classifiers such as k-Nearest Neighbors (k-NN), decision tree (DT), and multi-layer perceptron (MLP) classifiers. The experimental study concludes that the proposed model attains an accuracy of 89.59% and an MSE of 0.1041, outperforming the baseline models.
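A minimal Keras sketch of such a stacked Conv-GRU architecture; the input shape, filter count, and unit counts are chosen arbitrarily for illustration, and the paper's exact topology and hyperparameters may differ.

```python
import tensorflow as tf

# Assumed input: each customer encoded as a sequence of 16 feature "steps".
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),  # local patterns
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.GRU(24),                                       # sequence summary
    tf.keras.layers.Dense(1, activation="sigmoid"),                # subscribe or not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```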
Clustering Prediction Techniques in Defining and Predicting Customers Defecti... (IJECEIAES)
With the growth of the e-commerce sector, customers have more choices, which encourages them to divide their purchases among several e-commerce sites and compare competitors' products, increasing the risk of churn. A review of the literature on customer churn models reveals that no prior research has considered both partial and total defection in non-contractual online environments; studies have instead focused on either total or partial defection. This study proposes a customer churn prediction model for an e-commerce context, in which a clustering phase based on the integration of the k-means method and the Length-Recency-Frequency-Monetary (LRFM) model is used to define churn. This is followed by a multi-class prediction phase based on three classification techniques (simple decision tree, artificial neural network, and decision tree ensemble), in which the dependent variable classifies a customer as continuing loyal buying patterns (non-churned), a partial defector (partially churned), or a total defector (totally churned). Macro-averaging measures, including average accuracy and macro-averaged precision, recall, and F1, are used to evaluate classifier performance under 10-fold cross-validation. Using real data from an online store, the results show that the decision tree ensemble model is more effective than the other models at identifying both future partial and total defection.
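As a hedged sketch of the clustering phase, the code below builds the four LRFM features from a toy transaction table and clusters customers with k-means; the column names, date snapshot, and churn interpretation of the clusters are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy transaction log (assumed schema: customer id, order date, order amount).
tx = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2023-01-05", "2023-03-10", "2023-06-01",
                            "2023-02-02", "2023-02-20", "2023-05-15"]),
    "amount": [50.0, 20.0, 35.0, 200.0, 180.0, 15.0],
})
now = pd.Timestamp("2023-07-01")

# LRFM per customer: Length, Recency, Frequency, Monetary.
lrfm = tx.groupby("customer").agg(
    length=("date", lambda d: (d.max() - d.min()).days),
    recency=("date", lambda d: (now - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

# k-means on standardized LRFM; clusters are later labelled as churn classes.
labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(
    StandardScaler().fit_transform(lrfm))
print(lrfm.assign(cluster=labels))
```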
A TWO-STAGE HYBRID MODEL BY USING ARTIFICIAL NEURAL NETWORKS AS FEATURE CONST... (IJDKP)
We propose a two-stage hybrid approach that uses neural networks as feature construction algorithms for bankcard response classification. The hybrid model uses a very simple neural network structure as the feature construction tool in the first stage; the newly created features are then used as additional input variables in a logistic regression in the second stage. The model is compared with the traditional one-stage model in credit customer response classification. We observe that the proposed two-stage model outperforms the one-stage model in terms of accuracy, the area under the ROC curve, and the KS statistic. By creating new features with the neural network technique, the underlying non-linear relationships between variables are identified. Furthermore, by using a very simple neural network structure, the model avoids the usual drawbacks of neural networks: long training time, complex topology, and limited interpretability.
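A hedged sketch of the two-stage idea: a tiny MLP is fitted, its hidden-layer activations are appended to the raw inputs as constructed features, and a logistic regression is trained on the augmented matrix. The hidden activations are recomputed by hand from the fitted weights, since scikit-learn's MLP does not expose them directly; all sizes are illustrative, and fitting the MLP on the full data before cross-validation is a shortcut a real study would avoid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=6)

# Stage 1: small MLP; its hidden units become the constructed features.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="relu",
                    max_iter=1000, random_state=6).fit(X, y)
hidden = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])  # ReLU layer

# Stage 2: logistic regression on raw inputs plus constructed features.
X_aug = np.hstack([X, hidden])
one_stage = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
two_stage = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()
print(f"one-stage: {one_stage:.3f}  two-stage: {two_stage:.3f}")
```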
The document proposes using several machine learning algorithms, including multilayer perceptron neural networks, logistic regression, support vector machines, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C), to improve personal credit scoring accuracy. The algorithms were tested on a large dataset from a Portuguese bank containing over 400,000 entries. HLVQ-C achieved the most accurate results, outperforming traditional linear methods. The document introduces a "usefulness" measure to evaluate classifiers based on earnings from correctly denying credit to risky applicants and losses from misclassifications.
The document discusses the development of a credit default prediction model called Def_Catch using machine learning algorithms. Def_Catch was trained on a dataset of 100,000 examples with 11 attributes related to borrowers' credit histories and demographics. Random forest achieved the highest accuracy of 93.14% at predicting which borrowers would default in the next 2 years, outperforming logistic regression, naive Bayes, decision tree, and multi-layer perceptron models. The top predictors of default included credit utilization, age, number of late payments, debt ratio, and income. Def_Catch provides insights into borrower risk that are difficult to discern from raw data alone.
An Explanation Framework for Interpretable Credit Scoring (gerogepatton)
With the recent boost in enthusiasm for Artificial Intelligence (AI) and Financial Technology (FinTech), applications such as credit scoring have gained substantial academic interest. However, despite ever-growing achievements, the biggest obstacle in most AI systems is their lack of interpretability. This deficiency of transparency limits their application in different domains, including credit scoring. Credit scoring systems help financial experts make better decisions about whether or not to accept a loan application, so that loans with a high probability of default are not accepted. Apart from the noisy and highly imbalanced data challenges faced by such credit scoring models, recent regulations such as the 'right to explanation' introduced by the General Data Protection Regulation (GDPR) and the Equal Credit Opportunity Act (ECOA) have added the need for model interpretability, to ensure that algorithmic decisions are understandable and coherent. A recently introduced concept is eXplainable AI (XAI), which focuses on making black-box models more interpretable. In this work, we present a credit scoring model that is both accurate and interpretable. For classification, state-of-the-art performance on the Home Equity Line of Credit (HELOC) and Lending Club (LC) datasets is achieved using the Extreme Gradient Boosting (XGBoost) model. The model is then further enhanced with a 360-degree explanation framework, which provides the different explanations (global, local feature-based, and local instance-based) required by different people in different situations. Evaluation through functionally-grounded, application-grounded, and human-grounded analysis shows that the explanations provided are simple, consistent, correct, effective, easy to understand, sufficiently detailed, and trustworthy.
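A minimal sketch of the accurate-plus-interpretable pairing described here, using XGBoost for classification and SHAP for one local feature-based explanation; the data is synthetic, and the paper's 360-degree framework is far richer than this single explanation type.

```python
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=8, random_state=7)

# Gradient-boosted classifier (the paper's XGBoost stage).
model = xgboost.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# Local feature-based explanation: SHAP values for one applicant.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print("per-feature contributions for the first applicant:", shap_values[0])
```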
Augmentation of Customer's Profile Dataset Using Genetic Algorithm (RSIS International)
Data is the lifeblood of every type of business. Clean, accurate, and complete data is a prerequisite for decision-making in business processes, and data is one of the most valuable assets of any organization. It is immensely important that businesses focus on the quality of their data, as it can improve business performance by increasing efficiency, streamlining operations, and consolidating data sources. Good quality data helps to improve and simplify processes, eliminate time-consuming rework and, externally, enhance the user's experience, translating into significant financial and operational benefits [1] [2]. All organizations and businesses strive to retain their existing customers and gain new ones, and accurate data enables a business to improve the customer experience. Data augmentation adds value to base data by enhancing the information derived from the existing source. It can reduce the manual intervention required to develop meaningful information and insight from business data, as well as significantly enhance data quality, so that the business can provide a unique customer experience and deliver beyond expectations. Data augmentation is immensely important because it improves the overall productivity of the business and makes the most accurate and relevant information quickly available for decision making.
This work focuses on augmentation of a customer dataset using a Genetic Algorithm (GA). The augmented data are used for customer behavioural analysis. The dataset consists of the different factors inherent in each customer's situation, used to understand market strategy. This behavioural data was used in earlier analysis work [13], where it was found that collecting a very large amount of such data manually is a very cumbersome process, and it was inferred that more data may give more accurate results. Hence it was decided to enrich the dataset using a Genetic Algorithm.
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ... (Melissa Moody)
Researchers Navin Kasa, Andrew Dahbura, and Charishma Ravoori undertook a capstone project—part of the UVA Data Science Institute Master of Science in Data Science program—that addresses credit card fraud detection through a semi-supervised approach, in which clusters of account profiles are created and used for modeling classifiers.
Text pre-processing of multilingual for sentiment analysis based on social ne... (IJECEIAES)
Sentiment analysis (SA) is an enduring research area, especially in the field of text analysis, and text pre-processing is an important aspect of performing SA accurately. This paper presents a text processing model for SA that uses natural language processing techniques on Twitter data. The basic phases are text collection, text cleaning, pre-processing, and feature extraction, after which the data is categorized according to the SA techniques. Keeping the focus on Twitter, the data is extracted in a domain-specific manner. The data cleaning phase considers noisy data, missing data, punctuation, tags, and emoticons. For pre-processing, tokenization is performed, followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset. The accuracy of the classification techniques improved after applying text pre-processing, and dimensionality was reduced. The proposed corpus can be utilized in market analysis, customer behaviour, polling analysis, and brand monitoring, and the text pre-processing pipeline can serve as a baseline for predictive analysis, machine learning, and deep learning algorithms, extended according to the problem definition.
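A dependency-free sketch of the two pre-processing steps the abstract names, tokenization followed by stop word removal, applied to one toy tweet; real pipelines typically use NLTK or spaCy and a far larger stop word list than the one assumed here.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "this"}  # tiny assumed list

def preprocess(tweet: str) -> list[str]:
    """Clean, tokenize, and remove stop words from a tweet."""
    tweet = re.sub(r"https?://\S+|@\w+|#", "", tweet.lower())   # strip URLs, tags
    tokens = re.findall(r"[a-z']+", tweet)                      # tokenization
    return [t for t in tokens if t not in STOP_WORDS]           # stop word removal

print(preprocess("This is a #great day! Thanks @bank https://t.co/xyz"))
# -> ['great', 'day', 'thanks']
```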
A Systems Engineering Methodology for Wide Area Network Selection (Alexander Decker)
This document describes a study that applies the Analytic Hierarchy Process (AHP) to help a company select the best wide area network (WAN) solution based on their requirements. The document provides background on AHP and reviews related literature on using multi-criteria decision making for selection problems. It then outlines the steps of AHP, including constructing a hierarchy, performing pairwise comparisons, and calculating weights and consistency. Finally, it describes how AHP could be applied to help the hypothetical company evaluate WAN alternatives and select the optimal solution.
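To make the AHP steps concrete, here is a small sketch: a pairwise comparison matrix over three assumed criteria, weights from the principal eigenvector, and the consistency-ratio check. The matrix values are invented for illustration.

```python
import numpy as np

# Pairwise comparisons of three assumed criteria: cost, bandwidth, reliability.
# A[i, j] = how strongly criterion i is preferred over criterion j (Saaty scale).
A = np.array([[1.0, 3.0, 0.5],
              [1/3, 1.0, 0.25],
              [2.0, 4.0, 1.0]])

# Weights: principal eigenvector of A, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
principal = eigvecs[:, eigvals.real.argmax()].real
weights = principal / principal.sum()

# Consistency ratio: CI = (lambda_max - n) / (n - 1), divided by the random
# index RI (0.58 for n = 3); CR < 0.1 is conventionally acceptable.
n = A.shape[0]
lam_max = eigvals.real.max()
cr = ((lam_max - n) / (n - 1)) / 0.58
print("weights:", np.round(weights, 3), "CR:", round(cr, 3))
```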
IRJET - Machine Learning Classification Algorithms for Predictive Analysis in ... (IRJET Journal)
This document discusses machine learning classification algorithms and their applications for predictive analysis in healthcare. It provides an overview of data mining techniques like association, classification, clustering, prediction, and sequential patterns. Specific classification algorithms discussed include Naive Bayes, Support Vector Machine, Decision Trees, K-Nearest Neighbors, Neural Networks, and Bayesian Methods. The document examines examples of these algorithms being used for disease diagnosis, prognosis, and healthcare management. It analyzes their predictive performance on datasets for conditions like breast cancer, heart disease, and ICU readmissions. Overall, the document reviews how machine learning techniques can enhance predictive accuracy for various healthcare problems.
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS (csandit)
The Web is considered one of the main sources of customer opinions and reviews, which are represented in two formats: structured data (numeric ratings) and unstructured data (textual comments). Millions of textual comments about goods and services are posted on the web by customers, and thousands are added every day, making it a big challenge to read and understand them and turn them into useful structured data for customers and decision makers. Sentiment analysis, or opinion mining, is a popular technique for summarizing and analyzing such opinions and reviews. In this paper, we use natural language processing techniques to generate rules that help us understand customer opinions and reviews (textual comments) written in the Arabic language, with the purpose of understanding each one and then converting it to structured data. We use adjectives as the key to highlighting important information in the text, then work around them to tag the attributes that describe the subject of the review, associating each attribute with its value (adjective).
A Predictive System for Detection of Bankruptcy using Machine Learning Techni... (IJDKP)
Bankruptcy is a legal procedure that declares a person or organization a debtor. It is essential to ascertain the risk of bankruptcy at an early stage to prevent financial losses. In this perspective, different soft computing techniques can be employed to assess bankruptcy. This study proposes a bankruptcy prediction system that categorizes companies according to their extent of risk and acts as a decision support tool for the detection of bankruptcy.
Growing Health Analytics Without Hiring New Staff (ThotWave)
This document summarizes a presentation about growing health analytics capabilities without hiring new staff. It discusses healthcare's data problem and how survival depends on competing well with data. The solution presented is to develop existing staff into "data champions" through training. The presentation outlines investing in self-service data access, using analytic tools, developing enterprise analytics literacy, and training curious staff to analyze data and become independent. This empowers organizations to leverage their data without extensive hiring.
Developing a PhD Research Topic for Your Research | PhD Assistance UK (PhDAssistanceUK)
According to a UGC report, approximately 77,000 scholars enrol in PhD research programs in India every year, but only a third of them successfully complete the doctorate (Marg, 2015). This clearly indicates the difficulty and the standard of evaluation of the program, so it is necessary that scholars thoroughly analyze and select a good PhD research topic before plunging into the research work. Many methodologies and techniques have been introduced over time for effective selection and decision making in various fields. Data Driven Decision Making (DDDM), also referred to as Data Based Decision Making, is a modern method found to be effective and efficient for strategic and systematic development and decision making. This article summarizes the framework, effectiveness, and benefits of DDDM in the selection of research topics for PhD scholars.
Keyword Based Service Recommendation System for Hotel System using Collaborat... (IRJET Journal)
This document presents a keyword-based service recommendation system using collaborative filtering to address challenges with traditional recommender systems when dealing with large datasets. The proposed system captures user preferences through keywords selected from a candidate list. It identifies similar users based on keyword similarities in preferences. It then calculates personalized ratings for services for a given user and generates a personalized recommendation list. The system aims to provide more accurate and scalable recommendations compared to existing approaches by incorporating keywords to represent user preferences.
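A hedged sketch of the keyword-based collaborative filtering idea: Jaccard similarity between users' keyword preference sets finds neighbours, whose ratings are combined into a similarity-weighted score. The users, keywords, and ratings are all invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity between two keyword preference sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Assumed data: each user's preferred keywords and their hotel ratings.
prefs = {"alice": {"wifi", "pool", "breakfast"},
         "bob": {"wifi", "parking", "breakfast"},
         "carol": {"spa", "gym"}}
ratings = {"bob": {"Hotel A": 4.5, "Hotel B": 2.0},
           "carol": {"Hotel A": 1.0, "Hotel B": 5.0}}

def predict(user: str, hotel: str) -> float:
    """Personalized score: similarity-weighted average of neighbour ratings."""
    sims = [(jaccard(prefs[user], prefs[u]), r[hotel])
            for u, r in ratings.items() if hotel in r]
    total = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / total if total else 0.0

print("alice -> Hotel A:", round(predict("alice", "Hotel A"), 2))
```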
Evidence Data Preprocessing for Forensic and Legal Analytics (CSCJournals)
The document discusses best practices for preprocessing evidentiary data from legal cases or forensic investigations for use in analytical experiments. It outlines key steps such as identifying the analytical aim or problem based on the case scope or investigation protocol, and understanding the case data through assessment and exploration of its format, features, quality, and potential issues. Challenges of working with common text-based case data, such as emails and social media posts, are also discussed. The goal is to clean and transform raw data into a format suitable for machine learning or other advanced analytical techniques while maintaining integrity and relevance to the case.
CHURN ANALYSIS AND PLAN RECOMMENDATION FOR TELECOM OPERATORS (Journal For Research)
This document summarizes research on customer churn prediction and retention strategies in the telecommunications industry. It provides an abstract for a research paper on using a hybrid machine learning classifier and rule engine to predict customer churn and recommend retention plans. The document then reviews related work applying techniques like fuzzy logic, ordinal regression, artificial neural networks, and ensemble methods to predict churn. It outlines the problem of predicting churn and proposing recommendation rules. The proposed solution is a hybrid machine learning model to predict churn with high accuracy and explain reasons for churn to help recommend plans to retain customers.
Instance Selection and Optimization of Neural Networks (ITIIIndustries)
Credit scoring is an important tool in financial institutions, used in credit granting decisions. Credit applications are scored by credit scoring models: those with high scores are treated as "good", while those with low scores are regarded as "bad". As data mining techniques develop, automatic credit scoring systems are warmly welcomed for their high efficiency and objective judgments. Many machine learning algorithms have been applied to training credit scoring models, and the ANN is one of them, with good performance. This paper presents a higher-accuracy credit scoring model based on MLP neural networks trained with the back-propagation algorithm. Our work focuses on enhancing credit scoring models in three respects: optimizing the data distribution in datasets using a new method called Average Random Choosing; comparing the effects of different numbers of training, validation, and test instances; and finding the most suitable number of hidden units. Another contribution of this paper is summarizing how scoring accuracy tends to change as the number of hidden units increases. The experimental results show that our methods can achieve high credit scoring accuracy on imbalanced datasets; thus, credit granting decisions can be made by data mining methods using MLP neural networks.
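As a rough illustration of the hidden-unit search described above (not of the Average Random Choosing procedure, which is specific to the paper), the sketch below sweeps the size of one hidden layer and reports cross-validated accuracy on an imbalanced synthetic dataset; all sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Imbalanced toy credit data: roughly 85% "good", 15% "bad" (assumed ratio).
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.85], random_state=8)

# Sweep the hidden layer size and watch how accuracy trends.
for units in (2, 4, 8, 16, 32):
    mlp = MLPClassifier(hidden_layer_sizes=(units,), max_iter=1500,
                        random_state=8)
    acc = cross_val_score(mlp, X, y, cv=3).mean()
    print(f"{units:>2} hidden units: {acc:.3f}")
```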
DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING mlaij
Nowadays, There are many risks related to bank loans, for the bank and for those who get the loans. The
analysis of risk in bank loans need understanding what is the meaning of risk. In addition, the number of
transactions in banking sector is rapidly growing and huge data volumes are available which represent
the customers behavior and the risks around loan are increased. Data Mining is one of the most motivating
and vital area of research with the aim of extracting information from tremendous amount of accumulated
data sets. In this paper a new model for classifying loan risk in banking sector by using data mining. The
model has been built using data form banking sector to predict the status of loans. Three algorithms have
been used to build the proposed model: j48, bayesNet and naiveBayes. By using Weka application, the
model has been implemented and tested. The results has been discussed and a full comparison between
algorithms was conducted. J48 was selected as best algorithm based on accuracy.
Applying Convolutional-GRU for Term Deposit Likelihood PredictionVandanaSharma356
Banks are normally offered two kinds of deposit accounts. It consists of deposits like current/saving account and term deposits like fixed or recurring deposits.For enhancing the maximized profit from bank as well as customer perspective, term deposit can accelerate uplifting of finance fields. This paper focuses on likelihood of term deposit subscription taken by the customers. Bank campaign efforts and customer detail analysis caninfluence term deposit subscription chances. An automated system is approached in this paper that works towards prediction of term deposit investment possibilities in advance. This paper proposes deep learning based hybrid model that stacks Convolutional layers and Recurrent Neural Network (RNN) layers as predictive model. For RNN, Gated Recurrent Unit (GRU) is employed. The proposed predictive model is later compared with other benchmark classifiers such as k-Nearest Neighbor (k-NN), Decision tree classifier (DT), and Multi-layer perceptron classifier (MLP). Experimental study concludesthat proposed model attainsan accuracy of 89.59% and MSE of 0.1041 which outperform wellother baseline models.
Clustering Prediction Techniques in Defining and Predicting Customers Defecti...IJECEIAES
With the growth of the e-commerce sector, customers have more choices, a fact which encourages them to divide their purchases amongst several ecommerce sites and compare their competitors‟ products, yet this increases high risks of churning. A review of the literature on customer churning models reveals that no prior research had considered both partial and total defection in non-contractual online environments. Instead, they focused either on a total or partial defect. This study proposes a customer churn prediction model in an e-commerce context, wherein a clustering phase is based on the integration of the k-means method and the Length-RecencyFrequency-Monetary (LRFM) model. This phase is employed to define churn followed by a multi-class prediction phase based on three classification techniques: Simple decision tree, Artificial neural networks and Decision tree ensemble, in which the dependent variable classifies a particular customer into a customer continuing loyal buying patterns (Non-churned), a partial defector (Partially-churned), and a total defector (Totally-churned). Macroaveraging measures including average accuracy, macro-average of Precision, Recall, and F-1 are used to evaluate classifiers‟ performance on 10-fold cross validation. Using real data from an online store, the results show the efficiency of decision tree ensemble model over the other models in identifying both future partial and total defection.
A TWO-STAGE HYBRID MODEL BY USING ARTIFICIAL NEURAL NETWORKS AS FEATURE CONST...IJDKP
We propose a two-stage hybrid approach with neural networks as the new feature construction algorithms for bankcard response classifications. The hybrid model uses a very simpleneural network structure as the new feature construction tool in the firststage, thenthe newly created features are used asthe additional input variables in logistic regression in the second stage. The modelis compared with the traditional onestage model in credit customer response classification. It is observed that the proposed two-stage model outperforms the one-stage model in terms of accuracy, the area under ROC curve, andKS statistic. By creating new features with theneural network technique, the underlying nonlinear relationships between variables are identified. Furthermore, by using a verysimple neural network structure, the model could overcome the drawbacks of neural networks interms of its long training time, complex topology, and limited interpretability.
The document proposes using several machine learning algorithms, including multilayer perceptron neural networks, logistic regression, support vector machines, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C), to improve personal credit scoring accuracy. The algorithms were tested on a large dataset from a Portuguese bank containing over 400,000 entries. HLVQ-C achieved the most accurate results, outperforming traditional linear methods. The document introduces a "usefulness" measure to evaluate classifiers based on earnings from correctly denying credit to risky applicants and losses from misclassifications.
The document discusses the development of a credit default prediction model called Def_Catch using machine learning algorithms. Def_Catch was trained on a dataset of 100,000 examples with 11 attributes related to borrowers' credit histories and demographics. Random forest achieved the highest accuracy of 93.14% at predicting which borrowers would default in the next 2 years, outperforming logistic regression, naive bayes, decision trees, and multi-layer perceptron models. The top predictors of default included credit utilization, age, number of late payments, debt ratio, and income. Def_Catch provides insights into borrower risk that are difficult to discern from raw data alone.
An Explanation Framework for Interpretable Credit Scoring gerogepatton
With the recent boosted enthusiasm in Artificial Intelligence (AI) and Financial Technology (FinTech),
applications such as credit scoring have gained substantial academic interest. However, despite the evergrowing achievements, the biggest obstacle in most AI systems is their lack of interpretability. This
deficiency of transparency limits their application in different domains including credit scoring. Credit
scoring systems help financial experts make better decisions regarding whether or not to accept a loan
application so that loans with a high probability of default are not accepted. Apart from the noisy and
highly imbalanced data challenges faced by such credit scoring models, recent regulations such as the
`right to explanation' introduced by the General Data Protection Regulation (GDPR) and the Equal Credit
Opportunity Act (ECOA) have added the need for model interpretability to ensure that algorithmic
decisions are understandable and coherent. A recently introduced concept is eXplainable AI (XAI), which
focuses on making black-box models more interpretable. In this work, we present a credit scoring model
that is both accurate and interpretable. For classification, state-of-the-art performance on the Home
Equity Line of Credit (HELOC) and Lending Club (LC) Datasets is achieved using the Extreme Gradient
Boosting (XGBoost) model. The model is then further enhanced with a 360-degree explanation framework,
which provides different explanations (i.e. global, local feature-based and local instance- based) that are
required by different people in different situations. Evaluation through the use of functionally-grounded,
application-grounded and human-grounded analysis shows that the explanations provided are simple and
consistent as well as correct, effective, easy to understand, sufficiently detailed and trustworthy.
Augmentation of Customer’s Profile Dataset Using Genetic AlgorithmRSIS International
Data is the lifeblood of all type of business. Clean,
accurate and complete data is the prerequisite for the decisionmaking
in business process. Data is one of the most valuable
assets for any organization. It is immensely important that the
business focus on the quality of their data as it can help in
increasing the business performance by improving efficiencies,
streamlining operations and consolidating data sources. Good
quality data helps to improve and simplify processes, eliminate
time-consuming rework and externally to enhance a user’s
experience, further translating it to significant financial and
operational benefits [1] [2]. All organizations/ businesses strive to
retain their existing customers and gain new ones. Accurate data
enables the business to improve the customer experience. Data
augmentation adds value to base data by enhancing information
derived from the existing source. Data augmentation can help
reduce the manual intervention required to develop meaningful
information and insight of business data, as well as significantly
enhance data quality. Hence the business can provide unique
customer experience and deliver above and beyond their
expectations. The Data Augmentation is immensely important as
it helps in improving the overall productivity of the business. It
is also important in making the most accurate and relevant
information available quickly for decision making.
This work focuses on augmentation of the customer
dataset using Genetic Algorithm(GA). These augmented data are
used for the purpose of customer behavioral analysis. The data
set consists of the different factors inherent in each situation of
the customer to understand the market strategy. This behavioral
data is used in the earlier work of analyzing the data [13]. It is
found that collecting a very large amount of such data manually
is a very cumbersome process. It is inferred from the earlier
work [13] that the more number of data may give accurate
result. Hence it is decided to enrich the dataset by using Genetic
Algorithm.
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Melissa Moody
Researchers Navin Kasa, Andrew Dahbura, and Charishma Ravoori undertook a capstone project—part of the UVA Data Science Institute Master of Science in Data Science program—that addresses credit card fraud detection through a semi-supervised approach, in which clusters of account profiles are created and used for modeling classifiers.
Text pre-processing of multilingual for sentiment analysis based on social ne...IJECEIAES
Sentiment analysis (SA) is an enduring area for research especially in the field of text analysis. Text pre-processing is an important aspect to perform SA accurately. This paper presents a text processing model for SA, using natural language processing techniques for twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extractions in a text and then categorize the data according to the SA techniques. Keeping the focus on twitter data, the data is extracted in domain specific manner. In data cleaning phase, noisy data, missing data, punctuation, tags and emoticons have been considered. For pre-processing, tokenization is performed which is followed by stop word removal (SWR). The proposed article provides an insight of the techniques, that are used for text pre-processing, the impact of their presence on the dataset. The accuracy of classification techniques has been improved after applying text preprocessing and dimensionality has been reduced. The proposed corpus can be utilized in the area of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as the baseline to apply predictive analysis, machine learning and deep learning algorithms which can be extended according to problem definition.
A systems engineering methodology for wide area network selectionAlexander Decker
This document describes a study that applies the Analytic Hierarchy Process (AHP) to help a company select the best wide area network (WAN) solution based on their requirements. The document provides background on AHP and reviews related literature on using multi-criteria decision making for selection problems. It then outlines the steps of AHP, including constructing a hierarchy, performing pairwise comparisons, and calculating weights and consistency. Finally, it describes how AHP could be applied to help the hypothetical company evaluate WAN alternatives and select the optimal solution.
IRJET- Machine Learning Classification Algorithms for Predictive Analysis in ...IRJET Journal
This document discusses machine learning classification algorithms and their applications for predictive analysis in healthcare. It provides an overview of data mining techniques like association, classification, clustering, prediction, and sequential patterns. Specific classification algorithms discussed include Naive Bayes, Support Vector Machine, Decision Trees, K-Nearest Neighbors, Neural Networks, and Bayesian Methods. The document examines examples of these algorithms being used for disease diagnosis, prognosis, and healthcare management. It analyzes their predictive performance on datasets for conditions like breast cancer, heart disease, and ICU readmissions. Overall, the document reviews how machine learning techniques can enhance predictive accuracy for various healthcare problems.
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWScsandit
The Web considers one of the main sources of customer opinions and reviews which they are represented in two formats; structured data (numeric ratings) and unstructured data (textual comments). Millions of textual comments about goods and services are posted on the web by customers and every day thousands are added, make it a big challenge to read and understand them to make them a useful structured data for customers and decision makers. Sentiment
analysis or Opinion mining is a popular technique for summarizing and analyzing those opinions and reviews. In this paper, we use natural language processing techniques to generate some rules to help us understand customer opinions and reviews (textual comments) written in the Arabic language for the purpose of understanding each one of them and then convert them to a structured data. We use adjectives as a key point to highlight important information in the text then we work around them to tag attributes that describe the subject of the reviews, and we associate them with their values (adjectives).
A predictive system for detection of bankruptcy using machine learning techni...IJDKP
Bankruptcy is a legal procedure that claims a person or organization as a debtor. It is essential to
ascertain the risk of bankruptcy at initial stages to prevent financial losses. In this perspective, different
soft computing techniques can be employed to ascertain bankruptcy. This study proposes a bankruptcy
prediction system to categorize the companies based on extent of risk. The prediction system acts as a
decision support tool for detection of bankruptcy
Growing Health Analytics Without Hiring new StaffThotWave
This document summarizes a presentation about growing health analytics capabilities without hiring new staff. It discusses healthcare's data problem and how survival depends on competing well with data. The solution presented is to develop existing staff into "data champions" through training. The presentation outlines investing in self-service data access, using analytic tools, developing enterprise analytics literacy, and training curious staff to analyze data and become independent. This empowers organizations to leverage their data without extensive hiring.
Developing a PhD Research Topic for Your Research| PhD Assistance UKPhDAssistanceUK
According to UGC report, every year approximately 77,000 scholars are enrolling for PhD research programs in India. but only 1/3rd of them successfully complete the doctorate every year (Marg, 2015). This result clearly indicates the difficulty and standard of evaluation of this program, so it is necessary that the scholar thoroughly analyze and select a good PhD research topic before plunging into the research work. There are many methodologies and techniques introduced over a period for effective selection and decision making in various field. Data Driven Decision Making which is also referred as Data Based Decision Making is a modern method which was found to be effective and efficient for strategic and systematic development and decision making. So, this article summarizes the framework, effectiveness, benefits of DDDM in selection of Research topics for the PhD scholars.
Keyword Based Service Recommendation system for Hotel System using Collaborat...IRJET Journal
This document presents a keyword-based service recommendation system using collaborative filtering to address challenges with traditional recommender systems when dealing with large datasets. The proposed system captures user preferences through keywords selected from a candidate list. It identifies similar users based on keyword similarities in preferences. It then calculates personalized ratings for services for a given user and generates a personalized recommendation list. The system aims to provide more accurate and scalable recommendations compared to existing approaches by incorporating keywords to represent user preferences.
Evidence Data Preprocessing for Forensic and Legal AnalyticsCSCJournals
The document discusses best practices for preprocessing evidentiary data from legal cases or forensic investigations for use in analytical experiments. It outlines key steps like identifying the analytical aim or problem based on the case scope or investigation protocol, understanding the case data through assessment and exploration of its format, features, quality, and potential issues. Challenges of working with common text-based case data like emails, social media posts are also discussed. The goal is to clean and transform raw data into a suitable format for machine learning or other advanced analytical techniques while maintaining integrity and relevance to the case.
CHURN ANALYSIS AND PLAN RECOMMENDATION FOR TELECOM OPERATORSJournal For Research
This document summarizes research on customer churn prediction and retention strategies in the telecommunications industry. It provides an abstract for a research paper on using a hybrid machine learning classifier and rule engine to predict customer churn and recommend retention plans. The document then reviews related work applying techniques like fuzzy logic, ordinal regression, artificial neural networks, and ensemble methods to predict churn. It outlines the problem of predicting churn and proposing recommendation rules. The proposed solution is a hybrid machine learning model to predict churn with high accuracy and explain reasons for churn to help recommend plans to retain customers.
Instance Selection and Optimization of Neural NetworksITIIIndustries
Credit scoring is an important tool in financial institutions that can be used in credit granting decisions. Credit applications are scored by credit scoring models: those with high scores are treated as "good", while those with low scores are regarded as "bad". As data mining techniques develop, automatic credit scoring systems are welcomed for their high efficiency and objective judgments. Many machine learning algorithms have been applied to training credit scoring models, and the ANN is one of them with good performance. This paper presents a higher-accuracy credit scoring model based on MLP neural networks trained with the back propagation algorithm. The work focuses on enhancing credit scoring models in three respects: optimizing the data distribution in datasets using a new method called Average Random Choosing; comparing the effects of the numbers of training, validation, and test instances; and finding the most suitable number of hidden units. Another contribution of the paper is summarizing the tendency of scoring accuracy as the number of hidden units increases. The experimental results show that these methods can achieve high credit scoring accuracy with imbalanced datasets; thus, credit granting decisions can be made by data mining methods using MLP neural networks.
Overview of privacy and data protection considerations for DEVELOPTrilateral Research
This document discusses ethical, privacy, and data protection considerations for a project called DEVELOP. It outlines relevant ethical values like autonomy, dignity, inclusion and beneficence. It also discusses the right to privacy under the European Convention on Human Rights and EU data protection law. The document provides an overview of the EU Data Protection Directive and the new General Data Protection Regulation. It raises specific privacy and data protection issues like informed consent, data minimization, and anonymity that DEVELOP should address.
The document outlines a vision and plan to promote a cashless society in India through digital literacy initiatives. It aims to educate citizens on cashless transactions, help small businesses adopt digital payments, and remove fears around online banking. Key aspects of the plan include developing a personalized learning app and booklet on cashless payments, creating self-organized learning environments in village libraries led by student volunteers, and holding a "Digital Literacy Week" campaign. The initiatives seek to address India's low digital literacy rate of 30% and the need to teach cashless skills through hands-on learning versus traditional classroom methods. The end goal is to empower citizens across India to participate in and benefit from a digital financial system.
“One Library Per Village” is a revolutionary notion, and when people don’t have free access to books and the internet, then communities are like radios without batteries. You cut people off from essential sources of information — mythical, practical, linguistic, political — and you break them.
Credit Scoring Using CART Algorithm and Binary Particle Swarm Optimization IJECEIAES
Credit scoring is a procedure that exists in every financial institution: a way to predict whether a debtor is qualified to be given a loan, and a major concern throughout the loan process. Almost all banks and other financial institutions have their own credit scoring methods. Nowadays, the data mining approach is accepted as one of the well-known methods, and accuracy is a major issue in this approach. This research proposed a hybrid method using the CART algorithm and Binary Particle Swarm Optimization. The performance indicators used in this research are classification accuracy, error rate, sensitivity, specificity, and precision. Experimental results based on public datasets showed that the proposed method's accuracy is 78% and 87.53%. Compared with several popular algorithms, such as neural networks, logistic regression, and support vector machines, the proposed method showed outstanding performance.
# Project 03 - Data Mining on Financial Data
In this project, various models such as Logistic Regression, KNN, SVM, and Random Forest have been applied to three finance-related datasets in order to discover insights from them. The methods are applied to the datasets in R Studio and the corresponding outputs are shown in this paper. The applied methods are then compared with each other to identify how well each performs on each dataset, and the better method is chosen for each dataset. The methods are explained in detail, along with their advantages, with the help of information from the relevant papers. Every method applied to the datasets has been checked for conformance with the data mining methodologies. Data mining methodologies such as CRISP-DM, KDD, and SEMMA are also explained in detail, including their process flows, and it is verified that the process flow was followed while applying each method. Data mining is now considered a major factor in the risk management process of financial institutions. Even though various data mining tools exist in the market, this paper allows readers to understand how each algorithm works on a dataset and how to justify an algorithm's predictions.
Supervised and unsupervised data mining approaches in loan default prediction IJECEIAES
Given the paramount importance of data mining in organizations and the possible contribution of a data-driven customer classification recommender system for loan-extending financial institutions, the study applied supervised and unsupervised data mining approaches to derive the best classifier of loan default. A total of 900 instances with determined attributes and class labels were used for the training and cross-validation processes, while prediction used 100 new instances without class labels. In the training phase, J48 with a confidence factor of 50% attained a classification accuracy of 76.85%, k-nearest neighbors (k-NN) 3 the highest (78.38%) among the IBk variants, naïve Bayes 76.65%, and logistic regression 77.31%. k-NN 3 and logistic regression had the highest classification accuracy, F-measures, and kappa statistics. Implementation of these algorithms on the test set yielded 48 non-defaulters and 52 defaulters for k-NN 3, and 44 non-defaulters and 56 defaulters under logistic regression. Implications are discussed in the paper.
This document presents a study comparing several machine learning models for personal credit scoring: logistic regression, multilayer perceptron, support vector machine, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C). The models were tested on datasets from a Portuguese bank. HLVQ-C achieved the highest accuracy and was the most useful model according to a proposed measure that considers earnings from denying bad credits and losses from denying good credits. While other models had higher error rates for good credits, HLVQ-C balanced accuracy and usefulness the best, making it suitable for commercial credit scoring applications.
LOAN APPROVAL PREDICTION SYSTEM USING MACHINE LEARNING.Souma Maiti
This document describes a machine learning model for predicting loan approvals. It discusses collecting loan application data and preprocessing the data which includes cleaning, feature selection, and scaling. Various machine learning algorithms are trained on the data including logistic regression, decision trees, random forest, support vector machines, and gradient boosting. Their accuracies are compared and random forest is found to perform best. The optimal model is deployed with a user interface created using Streamlit. The system aims to automate and improve the loan approval process for banks.
Credit risk assessment with imbalanced data sets using SVMsIRJET Journal
This document discusses using support vector machines (SVMs) to assess credit risk with imbalanced data sets. SVMs have limited performance with imbalanced credit data where unpaid loans are less frequent than paid loans. The author develops an SVM model using two data resampling techniques - random oversampling and SMOTE - to address class imbalance. Performance is evaluated using various criteria like accuracy, sensitivity, specificity, and AUC. The results suggest resampling data can improve SVM performance for accurate credit risk prediction with imbalanced data.
Loan Default Prediction Using Machine Learning TechniquesIRJET Journal
This document discusses using machine learning techniques to predict loan defaults. It begins with an abstract that outlines using data collection, cleaning, and performance assessment to predict loan defaulters. It then discusses implementing models like decision trees and KNN (K-nearest neighbors) for classification and regression. The document evaluates the performance of these models on loan default prediction and concludes the KNN model performs better. It proposes using both models and comparing their accuracy to improve prediction performance and minimize risk.
Credit Card Fraud Detection Using Machine LearningIRJET Journal
This document discusses machine learning methods for credit card fraud detection using imbalanced datasets. It provides a literature review summarizing several research papers on this topic. The papers evaluated algorithms like logistic regression, decision trees, random forests and support vector machines for credit card fraud prediction. Random forests generally performed best, accurately predicting defaults 77% of the time in one study. The document emphasizes that imbalanced datasets can bias models towards the majority class. Sampling techniques like adaptive synthetic sampling were found to improve classifier performance on minority default classes. No single algorithm can fully address class imbalance, and more diverse credit card data is needed to develop better predictive models.
Credit Card Fraud Detection Using Machine LearningIRJET Journal
This document summarizes a research paper that compares various machine learning algorithms for credit card fraud detection using an imbalanced credit card transaction dataset. The paper analyzes logistic regression, decision tree, random forest, and support vector machine classifiers on the basis of precision, recall, and accuracy. It finds that random forest provides the best performance with an accuracy of 77% for predicting credit card defaulters from the imbalanced data. The paper concludes that machine learning algorithms can effectively identify high-risk credit card users to help reduce financial losses for credit card issuers.
BANK LOAN PREDICTION USING MACHINE LEARNINGIRJET Journal
This document discusses using machine learning models to predict whether a bank will approve a loan application. It begins with an abstract that explains the goal is to accurately predict loan approvals to help banks and reduce risk. It then reviews related literature on using algorithms like decision trees, naive Bayes, random forests, and support vector machines (SVM) for loan prediction. The document proposes using models trained on applicant data to determine their likelihood of repayment. It describes collecting data, preparing it, selecting SVM as the best performing model, training and testing the model, and saving it to make predictions on new applicants. Accurately predicting approvals could help banks process applications faster and with less risk.
This document summarizes and compares various machine learning models for credit scoring and investment decisions using explainable AI techniques. It finds that ensemble classifiers like random forests and neural networks outperform individual classifiers. LIME and SHAP techniques are used to explain ML credit scoring models. The study also develops new investment models using ML algorithms to maximize profit while minimizing risk. A variety of ML algorithms are tested, including logistic regression, decision trees, LDA, QDA, AdaBoost, random forests, and neural networks. The random forest and AdaBoost models are tuned with hyperparameters. Model performance is evaluated using metrics like accuracy, derived from a confusion matrix.
Meta Classification Technique for Improving Credit Card Fraud Detection IJSTA
This document summarizes a research paper that proposes an ensemble meta classification technique using D-TREE, SVM, KNN and FA algorithms to improve credit card fraud detection. The researchers tested their proposed method on credit card transaction data and found that the ensemble method of FA-D-TREE, SVM, KNN outperformed using the individual algorithms alone. It achieved a lower error rate compared to other approaches. The document provides background on credit card fraud detection and reviews related work applying neural networks, decision trees, SVM and frequent item set mining methods.
Synthetic feature generation to improve accuracy in prediction of credit limitsBOHRInternationalJou1
Financial institutions use various data mining algorithms to determine the credit limits for individuals using features
like age, education, employment, gender, income, and marital status. But, there is still a question of accurate
predictability, that is, how accurate can an institution be in predicting risk and granting credit levels. If an institution
grants too low of a credit limit/loan for an individual, then the institution may lose business to competitors, but
if the institution grants too high of a credit limit/loan, then the institution may lose money if that individual does
not repay the credit/loan. The novelty of this work is that it shows how to improve the accuracy in predicting
credit limits/loan amounts using synthetic feature generation. By creating secondary groupings and including
both the original binning and the synthetic bins, the classification accuracy and other statistical measures like
precision and ROC improved substantially. Hence, our research showed that without synthetic feature generation,
the classification rates were low, and the use of synthetic features greatly improved the classification accuracy and
other statistical measures.
This document discusses the rise of predictive analytics and its value in enterprise decision making. It begins by explaining how predictive analytics has expanded from niche uses to a widely adopted competitive technique, fueled by big data, improved analytics tools, and demonstrated successes. A classic example given is credit scoring, which uses predictive models to assess credit risk. The document then provides examples of other areas where predictive models generate value, such as marketing, customer retention, pricing, and fraud prevention. It discusses how effective predictive models are built by using statistical techniques on data that describes predictive factors and outcomes. The document argues that predictive models provide the most value when applied to processes involving large volumes of similar decisions that have significant financial or other impacts, and where relevant electronic
The document presents a study comparing various machine learning algorithms for personal credit scoring, including logistic regression, multilayer perceptron, support vector machines, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C). The models were tested on datasets from a Portuguese bank. HLVQ-C achieved the highest accuracy on both datasets and the best performance on a proposed "usefulness" metric, making it the most effective model for credit scoring applications according to this research.
In Banking Loan Approval Prediction Using Machine LearningIRJET Journal
This document discusses using machine learning algorithms to predict loan approvals. It analyzes loan data using decision trees, logistic regression, and random forest algorithms. The random forest algorithm achieved the highest accuracy rate of 88.53% compared to 85% for decision trees and 83.04% for logistic regression. Therefore, the random forest algorithm is concluded to be best for loan approval prediction. Future work could involve applying these algorithms to other loan data sets and exploring additional machine learning methods.
PROBABILISTIC CREDIT SCORING FOR COHORTS OF BORROWERSAndresz26
This document proposes a methodology for probabilistic credit scoring of cohorts (groups) of borrowers with similar characteristics. It begins by discussing existing credit scoring models that focus on individual applicants. It then outlines a methodology to score groups that considers the diversity within each group and the distribution of default probabilities. The key steps are: 1) estimating default probabilities for individuals using a logit model, 2) estimating distributions of variables for each group, 3) simulating individuals in each group to capture diversity, and 4) incorporating residual variability to account for unobserved factors. This allows ranking groups by risk while addressing uncertainty about a group's composition and risk preferences of decision makers.
This document summarizes a research study that developed a discriminant analysis model to classify loan applications as accepted or rejected using ranked data. The study used a sample of 350 loan applications, including variables like credit rating, occupation, loan-to-value ratio, and payment-to-income ratio. Rank transformation was applied to minimize outliers and non-normality. Statistical software was used to generate classification functions and classify applications. The resulting model based on ranked data provided accurate classifications without violating assumptions of traditional discriminant analysis.
The document discusses a loan prediction system project that aims to automate the loan approval process and make it more efficient. A machine learning model will be developed using datasets to predict if borrowers should be approved for loans based on variables like income, employment status, etc. The proposed system will have a web interface. It will be implemented in two phases, with the front end collecting user inputs and the back end applying a logistic regression model trained on past data to make predictions. The model aims to accurately predict loan eligibility without manual reviews of each applicant.
This document discusses the application of machine learning algorithms for fraud detection in the banking sector. It proposes using Classification and Regression Tree (CART), AdaBoost, LogitBoost, and Bagging algorithms to classify banking data and detect fraud. An experiment is conducted to analyze the performance of these algorithms on a banking data set. The results show that the Bagging algorithm has the lowest misclassification rate, indicating it performs better than the other algorithms at classifying banking data for fraud detection. In conclusion, the Bagging algorithm is deemed the best performing of the meta-learning algorithms analyzed for fraud detection in banking data.
Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method

Akhil Bandhu Hens (Department of Mathematics, Indian Institute of Technology, Kharagpur, India)
Manoj Kumar Tiwari, corresponding author (Department of Industrial Engineering and Management, Indian Institute of Technology, Kharagpur, India; mkt09@hotmail.com)

Expert Systems with Applications 39 (2012) 6774–6781

Keywords: Support vector machine; Credit scoring; F score; Stratified sampling

Abstract
With the rapid growth of the credit industry, a credit scoring model has great significance for issuing a credit card to an applicant with minimum risk, so credit scoring is very important in financial firms such as banks. A model is established from previous data, and from that model a decision is taken on whether an applicant will be granted a loan or credit card or will be rejected. There are several methodologies for constructing a credit scoring model, e.g., neural network models, statistical classification techniques, genetic programming, and support vector models. Computational time for running a model has great importance in the 21st century: algorithms or models with less computational time are more efficient and thus yield more profit to banks or firms. In this study, we propose a new strategy to reduce the computational time for credit scoring. In this approach we use an SVM incorporated with feature reduction based on the F score, taking a sample instead of the whole dataset to create the credit scoring model. We ran our method on two real datasets to assess the performance of the new method and compared its results with results obtained from other well-known methods. The new credit scoring method is shown to be very competitive with the other methods in accuracy while requiring less computational time.

© 2011 Elsevier Ltd. All rights reserved.
1. Introduction

Recently the credit industry has been growing rapidly along with the financial and banking sectors. The influence of the credit industry can be seen in the consumer markets of developing and developed countries alike, and competition in the credit industry is intense, especially in emerging economies. Many financial firms face tight competition in the market with respect to smooth and efficient service and greater benefits to customers. The credit card is a service that nearly every customer wants: credit cards make financial transactions a little easier, and the number of applications for credit cards keeps increasing. Every financial firm wants to give better service to its customers and make a large profit. But there is a constraint on issuing more credit cards without scrutiny: some applicants are fraudulent and can misuse the cards. For banking institutions, loans are often the primary source of credit risk, and credit scoring models are used to evaluate this risk. Banks also want to issue more credit cards to increase their profits while minimizing risk, yet some applications have been found to be fraudulent. Credit scoring models have therefore been used extensively for credit admission evaluation, and many quantitative methods have been developed to predict credit more accurately. Credit scoring models help to categorize applicants into two classes: in one case the applicants are accepted, and in the other they are rejected. During the application procedure, various information about the applicant is demanded, such as income, bank balance, profession, family background, and educational background. Not all of these characteristics are quantitative; the qualitative characteristics are converted into quantitative ones by standard procedures for the sake of computation. These characteristics are needed for credit scoring model specification. The objectives of credit scoring models are to reduce the cost of credit analysis and to make credit decision making faster and more efficient.

Recently, many researchers have worked on increasing the efficiency of credit scoring models and reducing their computation time. As a result, various methods for credit decision making have been proposed; these approaches help to detect fraud, assess creditworthiness, etc. Baesens et al. (2003) presented various classification techniques for credit scoring. Effective credit analysis also helps to issue loans with minimum risk. The previous historical data of credit card applicants are necessary to create a credit scoring model: the model is built from previous data consisting of the applicants' information and the decisions taken on their applications. Then this model is applied to a new applicant for credit cards.
From this model, the creditor makes a decision on whether a new applicant will be accepted or rejected. As the accuracy of the credit scoring model increases, the creditor's risk decreases.
With the increasing importance of credit scoring models, this field has attracted many researchers. The filter approach yields better results for credit scoring datasets than the wrapper approach (Somol, Baesens, Pudil, & Vanthienen, 2005). Many methods for credit scoring have been developed, and the computation of some models takes a very long time. As a consequence, research continues on reducing the computational cost and increasing the overall efficiency of credit scoring models. If much time is needed to create a credit scoring model, it is unworthy and reduces the profit of the firm; moreover, the efficiency of the firm decreases. Computational time thus has great significance for a credit scoring model: the lower the computational time, the more efficient the model. In this study a new strategy is proposed that reduces the computational time for credit scoring. A creditor uses many characteristics of the applicants to create a credit scoring model, but some characteristics may not be as useful as others, and including them automatically increases the computational time. Sometimes the previous data are very large in volume, and if the creditor takes the whole dataset, the computational time will be very high. In this paper, we try to reduce the computational time by reducing the size of the dataset considered for the computation and by optimizing the characteristics (i.e., considering only the suitable characteristics among all of them), while also keeping the deviation of our model's result small relative to the result obtained from the whole dataset with all characteristics.
Many modern data mining techniques are used for credit scoring models. Over the last two decades researchers have developed numerous data mining tools and statistical methods for credit scoring. Some of these methods are linear discriminant models (Reichert, Cho, & Wagner, 1983), logistic regression models (Henley, 1995), k-nearest neighborhood models (Henley & Hand, 1996), neural network models (Desai, Crook, & Overstreet, 1996; Malhotra & Malhotra, 2002; West, 2000), and genetic programming models (Ong, Huang, & Tzeng, 2005; Koza, 1992). Chen and Huang (2003) presented work for the credit industry that demonstrates the advantages of neural networks and genetic algorithms for credit analysis. Each study has shown comparative results against other methods. Neural network models have been found more effective in credit risk prediction than other methods (Tam & Kiang, 1992). Logistic regression and the k-nearest neighborhood method for credit scoring also have great significance in terms of accuracy and efficiency. Consequently, neural network models are treated as the benchmark with respect to accuracy of solutions. But to get a good solution for a large dataset, more hidden layers are required in the neural network model; consequently, the computational time is very high and the efficiency is low.

Recent research has focused on increasing the accuracy of credit scoring models and developing more advanced methods. For example, Hoffmann, Baesens, Martens, Put, and Vanthienen (2002) suggested a neuro-fuzzy and a genetic fuzzy classifier. An integrated model based on clustering and neural networks for credit scoring was suggested by Hsieh (2005; Garson, 1991; Zhang, 2000). A hybrid system with artificial neural networks and multivariate adaptive regression splines was proposed by Lee and Chen (2005). There are more hybrid models: Lee, Chiu, Lu, and Chen (2002) suggested a hybrid credit scoring model with neural networks and discriminant analysis. Recent work involves integrating various artificial intelligence methods into the data mining approach to increase accuracy and flexibility.
There are many data mining and statistical approaches to classification. The support vector machine (SVM) is an important data mining tool used for clustering, classification, etc. The support vector machine was first proposed by Vapnik (1995); since then, research has applied the SVM in a wide range of applications, such as clustering of data, pattern recognition, text categorization, and biostatistics. A simple decomposition method for the support vector machine is illustrated by Hsu, Chang, and Lin (2003). Classification is essential in a credit scoring model, and the SVM model is also used in credit scoring for decision making. In recent years there have been comparative as well as hybrid studies between SVM and other computational approaches. A comparative study between SVM and a neural network model was done by Huang, Chen, Hsu, Chen, and Wu (2004). An integrated SVM model with a genetic algorithm was suggested by Huang, Chen, and Wang (2007); Fröhlich and Chapelle (2003); Pontil and Verri (1998). They showed that the results are comparable to benchmark methods.
Previous studies on credit scoring discussed the accuracy of the model; very few, if any, discussed computational time. The 21st century is the era of high performance computing, and everybody is conscious of a method's computational time: any method with low computational time is much more efficient and thus yields more profit to the company. There is no concrete study that incorporates a sampling methodology into SVM for a credit scoring model together with the F score (Weston et al., 2001). In this study a new approach is proposed for reducing computational time: we take a stratified sample, build the credit scoring model with SVM and the F score, and then compare it against the computation on the whole dataset with all characteristics.

This study incorporates the reduction of unnecessary features via the calculation of the F score. Reducing features based on calculations from the sample data takes less time than reducing features based on calculations from the whole dataset (Kohavi & John, 1997). Here the computational time is reduced for two reasons: eliminating the unnecessary features and taking a sample instead of the whole dataset. It is shown in this paper that reducing features and using a stratified sample gives accuracy similar to other benchmark methods, while the computational time decreases significantly due to the reduction of features and of the size of the data.
The paper is organized as follows: Section 2 briefly describes the support vector model. The concepts of sampling and stratified sampling are described in Section 3. In Section 4, the new strategy for reducing computation time is discussed. In Section 5, empirical results for real datasets are shown and a comparative study is presented. In Section 6, remarks and conclusions are drawn.
2. Concepts of support vector machine (SVM) classifier

The SVM classifier was first proposed by Vapnik (1995). Here we illustrate the concept of the SVM and its application as a two-class classifier. Basically, the SVM is a supervised learning method that analyzes data and recognizes patterns, used for statistical classification and regression analysis.

Suppose we are given a dataset of pairs \((x_i, d_i)\), \(i = 1, 2, 3, \ldots, n\), where \(x_i \in \mathbb{R}^p\) and \(d_i \in \{-1, +1\}\). The value of \(d_i\) indicates the class to which the point \(x_i\) belongs; each \(x_i\) is a \(p\)-dimensional real vector. From these points we can find the maximum-margin hyperplane that divides the points having \(d_i = +1\) from those having \(d_i = -1\); for this reason the SVM is also known as a maximum-margin classifier. Any hyperplane can be written as the set of points \(x\) satisfying

\[ r \cdot x - b = 0 \qquad (1) \]

where \(\cdot\) denotes the dot product and \(r\) is a normal vector perpendicular to the hyperplane.
The offset of the hyperplane from the origin along the normal vector \(r\) is determined by the parameter \(b / \lVert r \rVert\). We have to choose \(r\) and \(b\) to maximize the margin, defined as the distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes are expressed by the equations

\[ r \cdot x - b = 1 \qquad (2) \]

and

\[ r \cdot x - b = -1 \qquad (3) \]

Our main objective is to

\[ \text{minimize } \lVert r \rVert \quad \text{subject to} \quad d_i (r \cdot x_i - b) \ge 1 \text{ for } i = 1, 2, \ldots, n \qquad (4) \]
Executing the optimization problem in (4) requires a square-root computation to find the norm, which can be difficult. For this reason, the optimization problem in (4) is changed to

\[ \min_{r, b} \; \tfrac{1}{2} r^{T} r \quad \text{subject to} \quad d_i (r \cdot x_i - b) \ge 1 \text{ for } i = 1, 2, \ldots, n \qquad (5) \]

The problem in (5) is clearly a quadratic optimization problem, and its saddle point has to be found. The optimization problem in (5) can also be written as

\[ \min_{r, b} \, \max_{\alpha} \left\{ \tfrac{1}{2} r^{T} r - \sum_{i=1}^{n} \alpha_i \bigl( d_i (r \cdot x_i - b) - 1 \bigr) \right\} \qquad (6) \]

where the \(\alpha_i\) are Lagrange multipliers, hence \(\alpha_i \ge 0\). The solution of this optimization problem can be expressed as a linear combination of the training vectors:

\[ r = \sum_{i=1}^{n} \alpha_i d_i x_i \qquad (7) \]
For the sake of computation we differentiate with respect to \(r\) and \(b\), and we introduce the Karush–Kuhn–Tucker (KKT) conditions (Mao, 2004). As a result, the expression in (7) can be transformed to the dual Lagrangian \(L_D(\alpha)\):

\[ \max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j d_i d_j \langle x_i \cdot x_j \rangle \qquad (8) \]

subject to \(\alpha_i \ge 0\) for \(i = 1, 2, \ldots, n\) and \(\sum_{i=1}^{n} \alpha_i d_i = 0\).
The solution \(\alpha_i\) determines the parameters \(r^{*}\) and \(b^{*}\) of the optimal hyperplane of the dual optimization problem. Thus we can express the optimal hyperplane decision function as

\[ f(x; \alpha^{*}, b^{*}) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} d_i \alpha_i^{*} \langle x_i \cdot x \rangle + b^{*} \right) \qquad (9) \]

In reality, only a small number of Lagrange multipliers tend to be positive, and the corresponding vectors lie in the proximity of the optimal hyperplane. The training vectors associated with these positive Lagrange multipliers are the support vectors, and the optimal hyperplane \(f(x; \alpha^{*}, b^{*})\) depends on these support vectors.
The above concept extends to the nonseparable case (i.e., the linear generalized support vector machine). To keep the number of training errors to a minimum, the problem of finding the hyperplane takes the following form:

\[ \min_{r, b, \xi} \; \tfrac{1}{2} r^{T} r + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad d_i (\langle r \cdot x_i \rangle + b) + \xi_i - 1 \ge 0 \; \text{and} \; \xi_i \ge 0 \qquad (10) \]

where the \(\xi_i\) are positive slack variables and \(C\) is a penalty parameter for training error during validation on the test data set.
The optimization problem in (10) can be solved with the help of the Lagrangian method. The corresponding dual problem can be written as

\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j d_i d_j \langle x_i \cdot x_j \rangle \quad \text{subject to} \quad 0 \le \alpha_i \le C \text{ for } i = 1, 2, \ldots, n \; \text{and} \; \sum_{i=1}^{n} \alpha_i d_i = 0 \qquad (11) \]

The user-defined penalty parameter \(C\) is the upper bound of the \(\alpha_i\).
The nonlinear support vector machine maps the training samples from the input space into a higher-dimensional feature space via a mapping function \(\Phi\). The inner product is then replaced by a kernel function (Scholkopf & Smola, 2000):

\[ (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j) \qquad (12) \]

The problem is then transformed to

\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j d_i d_j \, k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C \text{ for } i = 1, 2, \ldots, n \; \text{and} \; \sum_{i=1}^{n} \alpha_i d_i = 0 \qquad (13) \]
For this generalized case, the decision function takes the form

\[ f(x; \alpha^{*}, b^{*}) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} d_i \alpha_i^{*} \, k(x_i, x) + b^{*} \right) \qquad (14) \]
3. Concept of stratified sampling

Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling. The strata should be mutually exclusive (every element in the population must be assigned to only one stratum) and collectively exhaustive (no population element can be excluded). Random or systematic sampling is then applied within each stratum. This often improves the representativeness of the sample by reducing sampling error, making stratified sampling one of the best sampling methods for representing the characteristics of the data. It has been proved that the variation of the result under stratified sampling is less than under most other sampling processes.
In stratified sampling, the population of \(N\) units is first divided into subpopulations of \(N_1, N_2, \ldots, N_L\) units, respectively. These subpopulations are non-overlapping, and together they comprise the whole of the population, i.e.,

\[ N_1 + N_2 + \cdots + N_L = N \qquad (15) \]

These subpopulations are called strata. When the strata have been determined, a sample is drawn from each, the drawings being made independently in the different strata. The sample sizes within the strata are denoted by \(n_1, n_2, \ldots, n_L\), respectively, such that

\[ n_1 + n_2 + \cdots + n_L = n \qquad (16) \]

where \(n\) is the total sample size.
In this study the proportionate stratified sampling method is utilized. The defining criterion of proportionate stratified sampling is

\[ \frac{n_h}{n} = \frac{N_h}{N} \qquad (18) \]

where \(N_h\) denotes the total number of units in the \(h\)th stratum and \(n_h\) denotes the number of units drawn from the \(h\)th stratum as a sample. From Eq. (18) it is clear that if we take the \(N_h\)
to be almost equal for all strata, then all the \(n_h\) will be nearly equal in magnitude.
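As a quick worked example of Eq. (18), with invented numbers not from the paper: suppose a population of \(N = 1000\) applicants is divided into three strata with \(N_1 = 500\), \(N_2 = 300\), and \(N_3 = 200\), and a sample of \(n = 100\) is desired. Proportionate allocation \(n_h = n N_h / N\) gives \(n_1 = 50\), \(n_2 = 30\), and \(n_3 = 20\).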
4. Reduction of computational time by using F score and sampling approach
4.1. Sample selection from the data

In this study we use the proportionate stratified sampling approach because the dataset is homogeneous; stratified sampling is applicable to homogeneous data and is one of the best-known sampling approaches for creating a sample that represents the characteristics of the dataset. First, we create strata containing almost equal numbers of records, i.e., bins of nearly equal size. Then we randomly select a defined percentage of records from each stratum and put the selected records together to generate the sample. This is the method for generating a sample using stratified sampling. Here we likewise took the previous dataset, made strata of almost equal sizes, and created the sample using the stratified sampling approach.
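The following is a minimal sketch of this proportionate selection step (illustrative only; the DataFrame, the "stratum" column, and the 25% fraction used later in Section 5 are assumptions):

```python
# Proportionate stratified sampling sketch using pandas (illustrative).
# Assumes a DataFrame `data` with a "stratum" column already assigned
# by binning the records into nearly equal-sized groups.
import pandas as pd

def stratified_sample(data: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    # Draw the same fraction from every stratum, so n_h / n = N_h / N (Eq. 18).
    return (
        data.groupby("stratum", group_keys=False)
            .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# e.g. the 25% samples used for the Australian and German datasets in Section 5:
# sample = stratified_sample(data, frac=0.25)
```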
4.2. Sorting the input features using F score

The F score computation was first proposed by Chen and Lin (2005). It is a very simple technique, but it is very efficient for measuring the discrimination between two sets of real numbers. First, we find the F score of every feature from the sample. Suppose the training vectors are \(x_k\) \((k = 1, 2, \ldots, m)\), with certain numbers of positive and negative instances, denoted \(s_{+}\) and \(s_{-}\) respectively. Let \(\bar{x}_i\) denote the average of the \(i\)th feature over the whole dataset, and let \(\bar{x}_i^{(+)}\) and \(\bar{x}_i^{(-)}\) denote the averages of the \(i\)th feature over the positive and negative instances, respectively. The \(i\)th feature of the \(k\)th positive instance and of the \(k\)th negative instance are denoted \(x_{k,i}^{(+)}\) and \(x_{k,i}^{(-)}\), respectively. Then the F score of every feature can be computed as

\[ F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{s_{+}-1} \displaystyle\sum_{k=1}^{s_{+}} \left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{s_{-}-1} \displaystyle\sum_{k=1}^{s_{-}} \left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \qquad (19) \]

In the expression above, the numerator captures the discrimination between the positive and negative sets, while the denominator is the sum of the sample variances of the positive and negative sets. A larger F score implies that the corresponding feature is more discriminative. Using this expression we can easily compute the F score of every feature from the sample of the credit dataset.
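Eq. (19) translates almost line for line into code. The sketch below (an illustration; the array names are assumptions) computes the F score of every feature at once:

```python
# F score of each feature, following Eq. (19) (illustrative sketch).
import numpy as np

def f_scores(X: np.ndarray, d: np.ndarray) -> np.ndarray:
    """X: (n_samples, n_features); d: labels in {-1, +1}."""
    pos, neg = X[d == 1], X[d == -1]
    mean_all = X.mean(axis=0)
    mean_pos, mean_neg = pos.mean(axis=0), neg.mean(axis=0)
    # Numerator: discrimination between positive and negative class means.
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the two sample variances (ddof=1 gives 1/(s-1)).
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numerator / denominator
```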
4.3. Reduction of features
Suppose there are m features. Not all features contribute to the credit scoring decision, and considering unnecessary features makes the computation take a long time, which is wasteful. Deleting those unnecessary features from the dataset makes the computation faster and reduces its complexity. We sort the features in descending order of the F score calculated from the sample and rank them from 1 to m. We then select a cut-off feature by the procedure described below (a code sketch of this procedure follows the steps); after finding the cut-off feature, only the features ranked from 1 up to the cut-off are kept for further calculation. The cut-off feature is found by the following procedure:
Step 1. Calculate the F score of every feature from the sample.
Step 2. Sort the F scores in descending order and rank the features from 1 to m.
Step 3. Select the features from rank 1 to rank k (k = 1, 2, ..., m). For each k, go to Step 4.
Step 4.
(a) Randomly split the training data into training and validation sets using 10-fold cross validation. For each fold, the following steps are done.
(b) For a particular fold, let Dt be the training data and Dv the validation data. Using the SVM procedure, a predictor is fitted on Dt; the obtained predictor is then applied to Dv to measure performance.
(c) Calculate the average validation accuracy over the 10 folds and store this result.
Step 5. Plot all the results.
Step 6. Analyze the plot carefully: find the point from which onwards the fluctuations in accuracy are small and the accuracy is close to the accuracy obtained with all features. That feature is taken as the cut-off feature. Features ranked after the cut-off are not considered in further SVM model calculations, as they do not influence the result.
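A compact sketch of Steps 2–4 (illustrative; it reuses the f_scores helper sketched in Section 4.2 and assumes scikit-learn):

```python
# Evaluate 10-fold CV accuracy of the SVM for each prefix of the
# F-score ranking (Steps 2-4 above); illustrative sketch.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def prefix_accuracies(X: np.ndarray, d: np.ndarray, scores: np.ndarray) -> list:
    order = np.argsort(scores)[::-1]       # Step 2: ranks 1..m by descending F score
    accs = []
    for k in range(1, X.shape[1] + 1):     # Step 3: top-k features for each k
        Xk = X[:, order[:k]]
        acc = cross_val_score(SVC(kernel="rbf", C=1.0), Xk, d, cv=10).mean()
        accs.append(acc)                   # Step 4(c): average validation accuracy
    return accs                            # Step 5: plot these to pick the cut-off
```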
Henceforth, if one gets another dataset of a similar type, the unnecessary features can easily be dropped. The whole SVM computation integrated with the F score ranking is done on the sample, whose size is to be determined by the creditor. We have observed that the computational time is reduced significantly. To verify the result obtained from the new method described in this paper, we have done the calculation on the whole dataset using all features. It is shown that the accuracy obtained by our model is almost the same as the accuracy obtained from the computation on the whole dataset with all features, for both datasets. Thus the computation becomes faster and easier, with less computational time.
5. Empirical analysis
5.1. Credit data sets
We use two real-world datasets for the analysis in this study: the Australian dataset and the German dataset, both from the UCI Repository of machine learning databases. The German dataset has 1000 instances, each with 24 attributes and one class attribute; 700 instances of the whole dataset are creditworthy applicants and 300 are not. The Australian dataset has 690 instances, each with 6 nominal attributes, 8 numeric attributes, and one class attribute; 307 instances are creditworthy and the remaining 383 are not. Later we will see some interesting results from the analysis of these two datasets. The characteristics of the Australian and German datasets are presented in Table 1; the German dataset is more unbalanced than the Australian dataset (see Table 1).
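Both datasets are publicly available. One hedged way to load them, assuming scikit-learn and the OpenML mirrors of the UCI data (the dataset names and versions below are assumptions), is:

```python
# Loading the two credit datasets (illustrative; dataset names and
# versions refer to OpenML mirrors of the UCI data and are assumptions).
from sklearn.datasets import fetch_openml

# Note: OpenML's "credit-g" is the 20-attribute categorical version of the
# German data; the paper used the 24-attribute numerical encoding.
german = fetch_openml("credit-g", version=1, as_frame=True)
australian = fetch_openml("Australian", version=4, as_frame=True)
print(german.data.shape, australian.data.shape)   # (1000, 20), (690, 14)
```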
5.2. Experimental result for Australian dataset
In this study, we have taken a 25% stratified sample of the Australian dataset, which has 14 features. First we calculated the F score of each attribute from the sample; we also calculated the F score for each attribute from the whole
dataset. We can see from Table 2 that the F scores for the sample and the whole dataset are almost the same, and the rankings of the features in descending order are almost the same in both cases.

Fig. 1 shows the F scores for both the actual and the sample data. It is clear from Fig. 1 that the F scores for the actual data and the sample data are almost the same, so the discrimination of the attributes in the actual dataset and the sample dataset is similar. Not all features are necessary for the credit scoring computation: some features have little importance in the model, and we can eliminate them to reduce the computational time, since their elimination changes the accuracy of the result only negligibly.
First, we sorted the features in descending order of their F scores from the sample and ranked them from 1 to 14. Then, for each k (k = 1, 2, 3, ..., 14), we took the k top-ranked features, performed 10-fold cross validation with the support vector model procedure, and evaluated the average accuracy rate. The average accuracy rate for each case is then plotted; the accuracy for each case is shown in Table 3. From the shape of the plot we can decide the cut-off feature: we choose the feature from which onwards the average SVM accuracy rate no longer fluctuates much. From Fig. 2, we select the cut-off feature of the Australian dataset at rank 9, so we do not consider the features of rank 10 and onwards, as adding them does not affect the average accuracy rate of the SVM computation. This reduction of features reduces the computational time significantly, since the runtime complexity of the SVM procedure is more than linear (i.e., \(O(n^{k})\) with \(k > 1\), where \(n\) is the number of data points in the dataset). We verified the runtime in MATLAB and saw that it decreases significantly. To verify the cut-off feature obtained from the stratified sample, we performed the same procedure on the whole dataset, this time ranking by the F score computed from the actual data rather than from the sample. We then plotted (in Fig. 2) the average accuracy rate of the SVM procedure. It is clear from the plot that the accuracy rates from rank 9 onwards are almost the same, with little variance, so we do not consider features 10 and onwards in our computation.
5.3. Experimental result for German dataset
As with the Australian dataset, we took a 25% stratified sample of the German dataset, which has 24 features. First we calculated the F score of each attribute from the sample, and we also calculated the F score for each attribute from the whole dataset. We can see from Table 4 that the F scores for the sample and the whole dataset are almost the same, and the rankings of the features in descending order are again almost the same in both cases.

Fig. 3 shows the F scores for both the actual and the sample data; it is clear from Fig. 3 that they are almost the same, so the discrimination of the attributes in the actual and sample datasets is similar. Not all features are necessary for the credit scoring computation: some features have little importance in the model, and variation in their values has little impact. These unnecessary features are excluded to reduce the computational time, since eliminating them changes the accuracy of the result only negligibly.
First, we sorted the features in descending order of their F scores from the sample and ranked them from 1 to 24. Then, for each k (k = 1, 2, 3, ..., 24), we took the k top-ranked features, performed 10-fold cross validation with the support vector model procedure, and evaluated the average accuracy rate; the results are shown in Table 5. We then plotted the average accuracy rate for each case. From the shape of the plot we can decide the cut-off feature: we choose the feature from which onwards the average SVM accuracy rate no longer fluctuates much and is almost the same as the accuracy obtained when all features are considered. Table 6 lists the features to be considered for further computation in the Australian and German datasets.
From Fig. 4, we select the cut-off feature of the German dataset at rank 13, so we do not consider the features of rank 14 and onwards, as adding them does not affect the average accuracy rate of the SVM computation. This reduction of features again reduces the computational time significantly, since the runtime complexity of the SVM procedure is more than linear (i.e., \(O(n^{k})\) with \(k > 1\), where \(n\) is the number of data points in the dataset). We verified the runtime in MATLAB and saw that it decreases significantly. To verify the cut-off feature of our
Table 1
Characteristics of the two real datasets.

Data set    No. of classes  No. of instances  Nominal features  Numeric features  Total features
Australian  2               690               6                 8                 14
German      2               1000              0                 24                24
Table 2
F score for the features in the Australian dataset for the sample data and the whole dataset.

Feature      1       2       3       4       5       6       7
Actual data  0.0002  0.0269  0.0443  0.0411  0.1662  0.0658  0.1110
Sample data  0.0023  0.1014  0.0059  0.0438  0.1723  0.1209  0.1338

Feature      8       9       10      11      12      13      14
Actual data  1.1506  0.2683  0.1835  0.0010  0.0141  0.0104  0.0290
Sample data  1.2468  0.2777  0.1631  0.0000  0.0194  0.0115  0.0653
Fig. 1. F score plot for sample data and actual data of Australian dataset.
result in the stratified sample, we again performed the same procedure on the whole dataset, this time ranking by the F score computed from the actual data rather than from the sample. We then plotted the average accuracy rate of the SVM procedure. It is clear from the plot that the accuracy rates from rank 13 onwards are almost the same, with little variance, so we do not consider features 14 and onwards in our computation.
5.4. Comparison to other methods in computational time
In previous studies, several approaches such as genetic programming, neural networks, support vector models, decision trees, and SVM-based genetic algorithms have been applied to credit scoring (Quinlan, 1986), and all of these models perform well. With a neural network model, one may not get the expected result the first time; to get the expected result, the number of hidden layers has to be optimized. The larger the number of hidden layers, the more flexible the neural network model, because the network has more parameters to optimize; consequently, the accuracy level will be high. But with
Table 3
Accuracy rate of the SVM method for sample data and whole data of the Australian dataset.

No. of features in the sorted list  1       2       3       4       5       6       7
Sample data                         0.8753  0.8378  0.8225  0.8661  0.8851  0.8349  0.8669
Actual data                         0.8530  0.8716  0.8623  0.8667  0.8598  0.8554  0.8559

No. of features in the sorted list  8       9       10      11      12      13      14
Sample data                         0.8661  0.8598  0.8487  0.8536  0.8631  0.8516  0.8683
Actual data                         0.8521  0.8723  0.8701  0.8658  0.8693  0.8619  0.8710
Fig. 2. Accuracy for the sample data and the actual data of the Australian dataset.
Table 4
F-scores of the features for the sample data and the actual data of the German dataset.

Feature       1        2        3        4        5        6        7        8
Sample data   0.2022   0.0153   0.0869   0.0197   0.0336   0.0475   0.0033   0.0015
Actual data   0.2055   0.0619   0.0752   0.0286   0.0498   0.0186   0.0105   0

Feature       9        10       11       12       13       14       15       16
Sample data   0.0131   0.0004   0.0177   0.0183   0.0119   0.0001   0.0009   0.0389
Actual data   0.0285   0.0116   0.0156   0.0029   0        0.0018   0.0114   0.0124

Feature       17       18       19       20       21       22       23       24
Sample data   0.0314   0.0098   0.0016   0.0113   0.0167   0.0086   0.0014   0.0055
Actual data   0.0157   0        0.0048   0.0112   0.0242   0        0.0007   0.0003
Fig. 3. Plot of the F-scores from the actual data and the sample data of the German dataset.
However, as the number of hidden layers increases, the runtime also becomes very high; hence, for a large database, credit scoring based on a neural network model takes a long time. Genetic programming is also slow: the function set must first be initialized, while the terminal set is obtained from the dataset. A function set containing many functions gives more flexibility, and thus higher accuracy, but with a large function set and a large dataset the computational time becomes very high. Moreover, both GP and BPN are stochastic processes, so they require long computational times.
The support vector machine, by contrast, solves a quadratic optimization problem, and the user does not have to specify the model structure beforehand; its runtime is faster than that of the BPN and GP models, while the GA-based SVM model (Huang et al., 2005) is more time-costly than the simple SVM. The time complexity of the SVM model is worse than linear, i.e. O(n^k) with k > 1, where n is the number of data points in the dataset. In this study, the runtime decreased significantly because we take sample data instead of the whole dataset and remove unnecessary features using the F-score; the reduction in runtime when executing the SVM model with fewer data points and fewer features can easily be observed in MATLAB. We eliminated the features using F-scores calculated from the sample rather than from the whole dataset. Overall, the runtime is lower than that of the other methods, and the procedure discussed in this paper produces accuracy similar to the other benchmark methods. A rough timing sketch is given below.
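For readers who want to reproduce the effect (the paper's timings were taken in MATLAB), a minimal Python sketch follows; the synthetic data, the 30% sample fraction, and the 13-feature count are our illustrative assumptions, and absolute times will vary with hardware and implementation.

```python
# Illustrative timing of SVM training on full vs. reduced data.
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))  # German-sized synthetic data (assumption)
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

def fit_time(X, y):
    """Wall-clock time to train an RBF-kernel SVM once."""
    t0 = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    return time.perf_counter() - t0

idx = rng.choice(len(X), size=300, replace=False)  # a 30% sample (assumption)
print("full data, 24 features:", fit_time(X, y))
print("sample,    13 features:", fit_time(X[idx][:, :13], y[idx]))
```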
5.5. Comparison to other methods in accuracy
We then measured the accuracy rates of the four procedures and compared our sampling-based SVM procedure with the other methods. The comparative results for the two datasets are shown in Tables 7 and 8.
Table 5
Accuracy rate of the SVM for the actual data and the sample data from the German dataset.

No. of features in the sorted list   1        2        3        4        5        6        7        8
Sample data                          0.7281   0.7314   0.7815   0.7753   0.8042   0.7687   0.7829   0.7775
Actual data                          0.6912   0.7129   0.7499   0.7517   0.7526   0.7562   0.7689   0.7570

No. of features in the sorted list   9        10       11       12       13       14       15       16
Sample data                          0.7747   0.7674   0.7513   0.7675   0.7508   0.7431   0.7519   0.7423
Actual data                          0.7581   0.7657   0.7632   0.7670   0.7674   0.7560   0.7613   0.7736

No. of features in the sorted list   17       18       19       20       21       22       23       24
Sample data                          0.7611   0.7468   0.7608   0.7478   0.7537   0.7637   0.7586   0.7560
Actual data                          0.7739   0.7768   0.7591   0.7689   0.7651   0.7521   0.7732   0.7716
Table 6
Features to be considered for further computation in the two datasets.

Dataset              Total features   Features retained for further computation
Australian dataset   14               9
German dataset       24               13
Fig. 4. Plot of accuracy for the sample and actual data of the German dataset.
Table 7
Comparative study of various methods on the Australian dataset.

Method                      Accuracy rate (%)   Std (%)
Sampling procedure in SVM   85.98               3.89
SVM + GA                    86.76               3.78
BPN                         86.71               3.41
GP                          86.83               3.96
Table 8
Comparative study of various methods on the German dataset.

Method                      Accuracy rate (%)   Std (%)
Sampling procedure in SVM   75.08               3.92
SVM + GA                    76.84               3.82
BPN                         76.69               3.24
GP                          77.26               4.06
It is clear from these tables that our fast method is highly competitive: its average accuracy is almost the same as that of the other methods.
6. Conclusion
Credit scoring is essential for every bank when deciding whether to accept a new applicant, and the credit industry is growing rapidly. Banks want to increase profit by issuing more credit cards, subject to the constraint of minimum risk, i.e. minimizing the number of potentially fraudulent applications. A credit scoring model helps to classify a new applicant as accepted or rejected. There are various methods for building such a model, including traditional statistical techniques, neural networks, support vector machines and genetic programming. Statistical techniques require the creditor to know the relationships among the various features, whereas methods such as GP, BPN and SVM do not assume any underlying relationship among the features. For GP, the creditor has to specify the function set, and different function sets may give different results. For BPN, the creditor has to specify the structure of the neural network: the more hidden layers, the greater the flexibility, and consequently the higher the computational time. BPN and GP are stochastic processes, so they require a large amount of computational time. SVM, by comparison, is a quadratic optimization problem, so the creditor does not have to specify the model structure for the computation. The computational time of SVM is lower than that of GP; it may sometimes be high for a large dataset, but it is still lower than the computational time of GP.
In this paper, we have suggested a hybrid approach combining sampling, SVM and the F-score. Our main objective was to study the new approach and to compare its computational time and accuracy with those of other methods. We summarize the whole method here. From the whole dataset, the creditor takes a stratified sample, since stratified sampling is one of the sampling methods that best represents the characteristics of the dataset (a minimal sketch of this step is given below). The determination of the sample size is very important: it must be chosen so that the study of the sample closely reproduces the result obtained from the study of the whole dataset, and the creditor therefore has to set the sample size, depending on the size of the dataset, so as to reduce the deviation from the computational result on the actual data. After calculating the F-scores from the sample, the features are sorted in descending order of F-score; the F-score signifies the importance of each feature in the dataset. The SVM procedure is then performed as described in Section 4. From the resulting plot one can easily eliminate the features that are not needed for further calculation. Initially, the differences in accuracy between consecutive feature counts are larger than they are later, because there may be correlations between features: a still-unselected feature may be correlated with an already-selected one. After a certain number of steps, however, the results become comparatively stable, and from that step onwards we can identify the unnecessary features, which only increase the runtime of the computation. Eliminating features on the basis of the sample is also less costly than performing the same computation on the whole dataset. Thus the new credit scoring method proposed in this study is computationally efficient as well as reliable in view of its accuracy.
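The paper gives no code for the stratified-sampling step; as a minimal sketch, scikit-learn's train_test_split can draw a class-stratified sample, where the 30% fraction and the use of the class label as the stratification variable are our assumptions:

```python
# Draw a stratified sample that preserves the good/bad class proportions.
from sklearn.model_selection import train_test_split

X_sample, _, y_sample, _ = train_test_split(
    X, y,
    train_size=0.3,   # sample fraction (assumption)
    stratify=y,       # keep class proportions as in the full dataset
    random_state=42,
)
```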
References
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J.
(2003). Benchmarking state-of-the-art classification algorithms for credit
scoring. Journal of the Operational Research Society, 54(6), 627–635.
Chen, Y.-W., Lin, C.-J. (2005). Combining SVMs with various feature selection
strategies. Available from http://www.csie.ntu.edu.tw/~cjlin/papers/features.pdf.
Chen, M. C., Huang, S. H. (2003). Credit scoring and rejected instances reassigning
through evolutionary computation techniques. Expert Systems with Applications,
24(4), 433–441.
Desai, V. S., Crook, J. N., Overstreet, G. A. (1996). A comparison of neural networks
and linear scoring models in the credit union environment. European Journal of
Operational Research, 95(1), 24–37.
Fröhlich, H., Chapelle, O. (2003). Feature selection for support vector machines by
means of genetic algorithms. In Proceedings of the 15th IEEE international
conference on tools with artificial intelligence, Sacramento, California, USA, (pp.
142–148).
Garson, G. D. (1991). Interpreting neural-network connection weights. AI Expert,
6(4), 47–51.
Henley, W. E. (1995). Statistical aspects of credit scoring. Dissertation, The Open
University, Milton Keynes, UK.
Henley, W. E., Hand, D. J. (1996). A k-nearest neighbor classifier for assessing
consumer credit risk. Statistician, 44(1), 77–95.
Hoffmann, F., Baesens, B., Martens, J., Put, F., Vanthienen, J. (2002). Comparing a
genetic fuzzy and a neurofuzzy classifier for credit scoring. International Journal
of Intelligent Systems, 17(11), 1067–1083.
Hsieh, N.-C. (2005). Hybrid mining approach in the design of credit scoring models.
Expert Systems with Applications, 28(4), 655–665.
Hsu, C. W., Chang, C. C., Lin, C. J. (2003). A practical guide to support vector
classification. Available from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., Wu, S. (2004). Credit rating analysis
with support vector machines and neural networks: A market comparative
study. Decision Support Systems, 37(4), 543–558.
Huang, C. L., Chen, M. C., Wang, J. W. (2007). Credit scoring with a data mining
approach based on support vector machines. Expert Systems with Applications,
33(4), 847–856.
Kohavi, R., John, G. (1997). Wrappers for feature subset selection. Artificial
Intelligence, 97(1–2), 273–324.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means
of Natural Selection. Cambridge, MA: The MIT Press.
Lee, T.-S., Chen, I.-F. (2005). A two-stage hybrid credit scoring model using
artificial neural networks and multivariate adaptive regression splines. Expert
Systems with Applications, 28(4), 743–752.
Lee, T.-S., Chiu, C.-C., Lu, C.-J., Chen, I.-F. (2002). Credit scoring using the hybrid
neural discriminant technique. Expert Systems with Applications, 23(3), 245–254.
Malhotra, R., Malhotra, D. K. (2002). Differentiating between good credits and bad
credits using neuro-fuzzy systems. European Journal of Operational Research,
136(1), 190–211.
Mao, K. Z. (2004). Feature subset selection for support vector machines through
discriminative function pruning analysis. IEEE Transactions on Systems, Man, and
Cybernetics, 34(1), 60–67.
Ong, C.-S., Huang, J.-J., Tzeng, G.-H. (2005). Building credit scoring models using
genetic programming. Expert Systems with Applications, 29(1), 41–47.
Pontil, M., Verri, A. (1998). Support vector machines for 3D object recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6), 637–646.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Reichert, A. K., Cho, C. C., Wagner, G. M. (1983). An examination of the conceptual
issues involved in developing credit-scoring models. Journal of Business and
Economic Statistics, 1(2), 101–114.
Schölkopf, B., Smola, A. J. (2000). Statistical Learning and Kernel Methods.
Cambridge, MA: MIT Press.
Somol, P., Baesens, B., Pudil, P., Vanthienen, J. (2005). Filter-versus wrapper-based
feature selection for credit scoring. International Journal of Intelligent Systems,
20(10), 985–999.
Tam, K. Y., Kiang, M. Y. (1992). Managerial applications of the neural networks:
The case of bank failure prediction. Management Science, 38(7), 926–947.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer-
Verlag.
West, D. (2000). Neural network credit scoring models. Computers and Operations
Research, 27(11–12), 1131–1152.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V. (2001).
Feature Selection for SVM. In S. A. Solla, T. K. Leen, K.-R. Muller (Eds.).
Advances in Neural Information Processing Systems (Vol. 13, pp. 668–674).
Cambridge, MA: MIT Press.
Zhang, G. P. (2000). Neural networks for classification: a survey. IEEE Transactions on
Systems, Man, and Cybernetics—Part C: Applications and Reviews, 30(4), 451–462.