This document discusses using a hybrid modeling approach combining decision trees and logistic regression to develop a credit scoring model for retail loans. It presents a case study where this approach was tested on real loan data. Key points:
- A decision tree was first used to identify important borrower characteristics and assign initial weights. Logistic regression was then used to compute odds ratios and assign refined weights within each characteristic.
- This iterative process determined weights for both numeric factors like time at bank and non-numeric factors like occupation. Binning of numeric variables was done automatically using additional decision trees.
- When tested on loan application data, the hybrid model achieved a classification accuracy of around 80%, higher than single-stage models, indicating that the hybrid approach provides an improvement over using any single model alone.
In this paper I compare a conventional classification regression method with the state-of-the-art machine learning technique XGBoost. This results in a major performance gain in terms of classification performance and expected loss.
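The underlying paper is not reproduced on this page; purely as a hedged illustration of the kind of comparison described, the sketch below benchmarks a plain logistic regression against XGBoost on a synthetic stand-in for a credit dataset (the xgboost package, the feature set, and all parameter values are assumptions, not details taken from the paper).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic, imbalanced stand-in for a credit dataset (the original data is not available here).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
booster = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                        eval_metric="logloss").fit(X_tr, y_tr)

# Compare the two models on held-out data.
for name, model in [("logistic regression", baseline), ("XGBoost", booster)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```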
The document discusses supervised versus unsupervised discretization methods for transforming variables in cluster analysis models. It finds that unsupervised, or SAS-defined, transformations generally result in more profitable models compared to supervised, or user-defined, transformations. However, the most profitable transformations can be complex and difficult to explain. There is a tradeoff between profitability and interpretability, known as the "cost of simplicity." The document analyzes different variable transformations applied to a credit risk prediction model to determine which balance of profit and explanation is most appropriate.
This binary classification model uses logistic regression to predict customer credit risk and maximize profit. The dataset was merged and cleaned. The dependent variable was transformed into a binary variable indicating good or bad credit risk. Some variables had coded non-numeric values that were replaced. The model was developed using variable transformations and selections to create a profitable clustering of customers.
Predicting credit defaulters is a perilous task for financial institutions such as banks. Identifying non-payers before granting a loan is an important and contentious task for the banker. Classification techniques are a good choice for this kind of predictive analysis, i.e. determining whether an applicant is an honest customer or a fraud. Selecting the outstanding classifier is a risky assignment for any practitioner such as a banker, which motivates computer science researchers to carry out systematic studies evaluating different classifiers and finding the best one for such predictive problems. This research work investigates the effectiveness of the LADTree and REPTree classifiers for credit risk prediction and compares their fitness through various measures. The German credit dataset has been used to predict credit risk with the help of an open-source machine learning tool.
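LADTree and REPTree are Weka classifiers, and the open-source tool referred to is presumably Weka. scikit-learn has no direct equivalent, so the hedged Python sketch below uses a cost-complexity-pruned decision tree as a rough analogue of REPTree on the German credit data; the OpenML dataset name "credit-g", the pruning value, and the metric choices are assumptions.

```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# German credit data from OpenML ("credit-g" is the assumed dataset name; requires network access).
data = fetch_openml("credit-g", version=1, as_frame=True)
X = pd.get_dummies(data.data)                 # one-hot encode the categorical attributes
y = (data.target == "bad").astype(int)        # 1 = bad credit risk

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

# A cost-complexity-pruned CART as a rough stand-in for Weka's REPTree.
tree = DecisionTreeClassifier(ccp_alpha=0.002, random_state=1).fit(X_tr, y_tr)
pred = tree.predict(X_te)
print("accuracy :", round(accuracy_score(y_te, pred), 3))
print("precision:", round(precision_score(y_te, pred), 3))
print("recall   :", round(recall_score(y_te, pred), 3))
```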
1) The document discusses various data mining techniques applied to student enrollment data to identify prospective students most likely to enroll and inform marketing strategy.
2) Key variables like self-initiated contacts, high school rating, and SAT score were found to be important predictors of enrollment.
3) Different models like decision trees, logistic regression, and neural networks were compared, with forward regression found to have the best performance on the validation data based on its misclassification rate.
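As a hedged sketch of forward selection wrapped around a logistic regression (the enrollment data itself is not available here, so a synthetic stand-in is used and every parameter value is an assumption), scikit-learn's SequentialFeatureSelector can pick predictors step by step, and the resulting model can be judged by its validation misclassification rate:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for enrollment data (e.g. contacts, school rating, SAT score).
X, y = make_classification(n_samples=3000, n_features=12, n_informative=4, random_state=7)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=7)

# Forward stepwise selection of 4 predictors for a logistic regression.
logit = LogisticRegression(max_iter=1000)
forward = SequentialFeatureSelector(logit, n_features_to_select=4, direction="forward")
forward.fit(X_tr, y_tr)

selected = forward.get_support(indices=True)
logit.fit(X_tr[:, selected], y_tr)
misclassification = 1 - logit.score(X_val[:, selected], y_val)
print("selected feature indices:", selected)
print("validation misclassification rate:", round(misclassification, 3))
```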
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...inventionjournals
International Journal of Mathematics and Statistics Invention (IJMSI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJMSI publishes research articles and reviews across the whole field of Mathematics and Statistics, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected for publication through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
Case Study on Placement Solution to AS Business School (Biswadeep Ghosh Hazra...Biswadeep Ghosh Hazra
We had to make a relevant placement strategy for a B-school taking into account a multitude of factors. The data set includes secondary and higher secondary school percentages and specialisations of unique students from the past. It also includes their UG specialisation, work experience and the salary offered to the placed students.
The strategy was arrived at by performing relevant Data Analysis on the given Dataset
Team Name & Details: "Art of War", comprising myself, Devark Chauhan and Shivam Arora.
Quantitative Analysis For Management 11th Edition Render Test BankRichmondere
Full download : http://alibabadownload.com/product/quantitative-analysis-for-management-11th-edition-render-test-bank/ Quantitative Analysis For Management 11th Edition Render Test Bank
This document proposes modifications to Pawlak's conflict theory model based on graph theory. It suggests developing the conflict analysis system to predict how the opinions of neutral agents may change over time. The approach involves:
1) Creating matrices to represent direct conflicts, alliances, and neutral relationships between agents.
2) Computing higher power matrices through multiplication to represent indirect relationships over increasing path lengths.
3) Weighting the matrices based on path length and summing values to predict if neutral relationships may become conflicts or alliances based on direct and indirect influences.
4) Optionally performing logical OR operations on conflict matrices to identify any direct or indirect conflicts between agents.
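The numbered steps above can be sketched directly with NumPy. The relation matrix, the path-length weights, and the sign convention below are hypothetical choices made only to show the mechanics of powering, weighting, and combining the matrices, not values from the paper:

```python
import numpy as np

# Signed relation matrix for 4 agents: +1 alliance, -1 conflict, 0 neutral.
R = np.array([
    [0,  1, -1,  0],
    [1,  0,  0,  0],
    [-1, 0,  0,  1],
    [0,  0,  1,  0],
])

# Indirect relations over longer paths: powers of R, down-weighted by path length (1/k is a guess).
max_len = 3
weights = [1.0 / k for k in range(1, max_len + 1)]
influence = sum(w * np.linalg.matrix_power(R, k) for k, w in enumerate(weights, start=1))

# Predict how currently neutral pairs may drift: positive -> alliance, negative -> conflict.
for i in range(len(R)):
    for j in range(i + 1, len(R)):
        if R[i, j] == 0:
            tendency = ("alliance" if influence[i, j] > 0
                        else "conflict" if influence[i, j] < 0 else "neutral")
            print(f"agents {i} and {j}: combined influence {influence[i, j]:+.2f} -> {tendency}")

# Optional step 4, mirrored mechanically: OR the direct conflict matrix with two-step reachability.
conflict = (R < 0)
any_conflict = conflict | (np.linalg.matrix_power(conflict.astype(int), 2) > 0)
print(any_conflict)
```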
This document contains 54 multiple choice questions about probability concepts from the textbook "Quantitative Analysis for Management, 11e". The questions cover topics such as fundamental probability concepts, mutually exclusive and collectively exhaustive events, statistically independent events, probability distributions including binomial and normal distributions, and Bayes' theorem. For each question, the answer and difficulty level is provided along with the topic area.
Automation of IT Ticket Automation using NLP and Deep LearningPranov Mishra
Overview of the problem solved: IT organizations rely on the Incident Management process to ensure that business operations are never impacted. In many IT organizations, the assignment of incidents to the appropriate IT groups is still a manual process. Manual assignment is time-consuming, requires human effort, and is prone to human error; misrouted incidents also lead to ineffective use of resources. It increases response and resolution times, which degrades user satisfaction and customer service.
Solution: Multiple deep learning sequential models with GloVe embeddings were attempted and the results compared to arrive at the best model. The two best models are highlighted below through their results.
1. A Bi-Directional LSTM trained on the data set gave an accuracy of 71% and a precision of 71%.
2. The accuracy and precision were further improved to 73% and 76% respectively when an ensemble of 7 Bi-LSTMs was built.
I built an NLP-based deep learning model to solve the above problem. Link below:
https://github.com/Pranov1984/Application-of-NLP-in-Automated-Classification-of-ticket-routing?fbclid=IwAR3wgofJNMT1bIFxL3P3IoRC3BTuWmhw1SzAyRtHp8vvj9F2sKZdq67SjDA
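A minimal sketch of the kind of bidirectional LSTM described above, assuming TensorFlow/Keras; the vocabulary size, sequence length, number of ticket groups, and layer sizes are placeholders, and the pre-trained GloVe matrix is only indicated in a comment rather than actually loaded:

```python
import tensorflow as tf

vocab_size, embed_dim, max_len, n_classes = 20000, 100, 200, 40   # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),                       # padded token-id sequences
    tf.keras.layers.Embedding(vocab_size, embed_dim),        # GloVe vectors could be set as initial weights
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)), # bidirectional sequence encoder
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(n_classes, activation="softmax"),  # one output per assignment group
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(padded_token_ids, group_labels, validation_split=0.1, epochs=5)  # hypothetical arrays
```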
A Fuzzy Mean-Variance-Skewness Portfolioselection Problem.inventionjournals
A fuzzy number is a normal and convex fuzzy subset of the real line. In this paper, based on membership function, we redefine the concepts of mean and variance for fuzzy numbers. Furthermore, we propose the concept of skewness and prove some desirable properties. A fuzzy mean-variance-skewness portfolio selection model is formulated and two variations are given, which are transformed to nonlinear optimization models with polynomial objective and constraint functions such that they can be solved analytically. Finally, we present some numerical examples to demonstrate the effectiveness of the proposed models.
Rank Computation Model for Distribution Product in Fuzzy Multiple Attribute D...TELKOMNIKA JOURNAL
Ranking activities is important for supporting work effectiveness. In previous work, distribution products were ranked manually or by simple averaging, so the problem addressed in this research is finding an effective way to rank distribution products. This research proposes assisting the ranking with a computational model based on Fuzzy Multiple Attribute Decision Making (FMADM). To obtain an effective ranking, suitable variables are required for the FMADM computation; the variables used in this research are the number of households, the number of small-scale enterprises run by households, gross regional domestic income, and the economic growth rate of a region. The research method consists of determining original values, determining degrees of membership, determining the weight of each variable, calculating the relation matrix, calculating the preference value of each village for ranking, and finally sorting. The operationalized FMADM yields three priority groups of districts. Priority one contains districts with a rank value Vij (alternative rank) above 0.4: only 7%, or 5 villages, have the highest rank. Priority two contains districts with Vij between 0.26 and 0.4: 62%, or 44 villages. Priority three contains districts with Vij below 0.26: 31%, or 22 villages. The impact of using FMADM for ranking is that the process runs effectively and dynamically as the weights change; users can adjust the weights as needed.
Operations research (OR) is an analytical method of problem-solving and decision-making that is useful in the management of organizations. In operations research, problems are broken down into basic components and then solved in defined steps by mathematical analysis.
Analytical methods used in OR include mathematical logic, simulation, network analysis, queuing theory, and game theory. The process can be broadly broken down into three steps.
1. A set of potential solutions to a problem is developed. (This set may be large.)
2. The alternatives derived in the first step are analyzed and reduced to a small set of solutions most likely to prove workable.
3. The alternatives derived in the second step are subjected to simulated implementation and, if possible, tested out in real-world situations. In this final step, psychology and management science often play important roles
Bank - Loan Purchase Modeling
This case is about a bank which has a growing customer base. Majority of these customers are liability customers (depositors) with varying size of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget. The department wants to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reduce the cost of the campaign. The dataset has data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Our job is to build the best model which can classify the right customers who have a higher probability of purchasing the loan. We are expected to do the following:
EDA of the data available. Showcase the results using appropriate graphs.
Apply appropriate clustering on the data and interpret the output.
Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the model outputs and make the necessary modifications wherever applicable (such as pruning).
Check the performance of all the models that you have built (test and train). Use all the model performance measures you have learned so far. Share your remarks on which model performs best.
Study on Evaluation of Venture Capital Based onInteractive Projection Algorithminventionjournals
International Journal of Business and Management Invention (IJBMI) is an international journal intended for professionals and researchers in all fields of Business and Management. IJBMI publishes research articles and reviews across the whole field of Business and Management, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected for publication through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
This document presents a proposed churn prediction model based on data mining techniques. The model consists of six steps: identifying the problem domain, data selection, investigating the data set, classification, clustering, and utilizing the knowledge gained. The authors apply their model to a data set of 5,000 mobile service customers using data mining tools. They train classification models using decision trees, neural networks, and support vector machines. Customers are classified as churners or non-churners. Churners are then clustered into three groups. The results are interpreted to gain insights into customer retention.
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ijsc
In fuzzy decision-making processes based on linguistic information, operations on discrete fuzzy numbers are commonly performed. Aggregation and defuzzification operations are among the most frequently used. Many aggregation and defuzzification operators produce results independent of the decision-maker's strategy. On the other hand, the Weighted Average Based on Levels (WABL) approach can take into account the level weights and the decision maker's "optimism" strategy. This gives flexibility to the WABL operator, which, through machine learning, can be trained in the direction of the decision maker's strategy, producing more satisfactory results for the decision maker. However, in order to determine the WABL value, it is necessary to calculate some integrals. In this study, the concept of WABL for discrete trapezoidal fuzzy numbers is investigated, and analytical formulas are proven that facilitate the calculation of the WABL value for these fuzzy numbers. Trapezoidal fuzzy numbers and their special form, triangular fuzzy numbers, are the most commonly used fuzzy number types in fuzzy modeling, so such numbers have been studied here. Computational examples explaining the theoretical results are provided.
Anomaly detection- Credit Card Fraud DetectionLipsa Panda
Team 7 presented their work on credit card fraud detection. They used a dataset containing over 284,000 transactions from September 2013, with 492 fraudulent transactions. They explored data sampling techniques like SMOTE to address class imbalance since frauds were only 0.172% of transactions. Their best model achieved 98% accuracy for fraud prediction while maintaining high precision and recall. They evaluated models using AUC and F1 score since accuracy is not appropriate for anomaly detection. They demonstrated their model's abilities through a live fraud detection demo and detecting anomalies in single transactions.
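A hedged sketch of the resampling-plus-evaluation workflow summarized above, assuming the imbalanced-learn package; a synthetic, heavily imbalanced dataset stands in for the card-fraud transactions, and the classifier and its settings are illustrative choices rather than the team's actual model:

```python
from imblearn.over_sampling import SMOTE            # assumes imbalanced-learn is installed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic stand-in (~0.2% positives, similar in spirit to card fraud).
X, y = make_classification(n_samples=50000, n_features=30, weights=[0.998, 0.002], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample only the training data; the test set keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)

# Evaluate with AUC and F1 rather than raw accuracy.
proba = clf.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, proba), 3))
print("F1 :", round(f1_score(y_te, clf.predict(X_te)), 3))
```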
This document discusses the statistical analysis carried out on survey data to estimate the willingness to pay (WTP) for improved water quality using multilevel modeling (MLM). It describes:
1) Conducting a conventional logistic regression analysis on the single-bound dichotomous choice (SBDC) responses before using MLM to account for the hierarchical structure of the data.
2) Estimating WTP from the double-bound dichotomous choice (DBDC) data using MLM, which models the natural hierarchy in responses nested within individuals.
3) Estimating the incidence of benefits across income groups using the WTP estimates from a linear regression of stated WTP responses. This found WTP generally
The document outlines an 8-phase process for appropriately tuning automated transaction monitoring scenarios to balance identifying suspicious activity while maximizing efficiency. Key aspects of the process include using statistical techniques like clustering to set baseline thresholds, qualitatively assessing "above" and "below" threshold samples, and iteratively refining thresholds based on case outcomes. The goal is to establish thresholds representative of the 85th percentile to define the cutoff between normal and unusual transactions.
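As a minimal illustration of the percentile-based thresholding step (the transaction amounts and the use of exactly the 85th percentile on raw amounts are placeholder assumptions, not the document's actual scenario logic):

```python
import numpy as np

rng = np.random.default_rng(42)
amounts = rng.lognormal(mean=6.0, sigma=1.2, size=10_000)   # hypothetical transaction amounts

threshold = np.percentile(amounts, 85)                      # 85th-percentile baseline cutoff
flagged = amounts[amounts > threshold]                      # "above threshold" population for review
print(f"threshold = {threshold:,.2f}; flagged {flagged.size} of {amounts.size} transactions")
```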
This chapter discusses confidence intervals for estimating population parameters from sample data. It begins by defining point estimates and interval estimates. The chapter then covers confidence intervals for estimating the mean of a population when the population variance is known and unknown, using the z-distribution and t-distribution respectively. It also discusses confidence intervals for estimating a population proportion. The chapter emphasizes that confidence intervals provide a range of plausible values for the population parameter rather than a single value.
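A short numeric companion to the chapter's two cases, using invented sample values: a t-interval when the population variance is estimated from the sample, and a z-interval when the population standard deviation is treated as known:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # invented data
n, mean, s = sample.size, sample.mean(), sample.std(ddof=1)

# t-interval: population variance unknown, estimated by the sample standard deviation s.
t_crit = stats.t.ppf(0.975, df=n - 1)
t_ci = (mean - t_crit * s / np.sqrt(n), mean + t_crit * s / np.sqrt(n))

# z-interval: population standard deviation assumed known (sigma = 0.3 is an assumption).
sigma = 0.3
z_crit = stats.norm.ppf(0.975)
z_ci = (mean - z_crit * sigma / np.sqrt(n), mean + z_crit * sigma / np.sqrt(n))

print("95% t-interval:", tuple(round(v, 3) for v in t_ci))
print("95% z-interval:", tuple(round(v, 3) for v in z_ci))
```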
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...ijcseit
This document discusses various statistical analysis and feature engineering techniques that can be used for model building in machine learning algorithms. It describes how proper feature extraction through techniques like correlation analysis, principal component analysis, recursive feature elimination, and feature importance can help improve the accuracy of machine learning models. The document provides examples of applying different feature selection methods like univariate selection, recursive feature elimination, and principal component analysis on a diabetes dataset. It also explains the mathematics behind principal component analysis and how feature importance is estimated using an extra trees classifier. Overall, the document emphasizes how statistical analysis and feature engineering are important for effective model building in machine learning.
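A compact, hedged sketch of three of the techniques mentioned (recursive feature elimination, principal component analysis, and tree-based feature importance); scikit-learn's built-in diabetes data with a binarized target stands in for the paper's dataset, and all parameter values are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in data: scikit-learn's diabetes set, target binarized at its median for classification.
X, y_cont = load_diabetes(return_X_y=True)
y = (y_cont > np.median(y_cont)).astype(int)

# Recursive feature elimination around a logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE-selected feature indices:", np.where(rfe.support_)[0])

# Principal component analysis: variance explained by the leading components.
pca = PCA(n_components=4).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Feature importance estimated with an extra-trees classifier.
importances = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y).feature_importances_
print("extra-trees importances:", importances.round(3))
```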
Cluster analysis in prespective to Marketing ResearchSahil Kapoor
Cluster analysis is a technique used to group similar objects into clusters. It is used in marketing research to segment customers and products. The key steps in cluster analysis are: 1) Formulating the problem by selecting relevant variables, 2) Selecting a distance measure such as Euclidean distance to assess object similarity, 3) Choosing a clustering procedure like hierarchical or k-means, 4) Deciding the number of clusters, 5) Interpreting and profiling the clusters. Cluster analysis has various applications in marketing like data reduction, identifying new product opportunities, and understanding consumer behavior.
This document presents a study comparing several machine learning models for personal credit scoring: logistic regression, multilayer perceptron, support vector machine, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C). The models were tested on datasets from a Portuguese bank. HLVQ-C achieved the highest accuracy and was the most useful model according to a proposed measure that considers earnings from denying bad credits and losses from denying good credits. While other models had higher error rates for good credits, HLVQ-C balanced accuracy and usefulness the best, making it suitable for commercial credit scoring applications.
1. The document describes a project to predict customer churn for a telecom company using logistic regression, KNN, and Naive Bayes models.
2. Exploratory data analysis was conducted on usage, contract, payment and other customer data, finding some variable correlation.
3. Logistic regression performed best with 75% accuracy. KNN accuracy was also good with K=2.
4. The models identified contract renewal and monthly charges as critical factors for churn, suggesting the company focus on these areas.
Modelling the expected loss of bodily injury claims using gradient boostingGregg Barrett
This document summarizes an effort to model the expected loss of bodily injury claims using gradient boosting. Frequency and severity models are built separately and then combined to estimate expected loss. Gradient boosting is chosen as the modeling approach due to its flexibility. Tuning parameters like shrinkage, number of trees, and depth must be selected. The goal is predictive accuracy over interpretability. Performance is evaluated on a test set not used for model selection.
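A hedged sketch of the frequency-times-severity structure using scikit-learn's histogram gradient boosting as a stand-in (the original work may use a different GBM implementation; the simulated claims data and the shrinkage, tree-count, and depth values are all assumptions):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 6))                                    # placeholder policy features
freq = rng.poisson(np.exp(-2.0 + 0.5 * X[:, 0]))               # simulated claim counts per policy
sev = np.where(freq > 0, rng.gamma(2.0, 1500.0, size=n), 0.0)  # simulated cost when a claim occurs

# Frequency model with Poisson loss; shrinkage, number of trees, and depth are the tuning knobs.
freq_model = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.05,
                                           max_iter=300, max_depth=3).fit(X, freq)

# Severity model, fit only on policies that actually had claims.
has_claim = freq > 0
sev_model = HistGradientBoostingRegressor(learning_rate=0.05, max_iter=300,
                                          max_depth=3).fit(X[has_claim], sev[has_claim])

# Expected loss = expected frequency x expected severity.
expected_loss = freq_model.predict(X) * np.clip(sev_model.predict(X), 0, None)
print("mean expected loss per policy:", round(expected_loss.mean(), 2))
```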
This document discusses using machine learning algorithms to predict household poverty levels. The goals are to build classification models to predict a household's poverty level as either "poor" or "non-poor" based on household attributes. Linear regression is proposed as the modeling algorithm. The document outlines collecting and preprocessing a household dataset, feature selection, model training and evaluation using metrics like MSE, RMSE and R-squared. References are provided on related work applying machine learning to poverty prediction using household surveys and satellite imagery.
This document provides a guide for building generalized linear models (GLMs) in R to accurately model insurance claim frequency and severity. It outlines steps for data preparation, including handling missing data, transforming variables, and removing outliers. It then discusses modeling count/frequency data with Poisson or negative binomial models including an offset for exposure. Severity is typically modeled with a gamma or normal distribution. The document provides examples of investigating interactions and comparing models using AIC, BIC, and residual analysis.
The objective was to develop a neural network model to predict loan defaults using variables like number of derogatory reports, number of delinquent credit lines, debt-to-income ratio, age of oldest credit line, and loan divided by value. The best model had a training profit of $25.48 million and testing profit of $27.48 million, outperforming logistic regression. Key changes were reducing variables and lowering the probability cutoff, which increased profits and lift. The neural network model was simpler, more consistent and predictable than complex models, with training and testing profits varying less than $150,000.
MACHINE LEARNING CLASSIFIERS TO ANALYZE CREDIT RISKIRJET Journal
This document discusses using machine learning classifiers to analyze credit risk. It examines various machine learning techniques for credit risk analysis, including Bayesian classifiers, naive Bayes, decision trees, k-nearest neighbors, multilayer perceptrons, support vector machines, and ensemble methods like bagging and boosting. Two credit datasets from the UCI machine learning repository were used to test the accuracy of these classifiers. The results showed decision trees had the highest accuracy at 89.9% and 71.25% on the two datasets, while k-nearest neighbors had the lowest. Future work could involve rebuilding the models with more accurate data to improve performance. The objective of credit risk analysis is to help banks and financial institutions balance approving loans to creditworthy borrowers
An application of artificial intelligent neural network and discriminant anal...Alexander Decker
This document presents a study that compares the predictive abilities of artificial neural networks and linear discriminant analysis for credit scoring. A credit dataset from a Nigerian bank with 200 applicants and 15 variables is used to build both neural network and linear discriminant models. The models are evaluated based on measures like accuracy, Wilks' lambda, and canonical correlation. Key findings are that the neural network model performs slightly better with less misclassification cost. However, variable selection is important for both models' success. Age, length of service, and other borrowing are found to be the most important predictor variables.
The document provides an overview of credit scoring and scorecard development. It discusses:
- The objectives of credit scoring in assessing credit risk and forecasting good/bad applicants.
- The types of clients that are categorized for scoring, including good, bad, indeterminate, insufficient, excluded, and rejected.
- The research objectives and challenges in building statistical models to assign risk scores and monitor model performance.
- The research methodology involving data partitioning, variable binning, scorecard modeling using logistic regression, and scorecard evaluation metrics like KS, Gini, and lift.
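Two of the evaluation metrics named in the last bullet can be computed from model scores as sketched below; the data and the logistic scorecard are synthetic stand-ins, Gini is taken as 2·AUC − 1, and KS is the maximum gap between the cumulative bad and good distributions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic applicants: y = 1 marks a "bad" account.
X, y = make_classification(n_samples=10000, n_features=15, weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=3)

score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, score)
gini = 2 * auc - 1
fpr, tpr, _ = roc_curve(y_te, score)      # KS = max separation of the bad and good cumulative curves
ks = np.max(tpr - fpr)

print(f"AUC = {auc:.3f}, Gini = {gini:.3f}, KS = {ks:.3f}")
```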
The document describes an 8-phase statistical approach to developing a customer risk rating model. The model assesses money laundering risk for a bank's customers based on their profiles. Phase 1 defines risk categories like geography, customer, product/account, and transaction attributes. Phase 2 analyzes data completeness and variation to select meaningful attributes. Phase 3 tests attribute correlations. Phase 4 samples customer profiles for subject matter expert risk ratings used to calibrate the model in Phase 5. Phase 6 evaluates the model's performance and uncertainty. Phase 7 implements the optimized model to automatically rate customer risk.
This document discusses applying machine learning algorithms to three datasets: a housing dataset to predict prices, a banking dataset to predict customer churn, and a credit card dataset for customer segmentation. For housing prices, linear regression, regression trees and gradient boosted trees are applied and evaluated on test data using R2 and RMSE. For customer churn, logistic regression and random forests are used with sampling to address class imbalance, and evaluated using confusion matrix metrics. For credit card data, k-means clustering with PCA is used to segment customers into groups.
The document analyzes models for predicting loan default using a German credit dataset. It fits generalized linear models, generalized additive models, linear discriminant analysis, and classification trees to the data. Based on out-of-sample testing, the linear discriminant analysis model provided the best results with a minimum misclassification rate of 0.40 and maximum area under the ROC curve of 0.867. However, the performance of all the models was quite similar.
This document provides an overview of consumer credit risk modeling and scoring. It discusses various statistical methods used for credit scoring like logistic regression, neural networks, and support vector machines (SVM). For SVM, it describes how the optimal separating hyperplane is chosen to maximize the margin between different classes of data. It also discusses challenges in consumer lending and best practices for credit risk management.
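For reference, the textbook hard-margin formulation behind the "maximize the margin" statement above (a standard result, not a formula quoted from the document): the separating hyperplane w·x + b = 0 is chosen by solving

\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1,\qquad i = 1,\dots,n,

which maximizes the geometric margin 2/‖w‖ between the two classes; soft-margin and kernel variants relax these constraints for data that are not linearly separable.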
This document summarizes a student research project that developed a machine learning model to predict loan approval. The researchers tested various algorithms on a dataset of 615 loan applications and found that logistic regression performed best with an accuracy of 88.7%. They created a web application where users can enter loan application details and the model will predict approval. While the model considers many attributes, in reality a single strong attribute could also determine approval, which the system cannot account for.
A Novel Performance Measure For Machine Learning ClassificationKarin Faust
This document provides a comprehensive overview of performance measures used to evaluate machine learning classification models. It discusses measures based on confusion matrices (e.g. accuracy, recall, precision), probability deviations (e.g. mean absolute error, brier score, cross entropy), and discriminatory power (e.g. ROC curves, AUC). The document highlights limitations of individual measures and proposes a framework to construct a new composite measure based on voting results from existing measures to better evaluate model performance.
This document provides an overview of machine learning and logistic regression. It discusses key concepts in machine learning like representation, evaluation, and optimization. It also discusses different machine learning algorithms like decision trees, neural networks, and support vector machines. The document then focuses on logistic regression, explaining concepts like maximum likelihood estimation, concordance, and confusion matrices which are used to evaluate logistic regression models. It provides an example of using logistic regression for a banking customer classification problem to predict defaults.
The document proposes using several machine learning algorithms, including multilayer perceptron neural networks, logistic regression, support vector machines, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C), to improve personal credit scoring accuracy. The algorithms were tested on a large dataset from a Portuguese bank containing over 400,000 entries. HLVQ-C achieved the most accurate results, outperforming traditional linear methods. The document introduces a "usefulness" measure to evaluate classifiers based on earnings from correctly denying credit to risky applicants and losses from misclassifications.
The document presents a study comparing various machine learning algorithms for personal credit scoring, including logistic regression, multilayer perceptron, support vector machines, AdaBoostM1, and Hidden Layer Learning Vector Quantization (HLVQ-C). The models were tested on datasets from a Portuguese bank. HLVQ-C achieved the highest accuracy on both datasets and the best performance on a proposed "usefulness" metric, making it the most effective model for credit scoring applications according to this research.
A NOVEL PERFORMANCE MEASURE FOR MACHINE LEARNING CLASSIFICATIONIJMIT JOURNAL
Machine learning models have been widely used in numerous classification problems and performance measures play a critical role in machine learning model development, selection, and evaluation. This paper covers a comprehensive overview of performance measures in machine learning classification. Besides, we proposed a framework to construct a novel evaluation metric that is based on the voting results of three performance measures, each of which has strengths and limitations. The new metric can be proved better than accuracy in terms of consistency and discriminancy.
BI APPLICATION IN FINANCIAL SECTOR - CREDIT SCORING OF RETAIL LOANS USING A HYBRID MODELING APPROACH
Prof. S. Chandrasekhar, B. Tech., M. Tech. ( IIT – Kanpur ), Ph. D. ( U.S.A. )
Chair Professor and Area Chair – I.T.
FORE School of Management, New Delhi, B-18, Qutub Institutional Area, New Delhi – 110016 ( India )
Email – sch@fsm.ac.in
Abstract
Retail Loans nowadays form a major proportion of the Loan Portfolio. Broadly they can be classified as (i) Loans for the Small and Medium Sector and (ii) Loans for Individuals. The objective of Credit Scoring is to exercise enough precaution before the sanction of the loan so that loans do not go bad after disbursement. This adds to the bottom line of the financial institution and also reduces the Credit Risk.
Techniques used to perform Credit Scoring vary for the above two classes of loans. In this paper, we concentrate on the application of Credit Scoring to individual or so-called personal loans, such as auto loans and loans for buying goods like televisions, refrigerators etc. Large numbers of loans are being disbursed in these areas. Though the size of each loan may be small when compared to Small/Medium Scale Industry, if one does not control the defaults, the consequences will be disastrous.
From the borrower characteristics and product characteristics, a Credit Score is computed for each applicant. If the Score exceeds a given threshold, the loan is sanctioned; if it is below the threshold, the loan is rejected. In practice a buffer zone is created near the threshold, so that for Credit Scores that fall in the buffer zone a detailed investigation is done before a decision is taken.
Two broad classes of Scoring Models exist: (i) Subjective Scoring and (ii) Statistical Scoring.
Subjective Scoring is based on intuitive judgement. Subjective Scoring works, but there is scope for improvement: one limitation is that the prediction of risk is person dependent, focuses on a few characteristics, and may mistakenly focus on the wrong characteristics.
Statistical Scoring uses hard data on borrower characteristics and product characteristics and applies mathematical models to predict the risk. The relation is expressed in the form of an equation which finally gets converted to a score. Subjectivity is reduced, and the variable(s) that are important to scoring are identified on a sound mathematical foundation.
Different models have been used in Credit Scoring, such as Regression, Decision Tree, Discriminant Analysis and Logistic Regression. Most of the time, a single model is used to compute the Credit Score. This works well when the underlying decision rule is simple; when the rule becomes complex, the accuracy of the model diminishes very fast.
In this research paper, a combination of Decision Tree and Logistic Regression is used to determine the weights that are to be assigned to the different characteristics of the borrower. The Decision Tree is used at the first level of analysis to narrow down the important variables and the overall weights that need to be assigned. It is also used for optimum grouping of numeric and non-numeric variables. At the second level, Logistic Regression is used to compute odds ratios, a variant of probability, which in turn are used to assign weights within an attribute and to the individual levels of an attribute.
This has been tested on real-life data and found to work better than methods using single-stage models. An accuracy of around 80% is obtained, which is good for any modelling study, as no model gives 100% accuracy.
The next Section explains the Methodology, Data Used and Results.
Methodology Adopted
SPSS Software has been used for Model Building and Data Analysis.
A sample of 379 records was taken from a Bank database. The following Variables were used for the Analysis:-
SEX : MALE/FEMALE
AGE : CONTINUOUS
TIME-AT-ADDRESS : HOW LONG THE CUSTOMER IS LIVING AT THIS ADDRESS
RESIDENCE STATUS : OWNER, RENTED
TELEPHONE : YES/NO
TIME-BANK : HOW LONG THE CUSTOMER IS WITH THE BANK ; NUMERIC
HOME EXPENSES : NUMERIC
DECISION : SANCTIONED/REJECTED
OCCUPATION : CREATIVE, DRIVER, GUARD, LABOURER, MANAGER, OFFICE-STAFF,
PRODUCTION, PROFESSIONAL, SALES, SEMI-PRODUCTION
JOB STATUS : GOVERNMENT, PRIVATE, SELF-EMPLOYED
TIME-EMPLOYED : LENGTH OF EMPLOYMENT - NUMERIC
LIAB. - REFERENCE : TRUE/FALSE
ACC. - REFERENCE : GIVEN/OTHER INSTRUCTIONS
BALANCE : NUMERIC
DEPENDENT VARIABLE : DECISION
INDEPENDENT VARIABLE(S) : ALL OTHERS
Since the Independent Variables are a combination of Numeric and Non-numeric Variables and the Dependent Variable is categorical (not numeric), ordinary regression cannot be used; instead, a Decision Tree using the Chi-squared Automatic Interaction Detection (CHAID) method is used to analyse the data. Out of 379 cases, 53% (201) were rejected and 47% (178) were accepted.
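As a rough illustration of this step, a comparable tree could be grown and evaluated along the following lines. This is a minimal sketch, not the SPSS procedure used in the paper: scikit-learn does not implement CHAID, so its CART classifier is used as a stand-in, and the file and column names are hypothetical.

```python
# A minimal sketch: CART stands in for CHAID; file and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("loan_applications.csv")            # hypothetical data file
y = (df["DECN_RECD"] == "Sanction").astype(int)      # 1 = sanctioned, 0 = rejected

# One-hot encode non-numeric variables so the tree can split on them.
X = pd.get_dummies(df.drop(columns=["DECN_RECD"]))

tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=30, random_state=0)
tree.fit(X, y)

pred = tree.predict(X)
print(confusion_matrix(y, pred))                      # rows: observed, columns: predicted
print(f"overall accuracy: {accuracy_score(y, pred):.1%}")
```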
Summary of the tree is given below:-
Model Summary
Specifications
  Growing Method: CHAID
  Dependent Variable: DECN_RECD
  Independent Variables: SEX, AGE, TIME_AT_ADDRESS, RES_STATUS, TELEPHONE, OCCUPATION, JOB_STATUS, TIME_EMPLOYED, TIME_BANK, LIAB_REF, ACC_REF, HOME_EXPN, BALANCE
Results
  Independent Variables Included: TIME_BANK, TIME_EMPLOYED
  Number of Nodes: 8
  Number of Terminal Nodes: 5
  Depth: 2
The Graphical Tree is shown below:-
Classification Accuracy is given below:-

Classification
Observed             Predicted
                     Rejected   Sanction   Percent Correct
Rejected             178        23         88.6%
Sanction             74         104        58.4%
Overall Percentage   66.5%      33.5%      74.4%

Growing Method: CHAID
Dependent Variable: DECN_RECD
Looking at the Model Summary table, the model has identified Time-Bank and Time-Employed as the important Variables for prediction and has rejected all other Variables. The overall classification accuracy is 74%. All Non-numeric Variables like Occupation, Job Status, Residential Status etc. are ignored.
One reason for rejecting all the Non-numeric Variables may be the sample size of the data set and the heavy influence of the Numeric Variables. In practice, Non-numeric Variables also play an important role in Credit Scoring.
To validate this assumption, in the next step two Decision Trees were built. The first used only Numeric Variables: Time-At-Address, Time-Bank, Home-Expenses, Balance; its classification accuracy is 77%. The second tree used only Non-numeric Variables like Occupation, Job Status, Residence Status etc.; its classification accuracy was 66%, which is not low.
Hence one can include both Numeric and Non-numeric Variables, but with a slightly higher weightage for Numeric compared to Non-numeric Variables. The weights are inversely proportional to the error, i.e. if the error is high the weight will be low and vice-versa.
So the overall weightage assigned to the Numeric Variables is (77/(77+66)) x 100 ≈ 54%, and that assigned to the Non-numeric Variables is (66/(77+66)) x 100 ≈ 46%.
The next step is to assign individual weights to the different levels within the Numeric and Non-numeric Variables, keeping the overall weights at 54% and 46%.
Logistic Regression is used for assigning Individual Weights.
Logistic Regression gives a functional relation between the Dependent and Independent Variables. The only difference between Logistic Regression and usual Regression is that the Dependent Variable is categorical and has only two states of nature, like Loan Application Accepted/Rejected, Person defaulting a loan/not defaulting, etc.
The functional form can be expressed as:

Ln [ P(Loan rejected) / P(Loan sanctioned) ] = α0 + α1 x Time-at-Address + α2 x Time-Employed + …

where α0, α1, α2, … are called the Logit Coefficients and are estimated from the sample data using maximum likelihood estimation.
Out of all the Independent Variables taken for model building, the methodology of assigning individual weights to different attribute-level values is explained below, with one example for a Numeric Variable and another for a Non-numeric Variable. The same methodology is repeated for all other Numeric and Non-numeric Variables.
When all the Numeric Variables are taken as Independent Variables and DECISION as the Dependent Variable, Logistic Regression is run. Based on significance, the following Variables are retained:-
TIME-BANK (1.53)   TIME-EMPLOYED (1.20)   BALANCE (1.0)
The numbers in brackets indicate the odds ratios, and all are significant at the 5% level, i.e. for each of these coefficients there is at most a 5% probability of observing such an effect if the true coefficient were zero. A Cox & Snell R-Square of about 0.4 supports the hypothesis that the Logistic Regression fits the data; if the Cox & Snell R-Square were below 0.05, one would conclude that the Logistic Regression does not fit the data.
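For readers working outside SPSS, the odds ratios and a Cox & Snell R-Square could be obtained with statsmodels roughly as follows. This is a sketch under the assumption that `df` is the hypothetical DataFrame introduced earlier, and the variable list is only illustrative.

```python
# A sketch of obtaining odds ratios and a Cox & Snell R-Square with statsmodels;
# `df` is the hypothetical DataFrame from the earlier sketch.
import numpy as np
import statsmodels.api as sm

numeric_cols = ["TIME_BANK", "TIME_EMPLOYED", "BALANCE",
                "TIME_AT_ADDRESS", "HOME_EXPN", "AGE"]
y = (df["DECN_RECD"] == "Sanction").astype(int)
X = sm.add_constant(df[numeric_cols])

logit = sm.Logit(y, X).fit()
print(logit.summary())                     # coefficients with p-values (5% level)

odds_ratios = np.exp(logit.params)         # odds ratio = exp(logit coefficient)
print(odds_ratios)

# Cox & Snell R-Square, computed from the log-likelihoods.
cox_snell = 1 - np.exp(2 * (logit.llnull - logit.llf) / logit.nobs)
print(f"Cox & Snell R-Square: {cox_snell:.2f}")
```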
Out of the total weight of 54% assigned to all the Numeric Variables, the break-up of weights for the THREE significant Numeric Variables is worked out as follows using the odds ratios reported above:-
TIME-BANK = (1.53/(1.53+1.20+1.0)) x 54 ≈ 22
TIME-EMPLOYED = (1.20/(1.53+1.20+1.0)) x 54 ≈ 17
BALANCE = (1.0/(1.53+1.20+1.0)) x 54 ≈ 15 (rounded so that the three weights sum to 54)
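The same arithmetic, shown as a small sketch (all numbers are those reported above):

```python
# Reproducing the weight arithmetic with the numbers reported in the paper.
numeric_acc, non_numeric_acc = 77, 66                       # tree accuracies (%)
total = numeric_acc + non_numeric_acc
w_numeric = round(numeric_acc / total * 100)                # -> 54
w_non_numeric = round(non_numeric_acc / total * 100)        # -> 46

# Break up the numeric weight across the significant variables in proportion
# to their odds ratios.
odds = {"TIME_BANK": 1.53, "TIME_EMPLOYED": 1.20, "BALANCE": 1.0}
weights = {k: round(v / sum(odds.values()) * w_numeric) for k, v in odds.items()}
print(weights)   # {'TIME_BANK': 22, 'TIME_EMPLOYED': 17, 'BALANCE': 14}
# The paper rounds BALANCE up to 15 so that the three weights sum to 54.
```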
At the next level, let us take one Variable used in the Scoring, i.e. TIME-BANK. Its total weight is 22 and the Variable takes on values in the range 0-67. The total weight of 22 needs to be broken up over different bins, and the binning should be automatic instead of person dependent.
For this, the methodology used is as follows:-
The histogram was displayed and, using visual binning, the Variable TIME-BANK was binned into 5 bins. This may not be optimum. The next step was to use a Decision Tree to automatically determine the number of bins: the Dependent Variable is DECISION and the Independent Variable is TIME-BANK grouped into the 5 bins. The output of the Decision Tree is given below:-
FIGURE : 2
Looking at the tree output shown above, it is clear that only THREE bins are optimum compared to the manual FIVE bins. The bins are as under:-
TIME-BANK ≤ 1.0 : Bin 1
1.0 < TIME-BANK ≤ 3.0 : Bin 2
TIME-BANK > 3.0 : Bin 3
Again, Logistic Regression is used to determine the individual weight for each bin. The Logit Coefficients for Bin 1, Bin 2 and Bin 3 are 0.4, 3.0 and 19.0 respectively. Distributing the total weight of 22 in proportion to these coefficients gives bin weights of approximately 0.4, 3.0 and 18.6, which are the TIME-BANK scores that appear in the scored data further below.
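A sketch of how the tree-based binning and the bin-weight allocation could be reproduced (again using CART as a stand-in for CHAID; `df` and the column names are the same hypothetical ones as before):

```python
# A sketch of tree-based binning for one variable (TIME_BANK), with CART
# standing in for CHAID; `df` and column names are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = df[["TIME_BANK"]]
y = (df["DECN_RECD"] == "Sanction").astype(int)

binner = DecisionTreeClassifier(max_leaf_nodes=3, min_samples_leaf=30, random_state=0)
binner.fit(x, y)

# The split thresholds of the fitted tree define the bin boundaries
# (leaf nodes carry a placeholder threshold of -2).
boundaries = sorted(t for t in binner.tree_.threshold if t != -2)
print("bin boundaries:", boundaries)

# Allocate the variable's total weight (22) across the bins in proportion to the
# logit coefficients reported in the paper (0.4, 3.0, 19.0).
coefs = np.array([0.4, 3.0, 19.0])
bin_weights = np.round(coefs / coefs.sum() * 22, 1)
print("bin weights:", bin_weights)   # [ 0.4  2.9 18.7], close to 0.4 / 3.0 / 18.6
```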
A similar process is repeated for the other two Variables, i.e. TIME-EMPLOYED and BALANCE.
When this process was applied to the Non-numeric Variables like JOB-STATUS, OCCUPATION, RESIDENTIAL STATUS, ACC.-REF. and TELEPHONE, only TWO Variables were significant.
The total weight of 46% for the Non-numeric Variables is split into two parts: OCCUPATION : 40 and JOB STATUS : 6.
Using the Decision Tree and Logistic Regression, the Weights for the different attribute values under Occupation are given below:-
Professional 12.5
Semi-professional 9.5
Office Staff 5.2
Production 2.9
Creative 2.80
Sales 1.92
Labourer 1.27
Driver 1.00
Guard 0.86
TOTAL 40
For Job-Status the Weights are:
Government 2.4
Self-employed 2.5
Private 1
TOTAL 6.0
Once the Weights for all the significant Independent Variables have been computed using the Decision Tree – Logistic Regression iteration, the data is scored. The Score for each record is the sum of the Weights of the bins/attribute levels into which its values fall. The data is then sorted in descending order of Credit Score.
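A minimal sketch of this scoring step, using the bin weights from the tables above (the DataFrame and column names are hypothetical, chosen to match the scored-data columns shown next):

```python
# A minimal sketch of the scoring step: an applicant's Credit Score is the sum of
# the weights of the bins / attribute levels into which their values fall.
# Bin columns and weight values follow the tables above; names are hypothetical.
import pandas as pd

time_bank_weights = {1: 0.4, 2: 3.0, 3: 18.6}        # TIME_BANK bins
time_employed_weights = {1: 0.9, 2: 4.4, 3: 11.8}    # TIME_EMPLOYED bins
balance_weights = {1: 13.5, 2: 0.5}                  # two-level binned BALANCE
job_status_weights = {"GOVERNMENT": 2.4, "SELF-EMPLOYED": 2.5, "PRIVATE": 1.0}
occupation_weights = {"PROFESSIONAL": 12.5, "SEMI-PROFESSIONAL": 9.5,
                      "OFFICE-STAFF": 5.2, "PRODUCTION": 2.9, "CREATIVE": 2.8,
                      "SALES": 1.92, "LABOURER": 1.27, "DRIVER": 1.0, "GUARD": 0.86}

def credit_score(row: pd.Series) -> float:
    return (time_bank_weights[row["TIME_BNK_BIN3"]]
            + time_employed_weights[row["TIMES_EMPL_BIN3"]]
            + balance_weights[row["BAL_BINED_TWO_LVL"]]
            + job_status_weights[row["JOB_STATUS"]]
            + occupation_weights[row["OCCUPATION"]])

scored = df.assign(Total_credit_Score=df.apply(credit_score, axis=1))
scored = scored.sort_values("Total_credit_Score", ascending=False)
```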
A part of Scored Data is given below:-
DECN_RECD  BAL_BINED_TWO_LVL  TIMES_EMPL_BIN3  TIME_BNK_BIN3  TIMES_EMPLOYED_SCORE  TIME_BANK_SCORE  JOB_STAT_SCORE  OCCUPATION_SCORE  BALANCE_SCORE  Total_credit_Score
Rejected 2 1 1 0.9 0.4 1 9.5 0.5 12.3
Rejected 2 1 1 0.9 0.4 1 1 0.5 3.8
Rejected 2 2 1 4.4 0.4 2.4 1.92 0.5 9.62
Sanction 2 1 2 0.9 3 1 9.5 0.5 14.9
Sanction 2 1 3 0.9 18.6 1 9.5 0.5 30.5
Sanction 1 1 3 0.9 18.6 1 12.5 13.5 46.5
Rejected 2 1 1 0.9 0.4 1 2.9 0.5 5.7
Sanction 2 2 3 4.4 18.6 1 12.5 0.5 37
Rejected 2 1 2 0.9 3 1 5.2 0.5 10.6
Rejected 2 2 2 4.4 3 1 2.8 0.5 11.7
Rejected 2 1 1 0.9 0.4 2.4 2.8 0.5 7
Sanction 1 3 3 11.8 18.6 1 2.8 13.5 47.7
Rejected 2 1 1 0.9 0.4 1 2.8 0.5 5.6
Rejected 2 1 2 0.9 3 1 1.92 0.5 7.32
Sanction 2 1 3 0.9 18.6 1 2.8 0.5 23.8
Sanction 2 3 1 11.8 0.4 2.4 2.9 0.5 18
Rejected 2 3 1 11.8 0.4 1 1.95 0.5 15.65
Sanction 2 3 3 11.8 18.6 2.4 2.9 0.5 36.2
Rejected 2 1 1 0.9 0.4 2.4 1 0.5 5.2
Now one can try different cut-offs and see how the accuracy changes. As an example, let us set a cut-off score of 14, i.e. applicants whose Credit Score is greater than or equal to 14 are sanctioned the loan, and those scoring below 14 are rejected. With this threshold, 165 out of the 201 rejected cases are correctly rejected, so the accuracy of rejection is 165/201 x 100 ≈ 82%. Similarly, 124 of the 178 accepted cases score 14 or more, giving a classification accuracy for accepts of 124/178 x 100 ≈ 70%. The overall classification accuracy, (Accept-to-Accept + Reject-to-Reject)/Total Cases, works out to about 76%.
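A small sketch of how different cut-offs could be evaluated on the scored data (assuming the hypothetical `scored` DataFrame from the earlier sketch):

```python
# A sketch of evaluating different cut-offs on the scored data; `scored` is the
# hypothetical DataFrame from the previous sketch.
def evaluate_cutoff(scored, cutoff=14.0):
    predicted_sanction = scored["Total_credit_Score"] >= cutoff
    actual_sanction = scored["DECN_RECD"] == "Sanction"

    reject_acc = ((~predicted_sanction) & (~actual_sanction)).sum() / (~actual_sanction).sum()
    accept_acc = (predicted_sanction & actual_sanction).sum() / actual_sanction.sum()
    overall = (predicted_sanction == actual_sanction).mean()
    return reject_acc, accept_acc, overall

# With the figures reported in the paper this works out to roughly 82%, 70% and 76%.
print(evaluate_cutoff(scored, cutoff=14))
```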
Conclusion
The paper illustrates a hybrid model for Credit Scoring using Decision Tree and Logistic Regression in a recursive manner. SPSS Software has been used for model building and data analysis. The approach also optimizes the bins for both numeric and non-numeric Variables using a Decision Tree. Many times subjective weights are used based on the experience of the loan manager; this paper explores how institution-specific data can be used to derive the weights automatically, so that one can compare and fine-tune the subjective scores against actual data. Further work involves integrating an optimization technique like Linear/Goal Programming to fine-tune the weights.