Machine learning project
DATA SCIENCE PORTFOLIO
Jombaba.s7@gmail.com
I enjoy working with data, and this portfolio highlights one of the projects I have implemented using Python-based tools and programming platforms. The entire project is presented as five sub-projects, each capturing a specific part of the work.
Imagine that a financial institution or bank wishes to solve a 'Customer Acquisition and Customer Retention' problem. As a data scientist, this is my attempt at providing a complete solution, and the series of five projects illustrates a plausible approach to resolving the problem.
Machine Learning Project:
Purpose:
The purpose of the machine learning project is to find the best classifier for the class attribute A16 of our dataset. Essentially, we are looking for the classifier with the highest separability measure, i.e. the one that most clearly separates the "+" and "-" values of the class variable A16.
Dataset:
The dataset is available at https://archive.ics.uci.edu/ml/datasets/credit+approval.
The dataset comprises continuous and nominal attributes with both small and large values. For privacy reasons, the dataset was published with column labels A1-A16 instead of the actual descriptive labels.
Number of instances (observations): 690. Number of attributes: 15 (columns A1-A15). There is one class attribute (column A16); 307 instances (44.5%) have class "+" and 383 (55.5%) have class "-".
Attribute Label   Value Type
A1                Nominal
A2                Continuous
A3                Continuous
A4                Nominal
A5                Nominal
A6                Nominal
A7                Nominal
A8                Continuous
A9                Nominal
A10               Nominal
A11               Continuous (Integer)
A12               Nominal
A13               Nominal
A14               Continuous (Integer)
A15               Continuous (Integer)
A16               Class attribute
Process:
The pre-processing exercise of cleaning up the 'Credit Card Application' dataset provided a well-balanced dataset for the machine learning stage. The 67 missing values were replaced with statistically derived substitutes.
The dataset was further evaluated by cross-validating against the standard deviation or mean of error. Some rows and columns of the dataset were eliminated for reasons including high correlation between columns, outliers, and variance below 0.005.
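The clean-up steps described above can be sketched roughly as follows with pandas. The 0.005 variance cut-off comes from the text; the 0.9 correlation threshold and the mean/mode imputation rules are illustrative assumptions, since the original work only states that statistically derived substitutes and a correlation filter were used.

```python
import pandas as pd

def preprocess(df, var_cutoff=0.005, corr_cutoff=0.9):
    """Sketch of the clean-up: impute missing values, then drop
    near-constant and highly correlated numeric columns.
    corr_cutoff is an assumed value, not taken from the original work."""
    df = df.copy()
    for col in df.columns:
        if df[col].dtype.kind in "if":              # continuous: impute with the mean
            df[col] = df[col].fillna(df[col].mean())
        else:                                        # nominal: impute with the mode
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Drop near-constant numeric columns (variance < 0.005, as in the text)
    num = df.select_dtypes("number")
    low_var = num.columns[num.var() < var_cutoff]
    df = df.drop(columns=low_var)

    # Drop one column of every highly correlated numeric pair
    num = df.select_dtypes("number")
    corr = num.corr().abs()
    to_drop, cols = set(), list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_cutoff and b not in to_drop:
                to_drop.add(b)
    return df.drop(columns=to_drop)
```

Applied to the raw 690-record table, a pipeline of this shape would yield the reduced dataset used below.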
Hence, our analysis is based on a dataset comprising 538 records and 12 variables. Using stratified sampling, the dataset was partitioned into an 80% training set and a 20% testing set. We developed three predictors in order to compare the fit of the Decision Tree, Logistic Regression and Tree Ensemble models.
We use ROC-AUC as our fit metric for evaluating each model's ability to predict the value of the class variable A16. We also describe how to deploy the chosen model.
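The modelling pipeline described above can be sketched with scikit-learn. The synthetic data, the choice of a random forest as the "tree ensemble", and all hyperparameters here are assumptions for illustration; only the 538 x 12 shape, the stratified 80/20 split, the three model families, and the ROC-AUC metric come from the text. Note that 20% of 538 records gives the 108 test records mentioned in the deployment sections below.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the cleaned 538-record, 12-variable dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(538, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=538) > 0).astype(int)  # 1 ~ "+"

# Stratified 80/20 train/test partition, as described above
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=1),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Tree Ensemble": RandomForestClassifier(n_estimators=100, random_state=1),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # P(A16 = "+")
    aucs[name] = roc_auc_score(y_te, proba)
```

The model with the largest ROC-AUC on the held-out 20% is the one carried forward to deployment.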
Descriptions of the three models are highlighted below.
The Decision Tree model:
The Receiver Operating Characteristic Curve (ROC-AUC) for the Decision Tree Model
With a ROC-AUC value of 0.823 and an accuracy of 0.806, this is the best-fitted model of the three we evaluated. Furthermore, the model is stable and exhibits the following characteristics.
The confusion matrix for the decision tree model.
Accuracy statistics for the decision tree model.
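The confusion matrix and accuracy statistics reported for each model are derived from the test-set predictions. A minimal sketch of those calculations, assuming "+"/"-" class labels as in this dataset:

```python
def confusion_counts(actual, predicted, pos="+"):
    """Tally the four confusion-matrix cells for a binary classifier."""
    tp = sum(a == pos and p == pos for a, p in zip(actual, predicted))
    fp = sum(a != pos and p == pos for a, p in zip(actual, predicted))
    fn = sum(a == pos and p != pos for a, p in zip(actual, predicted))
    tn = sum(a != pos and p != pos for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    """Share of test records whose class was predicted correctly."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)
```

Running these over the 108 held-out records yields the accuracy figures quoted for each model.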
The Logistic Regression model:
The Receiver Operating Characteristic Curve (ROC-AUC) for the Logistic Regression Model:
This is an unstable model, with a ROC-AUC value of 0.764 and an accuracy of 0.787. The model exhibits the following characteristics.
The confusion matrix for the logistic regression model
Accuracy statistics for the logistic regression model
The Tree Ensemble model
The Receiver Operating Characteristic Curve (ROC-AUC) for the Tree Ensemble Model
This is a poor, unstable model with a ROC-AUC value of 0.3969 and an accuracy of 0.769. It exhibits the following characteristics.
The confusion matrix for the tree ensemble model
Accuracy statistics for the tree ensemble model.
Preferred Model
The decision tree model is best at separating the + and - values of the class variable A16. With an AUC value of 0.823, it handles the separability of these classes quite efficiently.
Furthermore, with an accuracy of 0.806, the decision tree model is preferred to the other models. This indicates that it is more dependable and stable than the other two.
Observe also that its true positive rate is the highest, standing at 0.775. The decision tree model is therefore the most sensitive at correctly predicting a positive response (recall), i.e. where the value of variable A16 is "+".
Its specificity stands at 0.824, so the true negative rate is also being predicted efficiently. Although this model does not have the highest score on every characteristic, it is preferable to the other two models.
Finally, its precision (positive predictive value) is fairly high at 0.721. Again, although this is not the highest among the three models we considered, the decision tree model is the most stable.
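The three metrics quoted here all fall out of the confusion-matrix counts. As a worked example, the counts tp=31, fp=12, fn=9, tn=56 used below are an inferred reconstruction: they total the 108 test records and reproduce the reported 0.775, 0.824, 0.721 and 0.806, but they are not taken directly from the original confusion-matrix figure.

```python
def rates(tp, fp, fn, tn):
    """Derive sensitivity, specificity and precision from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate / recall
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),   # positive predictive value
    }

# Counts consistent with the decision tree's reported statistics (inferred)
dt = rates(tp=31, fp=12, fn=9, tn=56)
```

With these counts, sensitivity is 31/40 = 0.775 and accuracy is (31 + 56)/108 = 0.806, matching the figures above.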
Deployment of the Decision Tree Model:
108 records of the dataset were used in testing the model. Shown here is an extract of the values predicted by the decision tree model. Pay attention to the two arrowed columns.
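The deployment extracts shown for each model pair the actual class with the model's prediction. A minimal sketch of how such a table could be assembled; the column names and probability values here are assumptions, since the original screenshots are not reproduced.

```python
import pandas as pd

# Hypothetical predicted probabilities for four test records,
# e.g. the output of model.predict_proba(X_te)[:, 1]
proba = [0.91, 0.12, 0.67, 0.33]
actual = ["+", "-", "+", "+"]

# Extract table: the two "arrowed" columns would be actual vs. predicted class
extract = pd.DataFrame({
    "A16 (actual)": actual,
    "P(A16 = +)": proba,
    "Prediction (A16)": ["+" if p >= 0.5 else "-" for p in proba],
})
```

The same table shape serves all three models; only the probability column changes.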
Deployment of the Logistic Regression Model:
108 records of the dataset were used in testing the model. Shown here is an extract of the values predicted by the logistic regression model. Pay attention to the two arrowed columns.
Deployment of the Tree Ensemble Model:
108 records of the dataset were used in testing the model. Shown here is an extract of the values predicted by the tree ensemble model. Pay attention to the two arrowed columns.
Effect of the preferred model:
We consider the likely impact, in dollars saved, that our choice of model might have on the credit provider's business. We base our deductions on the following assumptions:
Assumptions
1: It costs a bank or credit-offering company about $250.00 to acquire each new customer. This is based on an estimate by Mercator Advisory Group, a leading trusted advisor to the payments and banking industries.
2: At a 15% APR, the bank or credit provider will generate between $400.00 and $450.00 of interest income on every $3,000.00 of credit provided to a customer. This is by industry standards.
3: We do not concern ourselves with other fees (transaction fees, etc.) charged to cardholders.
4: The "+" value of the class variable represents a successful application, while "-" represents a failed application.
Business Impact
We use the tree ensemble model as the baseline for our estimates. For our purpose, it is reasonable to compare the tree ensemble and decision tree models' accuracy and sensitivity, i.e. the percentage of all truly positive results that were correctly classified by the algorithm.
Using the Sensitivity measure
The decision tree model has a sensitivity of 0.775, predicting that about 22.5% of applicants can be expected to fail. Using this model, about $5,625.00 of every $25,000.00 spent on acquiring 100 customers can be expected to be wasted on applicants who do not get approved.
Similarly, with a sensitivity of 0.55, the tree ensemble model predicts that about 45.0% of applicants can be expected to fail. Using this model, about $11,250.00 of every $25,000.00 spent on acquiring 100 customers can be expected to be wasted on applicants who do not get approved.
Consequently, the decision tree model can be expected to save the credit provider up to 50.0% of the funds wasted on applicants who do not get approved.
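The arithmetic behind these figures can be laid out directly; the $250 acquisition cost and the two sensitivity values come from the text.

```python
ACQUISITION_COST = 250.00   # assumed cost per new customer (Mercator estimate)
N_CUSTOMERS = 100

def wasted_spend(sensitivity):
    """Dollars spent acquiring applicants the model fails to approve,
    out of every 100-customer acquisition batch ($25,000 total)."""
    fail_rate = 1.0 - sensitivity
    return ACQUISITION_COST * N_CUSTOMERS * fail_rate

decision_tree = wasted_spend(0.775)             # $5,625 of every $25,000
tree_ensemble = wasted_spend(0.55)              # $11,250 of every $25,000
saving = 1 - decision_tree / tree_ensemble      # 0.5, i.e. a 50% reduction
```

The 50% figure follows because the decision tree's expected waste ($5,625) is exactly half of the tree ensemble's ($11,250).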
Based on these results, using the decision tree model instead of the tree ensemble model is expected to cut funds wasted on failed applicants by up to 50%. Thus, the management of credit card applications can be made more efficient by adopting an improved model, achieving higher levels of savings.