Primary Goals
1. To determine what factors are driving the lead conversion process.
2. To identify which leads are more likely to convert to paid customers.
Data Description
3. Dataset consists of 4613 rows and 15 columns.
Modelling Strategies
4. Plan
4.1 Perform Dummy Encoding
4.2 List Variables for Modeling
4.3 Identify metric of interest to judge model's performance
5. Build
5.1 Build Logistic Regression Model (Preliminary Model)
5.2 Observe the metrics of the model
6. Improve
6.1 Identify the significant variables
6.2 Rebuild model
6.3 Observe the metrics of the models
7. Decide
7.1 Compare the results of Logistic Regression model (Base model) and Decision Tree Model
7.2 Conclude on best model for this project
8. Recommend
8.1 Determine factors driving the lead conversion process
8.2 Recommend what may help to identify which leads are more likely to convert to paying customers
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
Predictive Analysis - Using Insight-informed Data to Determine Factors Driving the Lead Conversion Process
1. Using Insight-informed Data to
Determine Factors Driving the
Lead Conversion Process
An Application of
Logistic Regression Modelling
2. Agenda
1. Unlock Insights, Efficient Workflows
2. Machine Learning
3. Logistic Regression Model
4. Primary Goals
5. Description of Dataset
6. Modelling Strategies
7. Findings, Conclusions, and Recommendations
3. Unlock Insights, Efficient Workflows
Traditional Organisations
and Workflows
• Workflow defines the specific steps
taken to complete a task or process
• Workflow is supported by systems,
procedures and roles to ensure tasks
are completed efficiently and
consistently
Artificial Intelligence (AI)
Enables:
• Analysis of large amounts of data for
patterns, trends, and insights
• Automation of repetitive and
time-consuming tasks
• Predictions of future events or
outcomes
AI Augmenting Workflow
• AI is unlikely to completely eliminate
the need for workflows in organisations,
since workflows provide structure
and ensure tasks are completed
correctly and consistently
• AI can help optimise workflows by
identifying bottlenecks and suggesting
improvements
4. Machine Learning
• Machine learning, a branch of
AI, uses algorithms to learn
from data, improving on tasks
without explicit instructions
• It is used to improve automation,
accuracy, and personalization, in
areas such as image recognition,
language processing, and more
5. Logistic Regression Modelling
• Logistic regression is a statistical model that
predicts the probability of an outcome being
one of two classes (e.g., win/lose, yes/no)
• It is popular for its ease of interpretation and
efficiency, making it well-suited for tasks like
spam detection or sentiment analysis where
factors influencing the outcome need to be
understood
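As a minimal sketch of the idea above, the snippet below turns a weighted sum of features into a probability with the logistic (sigmoid) function and then into one of two classes. The intercept, coefficients, feature values and the 0.5 threshold are all hypothetical; none of them come from the project.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Return P(class = 1) for one observation under a logistic regression model."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(probability, threshold=0.5):
    """Map a probability to class 1 or 0 using an assumed cut-off."""
    return 1 if probability >= threshold else 0


# Hypothetical two-feature model; the numbers are illustrative only.
p = predict_probability(intercept=-1.0, coefficients=[0.8, 1.5], features=[1.0, 0.0])
print(round(p, 3), predict_class(p))
```

Because each coefficient acts on the log-odds directly, the sign and size of a coefficient can be read as the direction and strength of that factor's influence.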
5
An Application of Logistic Regression Modelling
6. Primary Goals
• To determine what factors are driving
the lead conversion process
• To identify which leads are more
likely to convert to paid customers
7. Description of Dataset
Data Description
Dataset consists of 4613 rows and 15
columns
Data Dictionary
A sample data dictionary* is given on the right:
* More details are found in the project report, which is
not released at the request of the Social Enterprise
8. Modelling Strategies
Plan
1. Perform Dummy
Encoding
2. List Variables for
Modeling
3. Identify metric of
interest to judge
model's performance
Build
4. Build Logistic
Regression Model
(Preliminary Model)
5. Observe the metrics
of the model
Improve
6. Identify the
significant variables
7. Rebuild model
8. Observe the metrics
of the models
Decide
9. Compare the results
of Logistic Regression
model (Base model)
and Decision Tree
Model
10. Conclude on best
model for this project
Recommend
11. Determine factors
driving the lead
conversion process
12. Recommend what
may help to
identify which leads
are more likely to
convert to paying
customers
9. 1. Perform Dummy Encoding
1. The ‘CSV Read’ and ‘Data
Explorer’ nodes were dragged and
dropped onto the KNIME Platform
to ingest and explore the variables
and data in the dataset: no
missing values were found
2. The ‘One_to_Many’ node was used to
convert non-numeric variables into
numeric form. Except for the ‘ID’ and
‘status’ variables, the remaining
non-numeric variables, totalling 9, were moved
to the ‘Include’ box of the ‘Configure’
function of the ‘One_to_Many’ node
10. 2. Identify Variables for Modeling
3. The final operation was to
filter the columns, using the
‘Column Filter’ node to remove
at least one of the dummy-encoded
columns from each of the
non-numeric variables, so that no
column is fully determined by the others
4. This is the list of 9 non-numeric
variables that can be used for modeling
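The dummy encoding and column filtering described above were done with KNIME nodes. As a rough plain-Python equivalent, the sketch below replaces one categorical column with 0/1 indicator columns and drops one category. The helper `dummy_encode`, the sample rows and the `value_column` naming pattern are assumptions made for illustration; they are not taken from the project workflow.

```python
def dummy_encode(rows, column, drop_first=True):
    """Replace a categorical column with 0/1 indicator columns.

    With drop_first=True, the first category (in sorted order) is dropped,
    so k categories produce k - 1 indicator columns.
    """
    categories = sorted({row[column] for row in rows})
    kept = categories[1:] if drop_first else categories
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in kept:
            new_row[f"{category}_{column}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows; the values are made up for illustration.
leads = [
    {"ID": "L1", "digital_media": "Yes", "status": 1},
    {"ID": "L2", "digital_media": "No", "status": 0},
]
encoded_leads = dummy_encode(leads, "digital_media")
print(encoded_leads)
```

Dropping one indicator per variable keeps the remaining columns from being perfectly collinear, which matters for a regression model that estimates one coefficient per column.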
11. 3. Select Metric of Interest
Goal
To predict whether a lead is converted
to a paid customer or not. The model
can make two kinds of wrong prediction,
with the following consequences:
False Negatives
The model predicts that the lead will not
convert to a paid customer, but the lead
converts. The impact is that resources are spent
when they are not needed, since the lead
already intended to convert, which works against
cost minimisation
False Positives
The model predicts that the lead will convert
to a paid customer, but the lead does not. The
impact is the loss of a customer, because
little or no effort went into the conversion:
the sales and marketing team believed that
the lead would convert without further investment
To minimise cost and to convert
customers, the model has to reduce both its
False Negatives and False Positives, as
misclassifying leads affects either
the sales from the customer base or
the monetary resources spent
on the leads
These are the metrics of interest, where TP, TN, FP and FN are the counts
of true positives, true negatives, false positives and false negatives:
• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Precision = TP / (TP + FP)
• Recall or sensitivity = TP / (TP + FN)
• F1 Score = (2 × Precision × Recall) /
(Precision + Recall)
Because the F1 Score balances Precision and Recall, a good model should have a high F1 Score!
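The four metrics of interest can be sketched in a few lines of Python. The example counts true and false positives and negatives from two label lists and then applies the formulas; the labels below are invented for illustration and are not project data.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = converted, 0 = not converted)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1, guarding against division by zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight leads.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(*confusion_counts(actual, predicted))
print(metrics)
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low, which is why it suits a goal of reducing both false positives and false negatives.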
12. 4. Build Logistic Regression Model
1. The ‘Partitioning’ node
was configured to split
the dataset into training
and testing sets in a
ratio of 7:3
2. The ‘Logistic
Regression Learner’ was
created with ‘status’ as
the ‘Target’
3. Two sets of ‘Logistic
Regression Predictor’,
‘Scorer’ and ‘ROC Curve’
nodes were created: one to
process the training dataset
and the other to process
the testing
dataset
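A 7:3 partition like the one in step 1 can be sketched as follows. The slides do not say which sampling mode the ‘Partitioning’ node used, so the seeded random shuffle here is an assumption, and the integer row identifiers merely stand in for the 4613 dataset rows.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row identifiers; the real dataset has 4613 rows.
rows = list(range(4613))
train, test = partition(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so the training and testing scorers can be compared across runs on the same partition.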
13. 5. Observe Metrics of Model
After feeding the training and testing datasets, from the ‘Partitioning’ node, into these
nodes, their scorers produced the following metrics with their corresponding ROC Curves
Training Dataset Scorer
ROC Curves
14. 5. Observe Metrics of Model
After feeding the training and testing datasets, from the ‘Partitioning’ node, into these
nodes, their scorers produced the following metrics with these corresponding ROC Curves
Testing Dataset Scorer
• The model’s performance is nearly the same on
the training and testing datasets. The overall accuracy on training
and testing data is 0.823 and 0.814, respectively
• The Recall for predicting the ‘1’ class is around 0.651, which
means that about 35% of the actual ‘1’ rows are incorrectly predicted as ‘0’.
The F1-score for predicting the ‘1’ class is around 0.67 because both the
recall (at 0.65) and precision (at 0.70) for predicting ‘1’ are low
• The AUC score for test data is around 0.87. As the
difference between the training and testing dataset metrics is
within 10%, the model does not appear to be overfitted
• As the F1 score is low, other modeling methods should be explored
to improve performance
ROC Curves
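The "within 10%" check on training versus testing metrics can be written as a small helper. The slides do not say whether the 10% is a relative or an absolute difference; the sketch below assumes a relative difference measured against the training metric, and uses the two accuracy figures reported above.

```python
def relative_gap(train_metric, test_metric):
    """Relative difference between a training and a testing metric."""
    return abs(train_metric - test_metric) / train_metric


def looks_overfitted(train_metric, test_metric, tolerance=0.10):
    """Flag a model when the train/test gap exceeds the assumed tolerance."""
    return relative_gap(train_metric, test_metric) > tolerance


# Accuracy on training and testing data as reported for the preliminary model.
gap = relative_gap(0.823, 0.814)
print(f"{gap:.2%}", looks_overfitted(0.823, 0.814))
```

A small gap is only a heuristic signal, so it supports the reading that the model is not overfitted without proving it.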
15. 6. Identify Significant Variables
The p-value indicates how likely it would be to see a relationship at least as strong as the one observed
if the variable had no real effect. In the preliminary model, there are
7 variables whose p-values are more than 0.05, starting with ‘Yes_digital_media’ at
0.608. Typically, a p-value that is less than or equal to 0.05 is treated as statistically significant,
which suggests that the observed relationship is unlikely to be a result of
chance
16. 7. Rebuild the Model
The model was rebuilt using the following
steps:
• Shift the variable with the highest p-value
above 0.05 to the ‘Exclude’ box of the
‘Configure’ function of the ‘Logistic
Regression Learner’
• Using the remaining variables, re-execute the
node
• Observe the changes in the p-values
through the ‘Coefficients and Statistics’
function of the node
• Identify the next variable with the highest
p-value
• Continue to iterate the process until the
p-values of all remaining variables are ≤ 0.05
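The iterative procedure above is a form of backward elimination, and its control flow can be sketched independently of KNIME. In this sketch `fit_p_values` is a hypothetical callable standing in for re-executing the learner and reading its ‘Coefficients and Statistics’ output; the stub and all variable names except ‘Yes_digital_media’ are invented for illustration.

```python
def backward_eliminate(variables, fit_p_values, alpha=0.05):
    """Drop the variable with the highest p-value until all are <= alpha.

    fit_p_values(variables) must refit the model on the given variables
    and return a dict mapping each variable name to its p-value.
    """
    remaining = list(variables)
    while remaining:
        p_values = fit_p_values(remaining)  # refit on the current variable set
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


def fake_fit(variables):
    """Stub with fixed p-values; a real refit would change them each round."""
    table = {"Yes_digital_media": 0.608, "var_a": 0.01, "var_b": 0.20}
    return {name: table[name] for name in variables}


kept = backward_eliminate(["Yes_digital_media", "var_a", "var_b"], fake_fit)
print(kept)
```

Removing one variable at a time matters because p-values can shift after each refit, so a variable that looked insignificant may become significant once a correlated one is gone.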
17. 7. Rebuild the Model
These are the nine variables with p-value ≤ 0.05 that are
retained to rebuild the model since they are statistically significant
18. 8. Observe Metrics of Models
After the model has been rebuilt, the scorers and ROC Curves for the training and testing
datasets show the following information:
Training Dataset Scorer
ROC Curves
19. 8. Observe Metrics of Models
Testing Dataset Scorer
• The rebuilt model’s performance is nearly the same
on the training and testing datasets
• The overall accuracy on training and testing data is 0.819 and 0.818,
respectively. Each is within 0.004 of the 0.823 and 0.814 of the preliminary
Logistic Regression Model
• For testing data, the Recall for predicting the ‘1’ class is around
0.658, compared with 0.651 for the previous model
• The F1-score for predicting the ‘1’ class is around 0.68 (previously
0.67)
• The AUC score for test data is around 0.87, which is
the same as the previous model
• As the difference between the training and testing dataset metrics is
within 10%, the model does not appear to be overfitted
• These results suggest that removing the insignificant variables has not materially improved the model
• As the rebuilt model’s F1 score is still low, other modeling methods should be explored to improve performance; the
rebuilt model confirms this earlier recommendation
ROC Curves
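The AUC reported alongside the ROC Curves can be read as the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead. The sketch below computes it by comparing every positive/negative pair; the labels and scores are invented, and this pairwise method is only one of several equivalent ways to obtain the value.

```python
def auc_score(actual, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for five leads.
actual = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = auc_score(actual, scores)
print(round(auc, 3))
```

Unlike accuracy or F1, AUC does not depend on a single classification threshold, which is why it can stay unchanged between two models whose recall differs slightly.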
20. 9 & 10. Which Is the Better Model?
A Decision Tree Model was created, using the same dataset, and its metrics of interest
were compared with the rebuilt Logistic Regression Model:
• The numbers highlighted in green and yellow show that the Decision Tree’s metrics of interest are
superior to those of the rebuilt Logistic Regression Model, which suggests that the Decision Tree should
be the model used to predict whether a lead is converted to a paid customer or not.
• Nevertheless, despite these better metrics, Decision Trees have limitations. These
include their instability and the risk that their prediction accuracy is not as good as that of other
models, such as the Logistic Regression approach.
• Because of these disadvantages, the rebuilt Logistic Regression Model has been used for the
final analysis
21. 11. Factors Driving Lead Conversion*
It is observed that the following lead features have a positive impact on the
conversion of leads to paying customers:
• A high or medium level of profile completion on the website/mobile app
• The lead first interacted with Social Enterprise through its website
• The lead’s current occupation is professional or unemployed
• The lead heard about Social Enterprise through references
• The lead’s last interaction with Social Enterprise was through activity on
its website (live chat with a representative, updating the profile on the website, etc.)
or through email (seeking details about the programme by email, a
representative sharing information such as a programme brochure with the lead, etc.)
* More details are found in the project report, which is
not released at the request of the Social Enterprise
22. 12. Which Leads are Likely to Convert*
A one-unit increase in the completion of leads’ profiles on Social Enterprise’s
website/mobile app, one more unit of leads interacting with Social
Enterprise’s representatives through its website, and marketing
and sales outreach that brings in one more unit of leads who work in the
professional field would each increase the conversion rate of leads into
paying customers
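In a logistic regression, a one-unit increase in a variable with coefficient β multiplies the odds of conversion by e^β. The sketch below shows how such a coefficient translates into a change in conversion probability. The coefficient of 0.5 and the 30% baseline are hypothetical, since the fitted coefficients are not reproduced in these slides.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the variable."""
    return math.exp(coefficient)


def probability_after_unit_increase(base_probability, coefficient):
    """Convert a baseline probability to odds, apply the odds ratio, convert back."""
    odds = base_probability / (1.0 - base_probability)
    new_odds = odds * odds_ratio(coefficient)
    return new_odds / (1.0 + new_odds)


# Hypothetical values: a coefficient of 0.5 and a 30% baseline conversion probability.
new_p = probability_after_unit_increase(0.30, 0.5)
print(round(odds_ratio(0.5), 3), round(new_p, 3))
```

The same coefficient produces different probability changes at different baselines, so statements about a one-unit increase are most precise when expressed in terms of odds.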
These insights are useful in informing decisions on creating a positive
overall experience for leads using the social media channels managed by Social
Enterprise, on marketing efforts to increase awareness of
these channels amongst leads, and on fine-tuning the
marketing and product mix to appeal to professionals
* More details are found in the project report, which is
not released at the request of the Social Enterprise