- The document presents research on predicting online purchases for Homesite Group Inc. using conversion prediction modeling.
- Various machine learning algorithms were tested on Homesite's dataset to predict whether an insurance quote seeker would purchase a policy, including naive Bayes, k-nearest neighbors, logistic regression, and boosting.
- The models were tested on different train-test splits of Homesite's dataset containing 260,000 training and 173,000 test records. The logistic regression model achieved the highest accuracy of around 81% at predicting conversions.
Tuning the Untunable - Insights on Deep Learning Optimization (SigOpt)
Patrick Hayes originally gave this talk at ODSC West in 2018. During this talk, Patrick discusses a couple key barriers to deep learning optimization and how SigOpt solves them. First, Patrick discusses the problem of lengthy training cycles and how novel techniques like multitask optimization are designed to use partial information to solve this challenge. Second, Patrick discusses automated cluster management and how solving this problem makes it much easier to manage training cycles for these models.
Comparative Recommender System Evaluation: Benchmarking Recommendation Frame... (Alan Said)
Video available here http://www.youtube.com/watch?v=1jHxGCl8RXc
Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender.
However, it is difficult to compare results from different recommender systems due to the many options in design and implementation of an evaluation strategy.
Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations.
In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks.
To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics.
We also include results using the internal evaluation mechanisms of these frameworks.
Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e. the same baselines may perform orders of magnitude better or worse across frameworks.
Our results show the necessity of clear guidelines when reporting evaluation of recommender systems to ensure reproducibility and comparison of results.
Understanding how high-powered ML models arrive at their predictions is an important aspect of machine learning, and SHAP is a powerful tool that enables practitioners to understand how different features combine to help a model arrive at a prediction.
This slide deck is from a presentation given at PyData Global on the theoretical foundations of SHAP as well as how to use its library. The presentation can be found here: https://pydata.org/global2021/schedule/presentation/3/behind-the-black-box-how-to-understand-any-ml-model-using-shap/
In this video I’m going to show you how SigOpt can help you amplify your machine learning and AI models by optimally tuning them using our black-box optimization platform.
Video: https://youtu.be/EjGrRxXWg8o
The SigOpt platform provides an ensemble of state-of-the-art Bayesian and Global optimization algorithms via a simple Software-as-a-Service API.
Kaggle Higgs Boson Machine Learning Challenge (Bernard Ong)
What it took to score in the top 2% on the Higgs Boson Machine Learning Challenge: a journey into advanced machine learning model ensembles and stacking methods.
Sample Codes: https://github.com/davegautam/dotnetconfsamplecodes
A presentation on how to get started with ML.NET. If you are an existing .NET stack developer and want to apply the same technology to machine learning, this deck focuses on how you can use ML.NET for machine learning.
Guiding through a typical Machine Learning Pipeline (Michael Gerke)
Many people are talking about AI and machine learning. Here's a quick guideline on how to manage ML projects and what to consider when implementing machine learning use cases.
Introduction to Mahout and Machine Learning (Varad Meru)
This presentation gives an introduction to Apache Mahout and machine learning. It presents some of the important machine learning algorithms implemented in Mahout. Machine learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
Multi-modal sources for predictive modeling using deep learning (Sanghamitra Deb)
Using vision-language models: Is it possible to prompt them the way we prompt LLMs? When should they be used out of the box, and when should they be pre-trained? Also covers general multi-modal deep learning models, machine learning metrics, feature engineering, and setting up an ML problem.
Scalable Automatic Machine Learning in H2O (Sri Ambati)
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science required to produce high-performing machine learning models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
How Machine Learning Helps Organizations to Work More Efficiently? (Tuan Yang)
Data is increasing day by day, and so is the cost of storing and handling it. By understanding the concepts of machine learning, however, one can handle this excess of data and process it affordably.
The process involves building models using various kinds of algorithms. If a model is created precisely for a given task, an organization has a much better chance of exploiting profitable opportunities and avoiding the risks lurking behind the scenes.
Learn more about:
» Understanding Machine Learning Objectives.
» Data dimensions in Machine Learning.
» Fundamentals of Algorithms and Mapping from Input/Output.
» Parametric and Non-parametric Machine Learning Algorithms.
» Supervised, Unsupervised and Semi-Supervised Learning.
» Estimating Over-fitting and Under-fitting.
» Use Cases.
Strata San Jose 2016: Scalable Ensemble Learning with H2O (Sri Ambati)
Erin LeDell's presentation on Scalable Ensemble Learning with H2O at Strata + Hadoop World San Jose, 03.29.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Building High Available and Scalable Machine Learning Applications (Yalçın Yenigün)
This deck gives high-level information on several machine learning algorithms, cross-validation, and feature extraction techniques, as well as high-level approaches for building highly available and scalable ML products.
In this presentation I review various data science techniques and discuss their usefulness to pricing actuaries working in general insurance.
This presentation was originally given at the TIGI webinar in 2020.
https://www.actuaries.org.uk/learn-develop/attend-event/tigi-2020-technical-issues-general-insurance
Advanced Hyperparameter Optimization for Deep Learning with MLflow (Databricks)
Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.
This is an introductory workshop on machine learning. It introduces machine learning tasks such as supervised learning, unsupervised learning, and reinforcement learning.
Smart like a Fox: How clever students trick dumb programming assignment asses... (Nane Kratzke)
This case study reports on two first-semester programming courses with more than 190 students. Both courses made use of automated assessments. We observed how students trick these systems by analysing the version history of suspect submissions. By analysing more than 3300 submissions, we revealed four astonishingly simple tricks (overfitting, evasion) and cheat-patterns (redirection and injection) that students used to trick automated programming assignment assessment systems (APAAS). Although not the main focus of this study, it discusses and proposes corresponding counter-measures where appropriate.
Nevertheless, the primary intent of this paper is to raise problem awareness and to identify and systematise observable problem patterns in a more formal approach. The identified immaturity of existing APAAS solutions might have implications for courses that rely deeply on automation, like MOOCs. We therefore conclude that APAAS solutions should be examined much more from a security point of view (code injection). Moreover, we identify the need to evolve existing unit testing frameworks into more evaluation-oriented teaching solutions that provide better trick and cheat detection capabilities and differentiated grading support.
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics, or programming. The mechanism works by letting end-users simply provide data while the system automatically does the rest, determining the approach to perform the particular ML task. At first this may sound discouraging to those aiming at the "sexiest job of the 21st century", the data scientists. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Credit scoring has been used to categorize customers based on various characteristics and evaluate their creditworthiness. Increasingly, machine learning techniques are being deployed for customer segmentation, classification, and scoring. In this talk, we will discuss various machine learning techniques that can be used for credit risk applications. Through a case study built in R, we will illustrate the nuances of working with practical data sets that include categorical and numerical data, different techniques for evaluating and exploring customer profiles, visualizing high-dimensional data sets, and machine learning techniques for customer segmentation.
Presentation - Predicting Online Purchases Using Conversion Prediction Modeling 8.19.2016
1. Rakesh Gupta¹, Chris Sneed¹, Vipul Tyagi¹
¹College of Computing and Technology, Lipscomb University, Nashville, TN, USA
Predicting Online Purchases Using Conversion Prediction Modeling
2. Executive Summary
• Homesite Group Inc. sponsored a Kaggle* competition to understand how it could better predict what price will entice its quote seekers to purchase a home insurance policy.
• The outcome of this research is important to the field of retail sales, with special importance to online sales.
• The benefit of this implementation for Homesite is more sales from its leads through effective product pricing.
• In this presentation, our team demonstrates the process we followed to create the model and our results in predicting the data.
*https://www.kaggle.com/c/homesite-quote-conversion
3. [Chart] U.S. Census Bureau News. Quarterly Retail E-Commerce Sales for 1st Quarter 2016. (May 2016).
4. [Concept map] Predicting Online Purchases – A Comparison of Machine Learning Approaches
Literature surveyed: sales lead articles, patents, and history; sales and lead cycle research; sales lead prioritization; sales lead conversion and lead conversion; dynamic pricing and sales pricing models; predictive models; and classification algorithms: Naïve Bayes, neural networks, binary logistic regression, AdaBoost, weighted kNN, gradient boosting, decision trees (CART, C5.0, CHAID), and support vector machines.
7. Data Source Analysis
• Data from Homesite was relatively clean to begin with.
• The dataset had 299 predictor variables and one target variable, the "QuoteConversion" flag, which takes the values 0 or 1.
• The data comprised a training set of 260K records and a test set of 173K records.
• During analysis, we removed the variable "QuoteDate" and the following variables, which the summary statistics below show to be near-constant or mostly missing:

Summary Statistics

Statistic      GeographicField10A  GeographicField10B  PersonalField84  PropertyField29  PropertyField6
Min            -1                  -1                  1                0                0
1st Quartile   -1                  25                  2                0                0
Median         -1                  25                  2                0                0
Mean           -1                  25                  1.99             0                0
3rd Quartile   -1                  25                  2                0                0
Max            -1                  25                  8                10               0
NAs            0                   0                   207020           334630           0
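As a rough sketch of how this screening could be reproduced in Python with pandas: the file name train.csv and the column name Original_Quote_Date are assumptions based on the Kaggle competition data (the slide calls the date field "QuoteDate").

```python
import pandas as pd

# Load the Homesite training data (file name assumed from the Kaggle competition).
train = pd.read_csv("train.csv")

# Summary statistics comparable to the table above, plus missing-value counts.
cols = ["GeographicField10A", "GeographicField10B",
        "PersonalField84", "PropertyField29", "PropertyField6"]
print(train[cols].describe())
print(train[cols].isna().sum())

# Drop the quote date and any column holding a single constant value.
train = train.drop(columns=["Original_Quote_Date"])  # the slide's "QuoteDate"
constant_cols = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]
train = train.drop(columns=constant_cols)
```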
9. Data Cleansing & Preparation
• Categorical variables converted to numeric (27 variables converted)
• 293 predictor variables in the full training set
• Multiple train/test split ratios: 90/10, 80/20, 67/33
• Randomized sampling
• Multiple iterations
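A minimal sketch of these preparation steps with scikit-learn, continuing from the loading sketch above. The target column name QuoteConversion_Flag comes from the Kaggle data (the slide abbreviates it as the "QuoteConversion" flag); filling missing values with -1 is our assumption, not stated on the slide.

```python
from sklearn.model_selection import train_test_split

# Encode the categorical variables (27 on the slide) as numeric codes.
for c in train.select_dtypes(include="object").columns:
    train[c] = train[c].astype("category").cat.codes

# Separate the predictors from the 0/1 target and fill missing values.
X = train.drop(columns=["QuoteConversion_Flag"]).fillna(-1)
y = train["QuoteConversion_Flag"]

# The three randomized split ratios from the slide: 90/10, 80/20, 67/33.
splits = {}
for test_size in (0.10, 0.20, 0.33):
    splits[test_size] = train_test_split(X, y, test_size=test_size,
                                         random_state=42)

# For example, the 80/20 split used in several of the runs below:
X_tr, X_te, y_tr, y_te = splits[0.20]
```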
11. Naïve Bayes*
• Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values.
• All naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
• The method of maximum likelihood is applied for parameter estimation in naive Bayes models.
• Despite the naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
• An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
• Our team used Gaussian Naïve Bayes as it is good for continuous data.
*Naïve Bayes classifier. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Naive_Bayes_classifier
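A minimal sketch of this step with scikit-learn, reusing the X_tr/X_te split from the preparation sketch above; GaussianNB is the Gaussian Naïve Bayes variant the slide names.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Gaussian Naïve Bayes: each feature is modeled as conditionally
# independent and normally distributed given the class.
nb = GaussianNB()
nb.fit(X_tr, y_tr)
print("Naive Bayes accuracy:", accuracy_score(y_te, nb.predict(X_te)))
```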
12. Logistic Regression*
• Binary logistic regression, as our target variable is 0 or 1
• Predicts probabilities of the dependent variable
*Logistic regression. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Logistic_regression
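A comparable sketch for binary logistic regression; the solver settings are our assumption, not stated on the slide.

```python
from sklearn.linear_model import LogisticRegression

# Binary logistic regression on the 0/1 conversion flag.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_tr, y_tr)

# predict_proba gives the probability of the dependent variable being 1.
conversion_proba = lr.predict_proba(X_te)[:, 1]
print("Logistic regression accuracy:", lr.score(X_te, y_te))
```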
13. kNN*
• An object is classified by a majority vote of its neighbors, assigning it to the class most common among its k nearest neighbors.
• The nearer neighbors contribute more to the average than the distant ones.
• Sensitive to the local structure of the data.
*k-nearest neighbors algorithm. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
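A sketch of a distance-weighted kNN classifier matching the description above; k=5 is an assumed default, as the slide does not state k.

```python
from sklearn.neighbors import KNeighborsClassifier

# weights="distance" makes nearer neighbors contribute more to the vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_tr, y_tr)
print("kNN accuracy:", knn.score(X_te, y_te))
```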
14. Boosting*
• Boosting is a general method for improving the accuracy of any given learning algorithm.
• Works by combining rough and less-than-accurate rules of thumb to produce a classifier with a low generalization error.
• Increases weights on incorrectly classified examples, forcing the base learner to focus its attention on them.
*Schapire, Robert E. and Freund, Yoav. Boosting: Foundations and Algorithms. Massachusetts Institute of Technology, Cambridge, MA. 2012.
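A sketch using scikit-learn's AdaBoost, which implements the re-weighting scheme described above. Ranking features by importance is one way to recover the kind of "top 6 variables" result the conclusion mentions; n_estimators is our assumption.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost increases the weight of misclassified examples each round,
# focusing the base learner's attention on them.
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_tr, y_tr)
print("Boosting accuracy:", ada.score(X_te, y_te))

# Rank predictors by importance; the deck found 6 variables carried most value.
importances = pd.Series(ada.feature_importances_, index=X_tr.columns)
print(importances.nlargest(6))
```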
15. Trials & Tribulations
• Mahout: neural networks? CSV vector format?
• RapidMiner: output of the model; learning curve
• SVM: complicated to fit the model
• Multicollinearity analysis: VIF functions; corrgrams*
*Package 'corrgram'. Retrieved from https://cran.r-project.org/web/packages/corrgram/corrgram.pdf
17. Results - Accuracy Matrices
"No models are perfect, but some are better than others…"

Technology, Split Ratio   Naïve Bayes  KNN     Logistic Regression  HS Test File (0's / 1's)
Python, 90/10             81%          78.34%  81.33%               168,422 / 5,414
Python, 80/20             -            78.47%  81.15%               165,859 / 7,977
Python, 67/33             -            78.64%  81.13%               165,870 / 7,966
R, 80/20                  71%          -       -                    -
R, 80/20                  -            -       -                    124,544 / 49,292
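The 0's/1's column shows predicted label counts on Homesite's 173K-record test file. As a sketch of how such counts could be produced: X_homesite_test is assumed to be the competition's test.csv records, preprocessed the same way as the training predictors above.

```python
import numpy as np

# Predict labels for Homesite's unlabeled test file and count each class.
test_pred = lr.predict(X_homesite_test)
zeros, ones = np.bincount(test_pred, minlength=2)
print(f"0's = {zeros}, 1's = {ones}")
```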
18. Conclusion & Discussion
• Boosting helped identify the 6 variables that provided the most value.
• We know we can predict a sale from a lead about 80% of the time given Homesite's data set.
• We reduced the number of predictor variables from 292 to 6, which allows Homesite to focus on these data points.
• Following the 80/20 Pareto principle: from these 6 predictors we get 80% of the benefit without wasting time on the other factors that don't carry as much weight.
• This is a simple, fast market strategy that will provide immediate benefits in terms of increased sales and revenue for Homesite.
19. Future Work
• Continue additional data cleaning to improve the accuracy of the model from 81% toward 97%.
• Investigate the use of the remaining classification models to see if we can achieve better results.
• Design and build a process to provide real-time prediction as new quotes are sent out by Homesite.
• Complete an ANOVA analysis to determine the strength of the logistic regression model.