1. UNSUPERVISED AND SUPERVISED MACHINE LEARNING IN MARKETING WITH R
Application to Typing Tools
By Bryan Butler
2. GOAL: DEVELOP A DYNAMIC SEGMENTATION TOOL
• Develop a segmentation engine that takes in customer survey responses and segments them according to their needs
• Segmentation is a very common market research revenue driver
• A critical aspect of segmentation analysis is validation and reproducibility of the model
• Do the segments hold up over time?
• Behavioral/psychographic segmentation can be blended with traditional demographic or other segmentations for a finer approach
• Provides a multi-dimensional approach to explain WHY a segment acts in certain ways
• In this data set, the survey is designed to reveal a series of attributes that help explain why a customer chooses or does not choose to use a company’s products and services
• The tool rates a company on customer connection attributes
3. PARAMETERS AND CONSTRAINTS
• Client Specification: Final tool must be built into Excel
• Common client request, but with significant impact on the choice of model, process, tools, etc.
• Sample size
• ~900 respondents may not be enough for a larger number of segments
• Fit 3-4 segments
• Requires unsupervised learning as first step
• There is no dependent or outcome variable already in the dataset
• Supervised learning to predict clusters
• Dimensionality reduction is an important part of the process
• Questions to consider:
• How much error is acceptable to the end user?
• Are there penalties for false positives and false negatives?
4. PROJECT ROADMAP
• Design survey, collect a sufficiently large dataset
• Hierarchical Clustering: find clusters using unsupervised learning
• Create dummy variables for each segment
• Multinomial modeling assumptions not likely to hold
• Supervised learning with GLMNET
• Reduce dimensionality
• Fit reduced logistic regressions to each segment
• Employs a “voting” method to choose segment
• Easily embedded in Excel
• Can see the exact drivers of each segment
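The "create dummy variables for each segment" step can be sketched as a one-vs-rest encoding: each cluster label becomes its own 0/1 outcome column, so a separate binary model can be fit per segment. This is a minimal Python illustration rather than the author's R code; the function name and the six example labels are hypothetical.

```python
def make_dummies(labels):
    """Return {segment: [0/1, ...]} with one binary column per segment."""
    segments = sorted(set(labels))
    return {seg: [1 if lab == seg else 0 for lab in labels] for seg in segments}

# Hypothetical cluster assignments for six respondents (illustrative only).
cluster_labels = [1, 3, 3, 2, 1, 3]
dummies = make_dummies(cluster_labels)

for segment, column in dummies.items():
    print(f"Segment {segment}: {column}")
```

Each respondent has a 1 in exactly one column, which is what lets the three binary regressions be trained independently.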
5. BEST CLUSTERS OF 3 OR 4
The bend (elbow) in the plot indicates the number of segments
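One common way to locate such a bend numerically is to look for the largest second difference in the curve. The sketch below assumes the plot is a within-cluster sum-of-squares curve, which the slide does not state; the values are hypothetical and the function name is illustrative.

```python
def elbow_k(wss):
    """wss[i] is the within-cluster sum of squares for k = i + 1 clusters.

    Returns the k whose point has the largest second difference,
    i.e. where the curve flattens most sharply.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wss) - 1):
        bend = (wss[i - 1] - wss[i]) - (wss[i] - wss[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical curve for k = 1..6 (not taken from the survey data).
hypothetical_wss = [1000.0, 600.0, 350.0, 300.0, 270.0, 250.0]
print(elbow_k(hypothetical_wss))
```

In practice the bend is usually read off the plot by eye, and a numeric rule like this is only a sanity check.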
6. DENDROGRAM OF 3 CLUSTERS
Small Segment – Difficult to Predict
7. OVERLAY CLUSTER ANALYSIS TO EXISTING SEGMENTS
Psychographic segmentation consists of three groups vs 8 stated segments
Reinforces the selection of 3 segments over 4
Responses to the questions when compared to the segments are shown below:
Stated segment                   1   2   3   4   5   6   7   8
Care Organization               25  48   2   2  29   2   0   6
Convenience Store/Reseller       4  15   0   0  13   0   0   1
Foodservice/Restaurant          13  24   1   5  23   2   0   7
Large Family                    35  56   1   2  48   1   1   1
Neighborhood Family             20  51   1   2  20   2   0   6
New Mom                         32  57   2   7  25   2   1   1
Professional Services Business  30  69   0   1  38   2   0   6
Social Couple                   39  76   2   2  31   4   0   6
8. SUPERVISED LEARNING - GLMNET
• Choose GLMNET model for high performance
• Expect to find the upper bound of accuracy
• Easier to interpret than RF, GBM
• Create dummy variables for each segment
• End result will be three binary/logistic regressions; one for each segment
• Use the probability output rather than classification to allow for “voting”
• Ex. Prob(Segment1) = .21, Prob(Segment2) = .55, Prob(Segment3) = .90
• Respondent assigned to Segment3
• Split data into training and testing sets
• Use a 70/30 split
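The voting step and the 70/30 split can be sketched as follows. This is an illustrative Python version, not the author's R code: `vote` picks the segment whose binary model reports the highest probability, using the example values from the bullets above, and `train_test_split` is a hypothetical helper. The 900-row input is an assumption based on the "~900 respondents" figure.

```python
import random

def vote(probabilities):
    """Assign the segment whose one-vs-rest model gives the highest probability."""
    return max(probabilities, key=probabilities.get)

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and cut them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Probabilities from the slide's example; they need not sum to 1
# because each comes from a separate binary model.
example = {"Segment1": 0.21, "Segment2": 0.55, "Segment3": 0.90}
assigned = vote(example)
print(assigned)

# Assumed sample of 900 respondent ids.
train, test = train_test_split(range(900))
print(len(train), len(test))
```

Under that assumption the testing set holds 270 respondents, which matches the totals of the confusion matrices reported on the later slides.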
9. GLMNET PERFORMANCE ON SEGMENT 1 – VERY HIGH
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 164   1
         1   1 104
Accuracy : 0.9926
95% CI : (0.9735, 0.9991)
No Information Rate : 0.6111
P-Value [Acc > NIR] : <2e-16
Segments 2 and 3 had accuracies of 94% and 80%, respectively
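The accuracy and no-information rate for Segment 1 can be reproduced directly from the four cells of the confusion matrix. The sketch below is a Python illustration of that arithmetic (the function names are my own); rows are predictions and columns are reference classes, as in the output above.

```python
def accuracy(matrix):
    """Share of cases on the diagonal (prediction equals reference)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def no_information_rate(matrix):
    """Accuracy obtained by always guessing the largest reference class."""
    column_totals = [sum(col) for col in zip(*matrix)]
    return max(column_totals) / sum(column_totals)

# Segment 1 GLMNET matrix from the slide: rows = prediction, columns = reference.
segment1_glmnet = [[164, 1],
                   [1, 104]]

print(round(accuracy(segment1_glmnet), 4))             # 0.9926
print(round(no_information_rate(segment1_glmnet), 4))  # 0.6111
```

Comparing accuracy against the no-information rate shows how much the model adds over simply predicting the majority class.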
11. MODEL REDUCTION – LOGISTIC REGRESSION FOR SEGMENT 1
• Dimensionality reduction distilled the model for Segment 1 to four questions
• Q16: Appreciates my loyalty
• Q22: I feel proud
• Q25: Sense of belonging
• Q13: Use own products/services
• Segment is focused on the values provided by emotional validation and its associated benefits
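A reduced logistic regression of this kind scores a respondent with a single formula, which is what makes it easy to embed in a spreadsheet. The sketch below shows the form of that calculation in Python. The intercept and weights are hypothetical placeholders (the fitted coefficients are not given in the slides), and numeric answers on a rating scale are assumed.

```python
import math

# Hypothetical coefficients for illustration only; not the fitted model.
HYPOTHETICAL_INTERCEPT = -6.0
HYPOTHETICAL_WEIGHTS = {"Q16": 0.5, "Q22": 0.4, "Q25": 0.3, "Q13": 0.3}

def segment1_probability(answers):
    """Logistic score: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = HYPOTHETICAL_INTERCEPT + sum(
        weight * answers[question]
        for question, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

# Assumed answers on a numeric rating scale.
low = segment1_probability({"Q16": 1, "Q22": 1, "Q25": 1, "Q13": 1})
high = segment1_probability({"Q16": 5, "Q22": 5, "Q25": 5, "Q13": 5})
print(low < high)
```

In a worksheet the same calculation could be written with the built-in EXP function, for example =1/(1+EXP(-z)), where z is the cell holding the weighted sum.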
12. LOGISTIC REGRESSION PERFORMANCE ON SEGMENT 1
           Reference
Prediction Other Seg1
Other        159   11
Seg1           6   94
Accuracy : 0.937
95% CI : (0.9011, 0.9629)
No Information Rate : 0.6111
P-Value [Acc > NIR] : <2e-16
13. SEGMENT 2 VARIABLE IMPORTANCE
Variable importance plot; annotations: Best Predictors of the Segment, Characteristics of Other Segments
14. SEGMENT 2 – LOGISTIC REGRESSION MODEL PERFORMANCE
Model uses reduced set of questions: Q10, Q3, Q23, Q9
Focus of questions is customer service
           Reference
Prediction Other Seg2
Other        238    8
Seg2           7   17
Accuracy : 0.9444
95% CI : (0.91, 0.9686)
No Information Rate : 0.9074
P-Value [Acc > NIR] : 0.01785
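Because Segment 2 is small, overall accuracy sits close to the no-information rate, so sensitivity and specificity give a clearer picture of how well the segment itself is recovered. The sketch below computes both from the matrix above; it is a Python illustration of the arithmetic with a hypothetical function name.

```python
def sensitivity_specificity(matrix):
    """matrix rows = prediction, columns = reference, ordered [Other, Segment]."""
    true_neg, false_neg = matrix[0]   # predicted Other
    false_pos, true_pos = matrix[1]   # predicted Segment
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Segment 2 logistic regression matrix from the slide.
segment2 = [[238, 8],
            [7, 17]]

sens, spec = sensitivity_specificity(segment2)
print(round(sens, 4), round(spec, 4))  # 0.68 0.9714
```

Only 17 of the 25 true Segment 2 respondents in the testing set are identified, which is consistent with the earlier note that the small segment is difficult to predict.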
16. SEGMENT 3 – LOGISTIC REGRESSION MODEL PERFORMANCE
Model uses reduced set of questions: Q16, Q23
Focus of questions is value
           Reference
Prediction Other Seg3
Other        103   31
Seg3          27  109
Accuracy : 0.7852
95% CI : (0.7313, 0.8327)
No Information Rate : 0.5185
P-Value [Acc > NIR] : <2e-16
17. DEVELOPING THE FINAL ENGINE
• GLMNET was used to find the highest performing model and also reduce the dimensionality of the survey to a focused set of questions
• Reduced survey for tool from 17 to 9 questions
• Logistic regressions were fit from the GLMNET output based on variable importance
• Generally performed with good accuracy, but lower than GLMNET
• Performance evaluated with cross-validation (CV), ROC curves, and confusion matrices
• One model was developed for each segment
• Final assignment made based on a voting approach
• Final test made across all survey respondents
• Smallest segment had most error as expected
• Overall model accuracy was 85%; acceptable to the client
• No penalties for misclassification