2. 2
Why is Segmentation Difficult?
Infinite number of possible solutions
Hundreds of possible variables to use
Clearly defined clusters are rarely present in real-life datasets
3. 3
Technical Challenges
Challenge: Incorporating fundamentally
categorical variables
Ethnicity, Religion, Political Party, Etc.
Standard methods assume continuous data (ideal case) and
require at least interval-level data (e.g. rating scales)
Correlation, linear regression, k-means clustering
4. 4
Technical Solutions
Challenge: Incorporating fundamentally categorical
variables
Multiple Correspondence Analysis (factor analysis for categorical
data)
Pro: handles both demographic (categorical) and ratings variables
Would allow treating sets of variables separately (i.e. demographic,
behavioral, psychological) – these sets could be used as inputs to
clustering method
Con: segmentation would be based on extracted components
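As a rough illustration of what MCA does with categorical answers, the sketch below one-hot encodes the responses into an indicator matrix and extracts only the first MCA dimension by reciprocal averaging (a simple iterative route to correspondence analysis). The respondents, variable values, and function names are hypothetical, and a real analysis would normally use a dedicated MCA package and keep several dimensions.

```python
import random

def indicator_matrix(rows):
    """One-hot encode categorical answers: one column per (variable, category)."""
    labels = sorted({(q, value) for row in rows for q, value in enumerate(row)})
    column = {label: j for j, label in enumerate(labels)}
    Z = [[0] * len(labels) for _ in rows]
    for i, row in enumerate(rows):
        for q, value in enumerate(row):
            Z[i][column[(q, value)]] = 1
    return Z, labels

def first_mca_dimension(rows, n_iter=100, seed=0):
    """Respondent and category scores on the first MCA dimension."""
    Z, labels = indicator_matrix(rows)
    n, n_cols, n_vars = len(Z), len(labels), len(rows[0])
    col_mass = [sum(Z[i][j] for i in range(n)) for j in range(n_cols)]
    total_mass = sum(col_mass)
    rng = random.Random(seed)
    col_scores = [rng.random() for _ in range(n_cols)]  # arbitrary starting scores

    def score_rows(scores):
        # A respondent's score is the mean score of the categories they chose.
        return [sum(Z[i][j] * scores[j] for j in range(n_cols)) / n_vars
                for i in range(n)]

    for _ in range(n_iter):
        row_scores = score_rows(col_scores)
        # A category's score is the mean score of the respondents who chose it.
        col_scores = [sum(Z[i][j] * row_scores[i] for i in range(n)) / col_mass[j]
                      for j in range(n_cols)]
        # Mass-weighted centring and scaling removes the trivial constant solution.
        centre = sum(m * s for m, s in zip(col_mass, col_scores)) / total_mass
        col_scores = [s - centre for s in col_scores]
        scale = (sum(m * s * s for m, s in zip(col_mass, col_scores)) / total_mass) ** 0.5
        col_scores = [s / scale for s in col_scores]
    return score_rows(col_scores), dict(zip(labels, col_scores))

# Hypothetical respondents: (age group, answer to one agree/disagree item).
respondents = [("young", "agree")] * 3 + [("old", "disagree")] * 3 + [("young", "disagree")]
scores, categories = first_mca_dimension(respondents)
```

The respondent scores in `scores` are continuous, so they could be passed to a clustering method; this is also why the resulting segments describe extracted components rather than the original variables.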
5. 5
Technical Challenges
Determining the number of clusters/segments in the
data
Standard methods require the user to specify the number of clusters
to extract
Our standard practice results in fewer clusters than input variables
e.g. AMC segmentation solutions required ~12 variables to find ~5
segments
This ratio of features-to-segments will 'water down' the effect of the
individual variables (segments do not differ significantly on most items)
6. 6
Technical Solution 1
Challenge: Determining the number of clusters/segments
in the data
Solution: fit a probabilistic mixture model and compute a
complexity penalized likelihood (AIC / BIC scores)
The model with the best AIC / BIC score is our best guess for the
number of natural clusters in the data
Gaussian mixture models for continuous data
Latent Class Models for categorical data
Latent class models can handle both categorical and continuous data if the
continuous data is binned.
Both of the above return BIC scores to determine the number of
clusters
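The model-selection idea above can be sketched in plain Python: fit Gaussian mixtures with 1 to 4 components and compare their penalized likelihoods. This is a minimal one-dimensional EM written for illustration only; the simulated data, the quantile-based starting values, and the variance floor are all assumptions, and in practice a library implementation would be used.

```python
import math
import random

def fit_gmm_1d(data, k, n_iter=100, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM; return the log-likelihood
    computed in the final E-step."""
    n = len(data)
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # quantile start
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    log_lik = 0.0
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp, log_lik = [], 0.0
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            log_lik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor,
                               sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, data)) / nj)
    return log_lik

def aic_bic(log_lik, k, n):
    """AIC and BIC for a 1-D mixture: k means, k variances, k - 1 free weights."""
    n_params = 3 * k - 1
    return -2 * log_lik + 2 * n_params, -2 * log_lik + n_params * math.log(n)

# Simulated data with two well-separated groups (an assumption for the demo).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(6, 1) for _ in range(150)]

bic = {k: aic_bic(fit_gmm_1d(data, k), k, len(data))[1] for k in range(1, 5)}
best_k = min(bic, key=bic.get)  # lowest BIC = best guess at the number of clusters
```

Lower scores are better for both criteria; BIC penalizes extra components more heavily than AIC once the sample is larger than a handful of respondents.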
11. 11
Technical Solution 2
Challenge: Determining the number of clusters/segments
in the data
Solution: ensure there are fewer input variables than extracted
clusters
2(+) segments can be obtained from
a single variable.
That is a 2-to-1 ratio of segments-to-
variables
For AMC & MTV we got 5 segments
from ~12 variables. A ratio of about 0.4-to-1.
- That is less than 1 segment for
every two variables…
Also See: Van Buuren & Heiser (1989); Vichi & Kiers (2001); Hwang, Dillon, &
Takane (2006).
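The ratios quoted on this slide follow from simple division. The helper name below is hypothetical, and the inputs are the approximate counts given above.

```python
def segments_per_variable(n_segments, n_variables):
    """Ratio of extracted segments to input variables."""
    return n_segments / n_variables

single_variable_ratio = segments_per_variable(2, 1)   # 2 segments from 1 variable
amc_mtv_ratio = segments_per_variable(5, 12)          # 5 segments from ~12 variables

print(f"single variable: {single_variable_ratio:.1f}-to-1")
print(f"AMC & MTV:       {amc_mtv_ratio:.2f}-to-1")
```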
12. 12
Technical Challenges
Respondents vary in their use of
ratings scales
Some respondents only use part
of the scale,
Either top or bottom of range
Segmentation method will find the
high/low scale-use respondents
and define segments for them
See AMC segments,
14. 14
Technical Solution 1
Challenge: Respondents vary in their use of ratings
scales
Calibrate respondents to equate ratings scale across sample
Overcoming Scale Use Heterogeneity (2003) Peter E. Rossi
Pro: Improves the accuracy and validity of standard methods
E.g. correlation, regression, clustering
Con: requires complex and computationally expensive models
i.e. hierarchical Bayesian models – available as R package
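To make the calibration idea concrete, here is a much simpler stand-in: standardizing each respondent's ratings by their own mean and spread. This is not the hierarchical Bayesian model referred to above, only a crude approximation that removes each person's level and range of scale use (and, with them, any genuine differences in overall level). The two respondents are hypothetical.

```python
from statistics import mean, pstdev

def standardize_within_respondent(ratings):
    """Centre and scale one respondent's ratings by their own mean and spread."""
    centre = mean(ratings)
    spread = pstdev(ratings)
    if spread == 0:
        # Straight-liner: every answer identical, so no pattern to recover.
        return [0.0 for _ in ratings]
    return [(r - centre) / spread for r in ratings]

# Same response pattern, but one respondent uses the top of a 5-point scale
# and the other uses the bottom.
top_user = [5, 4, 5, 3, 4]
bottom_user = [3, 2, 3, 1, 2]

top_calibrated = standardize_within_respondent(top_user)
bottom_calibrated = standardize_within_respondent(bottom_user)
```

After standardization the two respondents have the same profile, so a clustering method would no longer split them purely on scale use.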
15. 15
Technical Solution 2
Challenge: Respondents vary in their use of ratings
scales
Abandon rating scales – use simple Agree/Disagree variables
Focus on methods for categorical variables
Multiple Correspondence Analysis (factor analysis for categorical data)
Pro: handles both demographic (categorical) and ratings variables
Would allow treating sets of variables separately (i.e. demographic, behavioral,
psychological) – these sets could be used as inputs to clustering methods
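For data that has already been collected on a rating scale, one way to approximate Agree/Disagree variables is a top-box recode. The sketch below assumes a 5-point scale with 4 and 5 counted as "Agree"; the cutoff and the item names are hypothetical choices, not part of the approach described above.

```python
def to_agree_disagree(rating, cutoff=4):
    """Collapse a rating into a two-category answer (cutoff is an assumption)."""
    return "Agree" if rating >= cutoff else "Disagree"

# Hypothetical 5-point ratings from four respondents on two items.
ratings = {
    "likes_new_gadgets": [5, 2, 4, 3],
    "reads_tech_news": [1, 4, 5, 2],
}

recoded = {item: [to_agree_disagree(r) for r in values]
           for item, values in ratings.items()}
```

The recoded items are categorical, so they can sit alongside demographic variables as inputs to a method such as MCA.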
16. 16
What Slows Us Down?
Each segmentation iteration consumes resources
Producing new segmentation variable for each respondent
.5 man hour
Producing new banners
Generating tables - .25 hours
Formatting and printing – 1+ man hours
Analyzing full banner for new segmentation
Requires entire research team, 6+ man hours
17. 17
How to Speed it up
Producing new segmentation variable for each respondent
.5 man hour – Not the bottleneck
Producing new banners
Generating tables - .25 hours – Not the bottleneck
Formatting and printing – 1+ man hours – Potential for Automation
Analyzing full banner for the new segmentation
Requires entire research team, 6+ man hours – workflow bottleneck
Ideas / brainstorm
Success criteria are often vague
When the goal is well defined, quant methods can increase efficiency
If you can formalize it, you can solve it
Time invested in the planning phase will reap productivity gains during analysis
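The time estimates listed above can be tallied to show how much of each iteration the banner analysis takes. The figures marked "+" on the slide are treated here as lower bounds, so the resulting share is only indicative; the step labels are shorthand.

```python
# Hours per segmentation iteration, taken from the estimates above
# ("1+" and "6+" are entered as their lower bounds).
hours = {
    "segmentation variable": 0.5,
    "generating tables": 0.25,
    "formatting and printing": 1.0,
    "banner analysis": 6.0,
}

total_hours = sum(hours.values())
bottleneck = max(hours, key=hours.get)
bottleneck_share = hours[bottleneck] / total_hours

print(f"total per iteration: at least {total_hours} hours")
print(f"bottleneck: {bottleneck} ({bottleneck_share:.0%} of the total)")
```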
18. 18
Hypothetical Case Study
Goals Brainstorm:
Client and previous research says:
“segmentation should differentiate enthusiasts (early adopters) and utility
consumers (late adopters)”
“Also, segmentation should include demographics that are known to influence
technology adoption.”
Age, Gender, Income, Education
Quant answers:
“OK, let's write a battery of questions addressing consumers' perceptions of and
relation to technology products – this will be distilled into a single 'tech
enthusiasm' measure.”
“Also, all relevant demographic information can be reduced into one (or more)
demo factors.”
“Segments will be defined from a 'reduced dimensionality' representation of the
data (MCA)”
23. 23
MCA for Segmentation
(2006). An extension of multiple correspondence analysis for identifying
heterogeneous subgroups of respondents
(2010). Traveler segmentation strategy with nominal variables through
correspondence analysis
(2010). Fuzzy cluster multiple correspondence analysis
(2010). Simultaneous two-way clustering of multiple correspondence
analysis
(2005). A simultaneous approach to constrained multiple correspondence
analysis and cluster analysis for market segmentation
(2002). Analysis of categorical marketing data by generalized constrained
multiple correspondence analysis
24. 24
Further Directions
Extension to Multiple Correspondence Analysis
Methods that let us combine nominal, numeric, and ordinal
variables
Methods that let us group variables into sets.
E.g. could ensure that psychographic, behavioral and demographic variables
have an equal influence on the final solution.
Methods that simultaneously perform dimensionality
reduction and cluster discovery
Optimizes the entire analysis to discover the most distinctive
clusters
Very promising approach
Con: I have not found an implementation of these methods.