There are different ways to segment users. Marketers and user researchers typically rely on interviews or surveys and segment users based on attitudes or intentions. By contrast, data analysts can use behavioral data to generate segments in a semi-automatic fashion, with few assumptions about a user's needs or intentions. This approach is particularly useful for understanding product usage, tracking user growth, and informing product decisions. But blindly applying clustering methods can produce woefully bad results. We will walk through a real-world segmentation exercise run at VSCO to illustrate how to apply clustering methods properly and avoid analysis pitfalls.
Data-based user segmentation: a practical guide for data analysts
1. VSCO→CONFIDENTIAL→DONOTDISTRIBUTE
07 DEC 2017
RUBEN KOGEL ( VSCO )
RUBEN@VSCO.CO
@CHILICONDATA on Twitter
Data-based User Segmentation
5. the practice
where do you draw the line?
[Figure: three histograms titled "sessions", "editing usage", and "publishing usage"; x-axis: number of actions (0 to 100); y-axis: number of people (in thousands)]
6. step 1: choose the right inputs
→ k-means finds the dimensions with the most separation and uses that information to form “clusters”
• each additional dimension will change the output, but does it add information?
→ eliminate unnecessary input variables
• use intuition and data exploration
→ segment only on the things that matter:
• age on the platform
• sum of past behavior
• current behavior (what we want to model)
→ this is an iterative process: redo this step after running the clustering algorithm
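One possible way to do the data exploration for step 1 is to check whether two candidate inputs carry nearly the same information. The sketch below flags highly correlated pairs so that one of each pair can be dropped. The column names, the toy values, and the 0.95 threshold are all hypothetical and are not taken from the VSCO analysis; a correlation check is only one of several reasonable screens.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation exceeds threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical toy data: one value per user for each candidate input.
candidate_inputs = {
    "days_on_platform": [3, 40, 200, 365, 12, 90],          # age on the platform
    "photos_edited_total": [1, 300, 50, 120, 4, 900],       # sum of past behavior
    "photos_edited_last_28d": [1, 5, 30, 2, 4, 0],          # current behavior
    "edit_sessions_last_28d": [2, 10, 60, 4, 8, 0],         # exactly 2x the line above
}

# Pairs listed here add a dimension without adding much information.
flagged = redundant_pairs(candidate_inputs)
```

In this toy data only the two "last 28 days" columns are flagged, so keeping one of them would be enough. Intuition about the product still has to decide which one to keep.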
7. step 2: log transform (almost) everything
[Figure: two histograms; the first has an x-axis from 0 to 100, the second an x-axis from 0 to 4]
→ otherwise your model assumes the gap between people editing 1 and 2 photos counts the same as the gap between people editing 101 and 102 photos
→ log transform so that the gap between small counts gets blown up and the gap between large counts gets shrunk
• log(2) - log(1) = 0.69
• log(102) - log(101) = 0.01
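The two differences above can be checked in a few lines of Python using the natural log. The `log_transform` helper is an addition for illustration: it uses log(1 + x) so that users with zero actions stay defined, which is a common convention and an assumption here, not something stated in the slides.

```python
import math

def log_gap(a, b):
    """Difference between two action counts on the natural-log scale."""
    return math.log(b) - math.log(a)

small_gap = log_gap(1, 2)      # gap between editing 1 and 2 photos
large_gap = log_gap(101, 102)  # gap between editing 101 and 102 photos

print(f"{small_gap:.2f}")  # 0.69
print(f"{large_gap:.2f}")  # 0.01

def log_transform(counts):
    """Apply log(1 + x) to each count (assumed convention to handle zeros)."""
    return [math.log1p(c) for c in counts]
```

On the raw scale both gaps equal 1, so a distance-based method such as k-means would treat them as identical. After the transform the first gap is roughly 70 times larger than the second.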
10. step 5: use programmatic rules to track segments
→ what happens if we re-compute the clusters every month?
• k-means will define different-looking clusters for every different dataset
• a user classified “super editor” one period might be classified “casual editor” the next period with the exact same behavior
→ instead, infer the segment boundaries from the cluster analysis and use these fixed boundaries to classify users on an ongoing basis
• more stable
• easier to explain
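A minimal sketch of this idea for a single input (number of edits), assuming the cluster centers come from a one-time k-means run on log(1 + x) transformed counts. The center values and the "regular editor" label are hypothetical. In one dimension, k-means assigns each point to its nearest center, so the boundary between two adjacent clusters is the midpoint of their centers on the transformed scale.

```python
import bisect
import math

# Hypothetical centers on the log(1 + x) scale, taken from a one-time clustering run.
CLUSTER_CENTERS_LOG = [math.log1p(2), math.log1p(20), math.log1p(150)]
# "casual editor" and "super editor" appear in the slides; "regular editor" is made up.
SEGMENT_NAMES = ["casual editor", "regular editor", "super editor"]

def boundaries_from_centers(centers_log):
    """Midpoints between adjacent centers, converted back to raw action counts."""
    ordered = sorted(centers_log)
    return [math.expm1((lo + hi) / 2) for lo, hi in zip(ordered, ordered[1:])]

# Computed once and then frozen; not re-derived every month.
BOUNDARIES = boundaries_from_centers(CLUSTER_CENTERS_LOG)

def classify(edit_count):
    """Assign a segment using the frozen boundaries."""
    return SEGMENT_NAMES[bisect.bisect_right(BOUNDARIES, edit_count)]

for count in (1, 30, 200):
    print(count, classify(count))
```

Because `BOUNDARIES` never changes between periods, a user with the same behavior always lands in the same segment, and each rule can be stated as a plain threshold on the number of edits.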
12. Summary
→ marketers, designers, and analysts use different but complementary segmentation approaches
→ data-based segmentation is useful to track usage; should be based on behavioral data only
→ most usage data is exponential, so it needs a log transform and machine learning algorithms to identify cluster boundaries
6 steps to doing a clustering analysis
1. choose the right inputs
2. log transform (almost) everything
3. choose the number of clusters that make sense
4. deliver the insights in an intuitive way
5. use programmatic rules to track segments
6. deliver a dashboard or ongoing classification