VSCO → CONFIDENTIAL → DO NOT DISTRIBUTE
Data Update - 01/27/2016
vsco.co/blevishkin
Data Update - 03/17/17
vsco.co/prazakj
06 APR 2017
RUBEN KOGEL ( VSCO )
RUBEN@VSCO.CO
@CHILICONDATA
Data-based User Segmentation
What is VSCO?
→ tools for expression
→ a community to share, learn, and discover
vsco.co/mikelyon
Who are VSCO users?
→ 41M monthly audience
→ 12B images served monthly
→ 70% of daily audience create
→ 73% under 25
→ 76% female
→ 81% international
• North America (22%)
• Southeast Asia (20%)
• China (16%)
• Europe (14%)
vsco.co/curtsaunders
Data Stack at VSCO as of 3/31
[Diagram. Components shown: Segment; iOS client; Android client; App Store; Prod Database; VSCO Web; 3rd party (AppAnnie..); SQL client; Mixpanel SDK. Audiences shown: Product, Design, Engineering, Finance, Analysts, Content, Investors, Leadership (everyone). Annotation: periodic delete.]
New Data Stack at VSCO (in-progress)
[Diagram. Components shown: Cantor; iOS client; Android client; App Store; Prod Database; VSCO Web; 3rd party (AppAnnie..); SQL client (Presto/Spark); Mixpanel SDK; Kafka. Uses shown: Events Analytics, Deep Analysis, Data Exploration, Dashboarding.]
Why segment?
→ marketing: who should we target?
→ design: what usage do we design for?
→ product / officers: how do we grow usage?
vsco.co/evanhundelt
Why segment?
vsco.co/evanhundelt
            method       insights    goal
Marketers   interviews   persona     define target audience
Designers   interviews   intention   design intuitive UI
Analysts    data         behavior    track usage, conversion
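the theory
[Figure: two illustrative scatter plots. First: meat consumption vs. dairy consumption, with clusters labeled paleo, German diet, French diet, and mediterranean diet. Second: usage frequency vs. miles driven, with clusters labeled commuters, taxi drivers, weekenders, and greenies.]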
the practice
where do you draw the line??
[Figure: three histograms, for editing usage, sessions, and publishing usage. x-axis: number of actions (0 to 100); y-axis: number of people (in thousands).]
step 1: choose the right inputs
→ k-means finds the dimensions with the most separation and uses that information to form “clusters”
• each additional dimension will change the output - but does it add information?
→ eliminate unnecessary input variables
• use intuition and data exploration
→ segment only on the things that matter:
• age on the platform
• sum of past behavior
• current behavior - what we want to model
→ this is an iterative process: re-do this step after running the clustering algorithm
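One way to probe whether an extra dimension adds information is to check how strongly it correlates with an input that is already in the model. The sketch below is only an illustration of that idea: the feature names, the sample values, and the 0.95 cutoff are all hypothetical, and correlation is just one possible redundancy heuristic, not necessarily what was used for this analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(features, cutoff=0.95):
    """Return pairs of feature names whose |correlation| reaches the cutoff."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= cutoff:
                pairs.append((a, b))
    return pairs


# Hypothetical per-user counts, one list entry per user.
features = {
    "edits_last_7d": [1, 4, 0, 12, 30, 2],
    "edits_last_30d": [3, 12, 0, 36, 90, 6],  # moves in lockstep with the 7-day count
    "publishes_last_30d": [0, 5, 1, 0, 2, 9],
}

print(redundant_pairs(features))
```

A flagged pair is a candidate for dropping one of the two inputs; whether to drop it is still a judgment call based on intuition and data exploration.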
step 2: log transform (almost) everything
[Figure: two histograms, one on the raw scale (x-axis 0 to 100) and one on the log scale (x-axis 0 to 4)]
→ otherwise your model assumes the gap between people editing 1 and 2 photos counts the same as the gap between people editing 101 and 102 photos
→ log transform so that the gap between small numbers of actions gets blown up and the gap between large numbers gets shrunk
• log(2) - log(1) = 0.69
• log(102) - log(101) = 0.01
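The arithmetic above uses the natural logarithm and can be reproduced in a few lines. The `log_transform` helper is an assumption added for illustration: raw counts often include zeros, where log(0) is undefined, so this sketch uses log(1 + x); the slides do not say how zeros were handled.

```python
import math


def log_gap(a, b):
    """Gap between two action counts after a natural-log transform."""
    return math.log(b) - math.log(a)


small_gap = log_gap(1, 2)      # editing 1 vs. 2 photos
large_gap = log_gap(101, 102)  # editing 101 vs. 102 photos


def log_transform(counts):
    """Assumed variant: log(1 + x), so that a count of zero maps to zero."""
    return [math.log1p(c) for c in counts]


print(round(small_gap, 2), round(large_gap, 2))
```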
step 3: choose the number of clusters that make sense
balance:
→ sparseness
→ interpretability
• does it match intuition?
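A common aid for this choice is to run the clustering for several values of k and watch how the within-cluster sum of squares (inertia) falls. The sketch below is a minimal, dependency-free k-means on made-up 2-D points; the data, the deterministic initialization, and the range of k are all assumptions for illustration, and the numbers only inform the decision. Interpretability still has to be judged by a person.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centers, inertia)."""
    ordered = sorted(points)
    # Deterministic start: k points spread evenly through the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia


# Hypothetical (already log-transformed) inputs forming three loose groups.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (2.0, 2.1), (2.2, 1.9), (1.9, 2.0),
    (4.0, 0.1), (4.1, 0.3), (3.9, 0.0),
]

inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia drops sharply up to k = 3 and only marginally afterwards, which is the kind of "elbow" that suggests a candidate number of clusters to check against intuition.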
step 4: deliver the insights in an intuitive way
cluster        1      2      3      4      5      6
dimension    0.0    0.0    0.0    0.9    2.8    0.5
dimension    0.0    0.0    0.0    0.6    1.9    0.3
dimension    0.0    0.0    0.0    0.5    1.5    0.3
dimension    0.2    0.1    0.1    8.5   18.4    2.5
dimension    0.2    0.1    0.1    3.1    3.9    1.4
dimension    0.3    4.8   27.1    2.1   20.5   22.7
dimension    0.3    2.5    7.6    1.3    7.7    6.9
dimension    0.3    1.9    3.3    1.1    3.4    3.3
dimension    0.2    3.6   21.4    0.3    3.4    7.3
dimension    0.1    0.2    0.1    2.7   13.0   10.5
dimension    0.1    0.1    0.1    1.6    6.5    4.1
dimension    0.1    0.1    0.1    1.3    3.2    2.5
dimension    0.0    0.0    0.0    0.5    6.4    0.1
dimension    0.0    0.0    0.0    0.4    4.2    0.1
dimension    0.0    0.0    0.0    0.4    2.5    0.1
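If the clustering ran on log-transformed inputs, the cluster centers come out in log units, which are hard for a non-analyst audience to read. The sketch below converts centers back to action counts, assuming a log(1 + x) transform was applied; that transform and the example values are assumptions, and the slides do not state how the table above was produced.

```python
import math


def back_transform(log_center, digits=1):
    """Convert a center from log(1 + x) units back to counts, rounded for display."""
    return [round(math.expm1(v), digits) for v in log_center]


# Hypothetical cluster center in log(1 + x) units, one value per input dimension.
log_center = [math.log1p(18.4), math.log1p(3.9), math.log1p(0.0)]

print(back_transform(log_center))
```

Note that a mean taken in log space and then back-transformed is not the arithmetic mean of the raw counts, so it is worth labeling such a table clearly when presenting it.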
step 5: use programmatic rules to track cohorts
→ what happens if we re-compute the clusters every month?
• the algorithm will find different cohorts with different centers with every new dataset
• a user who was classified as a “super editor” one month might, with the same behavior, be classified as a “casual editor” the next month
→ instead, deduce the boundaries between the different groups from the initial cluster analysis and design programmatic rules to classify users on an on-going basis
• more stable
• easier to explain
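Fixed rules of this kind can be as simple as a few threshold checks. In the sketch below, the segment names “super editor” and “casual editor” come from the example above, while the “non-editor” label, the input name, and the threshold of 20 edits are hypothetical placeholders; in practice the boundaries would be read off the initial cluster analysis.

```python
# Hypothetical boundary, standing in for one derived from the cluster analysis.
SUPER_EDITOR_MIN_EDITS = 20


def classify(edits_last_30d):
    """Assign a segment from fixed rules, independent of any other user's data."""
    if edits_last_30d >= SUPER_EDITOR_MIN_EDITS:
        return "super editor"
    if edits_last_30d >= 1:
        return "casual editor"
    return "non-editor"


# The same behavior gets the same label no matter which month it is scored in.
print(classify(25), classify(3), classify(0))
```

Because the rule depends only on the user's own behavior, a label cannot change just because the rest of the population shifted between runs.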
step 6: deliver dashboard or on-going classification
[Figures: segmentation over time; source of the “green” segment in each month]
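A "source of a segment" view can be computed by comparing each user's label in consecutive months. The sketch below assumes monthly label snapshots are available as dictionaries keyed by user id; the user ids, segment names, month variables, and the “new user” bucket are all hypothetical.

```python
from collections import Counter


def segment_sources(previous, current, segment):
    """Count where this month's members of `segment` were classified last month."""
    sources = Counter()
    for user, label in current.items():
        if label == segment:
            # Users with no label last month are treated as new (an assumption).
            sources[previous.get(user, "new user")] += 1
    return sources


# Hypothetical monthly snapshots: user id -> segment label.
march = {"u1": "casual editor", "u2": "super editor", "u3": "casual editor", "u4": "non-editor"}
april = {"u1": "super editor", "u2": "super editor", "u3": "casual editor", "u5": "super editor"}

print(segment_sources(march, april, "super editor"))
```

Running this once per month and stacking the counts gives the data behind a chart of where a segment's members came from over time.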
Summary
→ marketers, designers, and analysts use different but complementary segmentation approaches
→ data-based segmentation is useful to track usage; it should be based on behavioral data only
→ most usage data is exponentially distributed, so we need algorithms to identify cluster boundaries
6 steps to doing a clustering analysis
1. choose the right inputs
2. log transform (almost) everything
3. choose the number of clusters that make sense
4. deliver the insights in an intuitive way
5. use programmatic rules to track cohorts
6. deliver dashboard or on-going classification
vsco.co/sannalinn
Questions?

Data based segmentation @ vsco