Recommendations and User Understanding at StumbleUpon Chief Data Scientist Summit

Recommendations and User Understanding
at StumbleUpon

Chief Data Scientist Summit, San Diego, February 2013

Debora Donato
Principal Data Scientist

Slides courtesy of
Vishal Vaingankar, Tim Abraham, Roberto Sanabria, Ulas Bardak

StumbleUpon’s Mission

Help users find content they did not expect to find
Be the best way to discover new
and interesting things from across
the Web.

How StumbleUpon works
1. Register
2. Tell us your interests
3. Start Stumbling and rating web pages

We use your interests and behavior to
recommend new content for you!

• Single item type
• No serendipity
• << 100K items
• Many at a time
• < 250 categories
• Not personalized*
• Hand-labeled
• Repeats
• ~27M users

• +100M items
• >600 recs/mo.
• Auto features
• ~200 methods

• Mostly about presentation
• Hand-labeled
• Item-item similarity based methods
• Social recs only
• 10 million recs/month

Data-driven culture

Data science

Applied
Analytics
Research

15% of the total work force

Extensive A/B Testing

A/B tests on metrics such as session length, retention, and rating behavior

Outline of the talk
• The recommendation pipeline

• Showcases:
– Mobile optimization
– Power User Understanding
– Lists

Discovery is very different from search

Discovery at StumbleUpon | Search
Serendipitous | Intent driven
One at a time | List of articles
Never repeats | Always repeats
Constantly adapting | Fixed results
Tailored for you | Impersonal

There is an ongoing shift from search to discovery

StumbleUpon Overview
[Diagram with numbered stages 1 to 3. Labels: Users, Automated Discovery Feeds, Ingestion Pipeline, Sampling ("Pass?" / "Yes"), URL Index, Rec Engine]

Grow User’s Interest Graph:
Implicit + Explicit

[Figure: the user's interest graph. Node labels: User, Experts, Friends, Likeminded Users, News, Trending, Food/Cooking, Italian Recipes, Cars, Vintage Cars, nasa.gov, 1x.com]

Changing Ecosystem

[Figure: percent of total stumbles (0% to 100%) by source (mobile vs. desktop), plotted by date from 2011-01 to 2013-01]

Webpages on Desktop Vs. Mobile

Finding mobile optimized content

Content Features
• HTML tags
• #links
• #images
• #videos

P(URL_good | {f1, f2, …}) = ?

User Feedback

P(URL_good | {f1, f2, …}) = ?

User Feedback signals to determine mobile
optimization

A URL is skipped when timespent <= skip_threshold

Skip_rate = # skips / # stumbles

[Figure: CDF of thumbed-up stumbles vs. time (sec), with the skip threshold (secs) and the axis value 0.05 marked]
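The skip-rate definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Stumble` record, the `skip_rates` helper, the example URLs, and the 3-second threshold are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Stumble:
    url: str
    time_spent: float  # seconds the user stayed on the page (assumed unit)


def skip_rates(stumbles, skip_threshold):
    """Return {url: # skips / # stumbles}.

    A stumble counts as a skip when time_spent <= skip_threshold.
    """
    totals, skips = {}, {}
    for s in stumbles:
        totals[s.url] = totals.get(s.url, 0) + 1
        if s.time_spent <= skip_threshold:
            skips[s.url] = skips.get(s.url, 0) + 1
    return {url: skips.get(url, 0) / n for url, n in totals.items()}


# Hypothetical stumble log; the threshold value is an assumption.
log = [
    Stumble("a.example", 1.0),
    Stumble("a.example", 30.0),
    Stumble("b.example", 2.0),
    Stumble("b.example", 1.5),
]
rates = skip_rates(log, skip_threshold=3.0)
print(rates)
```

In practice the threshold would be read off the time-spent distribution of thumbed-up stumbles, as the figure suggests, rather than fixed by hand.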

Cross-device skip rate prediction
[Figure: mobile skip rate (y-axis) plotted against desktop skip rate (x-axis), with regions labeled "URLs good on both devices", "URLs bad on both devices", and "URLs worse on mobile vs desktop"]

E[Mobile_skiprate] = Desktop_skiprate x Slope + Bias
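One way to read the linear relation above: fit Slope and Bias on URLs seen on both devices, then flag a URL as worse on mobile when its observed mobile skip rate sits well above the expected value. The sketch below uses ordinary least squares; the fitting method, the `margin` parameter, and the sample rates are assumptions, since the slides give only the form of the equation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    bias = mean_y - slope * mean_x
    return slope, bias


def worse_on_mobile(desktop, mobile, slope, bias, margin):
    """True when the observed mobile skip rate exceeds the expected one by more than margin."""
    expected = desktop * slope + bias
    return mobile - expected > margin


# Hypothetical per-URL skip rates observed on both devices.
desktop_rates = [0.1, 0.2, 0.3, 0.4]
mobile_rates = [0.15, 0.25, 0.35, 0.45]
slope, bias = fit_line(desktop_rates, mobile_rates)

print(round(slope, 3), round(bias, 3))
print(worse_on_mobile(0.2, 0.60, slope, bias, margin=0.2))
print(worse_on_mobile(0.2, 0.27, slope, bias, margin=0.2))
```

The same fit could be used in the other direction: for a URL with desktop traffic only, `desktop * slope + bias` gives a prior guess of its mobile skip rate.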

Power user definition
• Is it a loyal user who has been stumbling, even occasionally, for years?

• Is it a user who stumbles regularly (daily or weekly)?

• Is it a user who is able to discover good content?

• Or one who interacts (rates, creates lists, shares content, invites friends)?

Stumble rate

• Sample of ~5M users active in the last 3 months
• Excluded users that had < 10 DOA
• Global avg: 39.2 SPD
• Top 10% avg: 71 SPD
• 25% of users have SPD >= 31.7
• Max dist. cut off: 25.2 SPD
• 50% dist. cut off: 31.7 SPD

Activity Day Rate

ADR = # active_days / account_age

• Max error: ~70%, 1.3% of the observations above that rate.
• Intercept: ~85%, 0.25% of the observations above that rate.
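The ADR ratio can be computed directly from a user's activity dates. The sketch below assumes account age is measured in days and that repeated activity on the same day counts once; both are assumptions, as are the function name and the sample dates.

```python
from datetime import date


def activity_day_rate(active_days, signup, today):
    """ADR = # active days / account age (account age in days, an assumption)."""
    account_age = (today - signup).days
    if account_age <= 0:
        raise ValueError("account age must be positive")
    # Count each calendar day once, even if the user was active several times.
    return len(set(active_days)) / account_age


# Hypothetical user: signed up 10 days ago, active on 3 distinct days.
signup = date(2013, 1, 1)
today = date(2013, 1, 11)
active = [date(2013, 1, 2), date(2013, 1, 2), date(2013, 1, 5), date(2013, 1, 9)]
adr = activity_day_rate(active, signup, today)
print(adr)
```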

Ranking users and content
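[Figure: graph relating users and content through content-discovery and content-"like" edges; an edge r_ij links node i to node j, with index bounds n and m]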

Normalizations

• By the total number of objects discovered

• By the total number of ratings

• By the total number of stumbles of the pages

• By taking into account the time of the rating
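To make the normalizations concrete, here is a minimal sketch of a discoverer score that applies three of them: by objects discovered, by stumbles of each page, and by the time of each rating. The scoring formula, the exponential half-life decay, the field names, and the sample data are all assumptions made for illustration; the slides do not specify how the normalizations are combined.

```python
def discoverer_score(pages, now, half_life_days=30.0):
    """Score a user from the pages they discovered.

    pages: list of dicts, one per discovered page, with
      'stumbles'   -- total stumbles of the page
      'like_times' -- day number of each like the page received
    """
    if not pages:
        return 0.0
    total = 0.0
    for page in pages:
        # Time normalization: a like loses half its weight every half_life_days.
        decayed_likes = sum(
            0.5 ** ((now - t) / half_life_days) for t in page["like_times"]
        )
        # Normalize by the total number of stumbles of the page.
        if page["stumbles"] > 0:
            total += decayed_likes / page["stumbles"]
    # Normalize by the total number of objects discovered.
    return total / len(pages)


# Hypothetical data: two discovered pages, scored on day 100.
pages = [
    {"stumbles": 10, "like_times": [100, 100]},
    {"stumbles": 4, "like_times": [70]},
]
score = discoverer_score(pages, now=100)
print(score)
```

Dividing by stumbles keeps heavily shown pages from dominating, and dividing by the number of discoveries keeps prolific submitters from outranking selective ones on volume alone.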

Lists
• Released in
September 2012

• 45,000 lists
created in the
first months

• 2.9M total lists by
February 2013

List by numbers

• Percentage of users who created more
than 1 list in their first week of activity:
10%

• Percentage of users who added at least 2
pages to a list in their first week of activity:
15%

URLs distribution

[Figure: number of URLs in a list (y-axis, 0 to 20) by quantile (x-axis, 0% to 100%)]

List distribution by number of topics

[Figure: count of lists (y-axis, 0 to 1e+05) by number of topics in a list (x-axis, ticks at 0, 25, 50, 75); the value 151 appears as an annotation]

Topic Classification - Minos

• Cleanup: remove stopwords, numbers
• Stem: remove suffixes
• Build n-grams: combinations of sequential words
• Wiki check: eliminate tokens not existing as articles in English Wikipedia

$$p(C_i \mid w_1, w_2, \dots, w_n) \propto p(w_1, w_2, \dots, w_n \mid C_i) \, p(C_i)$$

$$p(W \mid C_i) = \prod_{k=1}^{n} p(w_k \mid C_i)$$

$$p(C_i \mid W) \propto p(C_i) \prod_{k=1}^{n} p(w_k \mid C_i)$$
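The preprocessing steps and the product formula above describe a naive Bayes classifier. The following is a minimal sketch of that technique, not the Minos implementation: the stopword list, the crude suffix stripping, the `known_terms` set standing in for the Wikipedia check, the add-one smoothing, and the training examples are all assumptions. Scores are summed in log space to avoid underflow from multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list


def tokenize(text, known_terms=None):
    """Cleanup, crude stemming, n-grams (n <= 2), optional vocabulary check."""
    # Cleanup: drop stopwords and any token that is not purely alphabetic.
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]
    # Crude suffix stripping; a stand-in for a real stemmer.
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    # Unigrams plus bigrams of sequential words.
    tokens = words + [" ".join(pair) for pair in zip(words, words[1:])]
    # Stand-in for the Wikipedia check: keep only tokens in a known-terms set.
    if known_terms is not None:
        tokens = [t for t in tokens if t in known_terms]
    return tokens


class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_posterior(self, text, label):
        # log p(C_i) + sum_k log p(w_k | C_i), with add-one smoothing (assumption).
        total = sum(self.word_counts[label].values())
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for w in tokenize(text):
            count = self.word_counts[label][w]
            score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_posterior(text, c))


# Hypothetical training data with two topics.
docs = [
    "vintage cars and classic engines",
    "italian recipes and pasta cooking",
    "classic cars restoration",
    "cooking pasta recipes",
]
labels = ["Cars", "Food", "Cars", "Food"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("restoring vintage cars"))
print(model.predict("pasta recipes"))
print(tokenize("the vintage cars", known_terms={"vintage", "car"}))
```

The unnormalized score is enough for picking the best topic, since the denominator p(W) is the same for every class.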

List Recommendation

?

List Recommendation

[Figure: topic labels used for list recommendation: Cars, Vintage Cars, Movies, Action Movies, Classic Movies, Comedy Movies, Space, Astronomy, Space Exploration, Science, Physics, Robotics, Neuroscience]

Many other interesting problems…

• Dupe detection
• Anti-spam
• Biases, mood
• News
• Adult content
• Metrics
• Trending
• Many more…
