StumbleUpon provides personalized recommendations to help users discover new content across the web. The team analyzed user data and conducted A/B tests to optimize recommendations for mobile users, and explored several definitions of a power user, such as someone who regularly discovers and interacts with content. StumbleUpon also introduced lists for organizing content by topic; users created 45,000 of them in the first months. Data-driven techniques like topic classification were used to recommend additional lists to users.
Recommendations and User Understanding at StumbleUpon Chief Data Scientist Summit
1. Recommendations and User Understanding at StumbleUpon
Chief Data Scientist Summit, San Diego, February 2013
Debora Donato
Principal Data Scientist
Slides courtesy of
Vishal Vaingankar, Tim Abraham, Roberto Sanabria, Ulas Bardak
2. StumbleUpon’s Mission
Help users find content they did not expect to find
Be the best way to discover new and interesting things from across the Web.
3. How StumbleUpon works
1. Register 2. Tell us your interests 3. Start Stumbling and rating web pages
We use your interests and behavior to recommend new content for you!
5. • Single item type
• No serendipity
• << 100K items
• Many at a time
• < 250 categories
• Not personalized*
• Hand-labeled
• Repeats
• ~27M users
• +100M items
• >600 recs/mo.
• Auto features
• ~200 methods
• Mostly about presentation
• Hand-labeled
• Item-item similarity based methods
• Social recs only
• 10 million recs/month
6. Data-driven culture
Data science
Applied
Analytics
Research
15% of the total work force
8. Outline of the talk
• The recommendation pipeline
• Showcases:
– Mobile optimization
– Power User Understanding
– Lists
9. Discovery is very different from search
Discovery at StumbleUpon | Search
Serendipitous | Intent driven
One at a time | List of articles
Never repeats | Always repeats
Constantly adapting | Fixed results
Tailored for you | Impersonal
There is an ongoing shift from search to discovery
16. Finding mobile optimized content
Content Features
• HTML tags
• #links
• #images
• #videos
$P(\text{URL\_good} \mid \{f_1, f_2, \ldots\}) = ?$
User Feedback
$P(\text{URL\_good} \mid \{f_1, f_2, \ldots\}) = ?$
17. User Feedback signals to determine mobile optimization
URL is skipped when timespent <= skip_threshold
$\text{Skip\_rate} = \frac{\#\,\text{skips}}{\#\,\text{stumbles}}$
[Figure: CDF of thumbed-up stumbles against time (sec); the plot marks a skip threshold (secs) and the value 0.05]
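The skip rule and the skip-rate ratio on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `skip_rates`, the `(url, timespent)` record layout, the 2-second threshold, and the toy data are assumptions, not details from the talk.

```python
from collections import defaultdict

def skip_rates(stumbles, skip_threshold):
    """Return {url: skip_rate}. A stumble counts as a skip when
    timespent <= skip_threshold (both in seconds)."""
    skips = defaultdict(int)
    totals = defaultdict(int)
    for url, timespent in stumbles:
        totals[url] += 1
        if timespent <= skip_threshold:
            skips[url] += 1
    # Skip_rate = # skips / # stumbles, computed per URL
    return {url: skips[url] / totals[url] for url in totals}

# Toy records: (url, seconds spent). Values are invented for illustration.
stumbles = [
    ("a.example", 1.0), ("a.example", 12.0), ("a.example", 30.0),
    ("b.example", 0.5), ("b.example", 2.0), ("b.example", 40.0),
]
rates = skip_rates(stumbles, skip_threshold=2.0)
```

In practice the same ratio would be computed separately per device type, which is what the cross-device comparison on the next slide needs.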
18. Cross-device skip rate prediction
[Figure: scatter plot of mobile skip rate (y-axis) against desktop skip rate (x-axis), with regions labeled "URLs good on both devices", "URLs bad on both devices", and "URLs worse on mobile vs desktop"]
E[Mobile_skiprate] = Desktop_skiprate x Slope + Bias
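One way to read the linear relation above is as an ordinary least-squares fit, where URLs that sit well above the fitted line are the candidates for "worse on mobile". The sketch below assumes that reading; the `margin` cut-off, the helper names, and the data are hypothetical, and the talk does not say how the slope and bias were estimated.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    bias = mean_y - slope * mean_x
    return slope, bias

def worse_on_mobile(desktop, mobile, slope, bias, margin):
    """Indices of URLs whose observed mobile skip rate exceeds the
    expected value by more than `margin` (a hypothetical cut-off)."""
    return [
        i for i, (d, m) in enumerate(zip(desktop, mobile))
        if m - (slope * d + bias) > margin
    ]

# Invented per-URL skip rates; index 2 is much worse on mobile.
desktop = [0.10, 0.20, 0.30, 0.40, 0.50]
mobile = [0.15, 0.25, 0.70, 0.45, 0.55]
slope, bias = fit_line(desktop, mobile)
flagged = worse_on_mobile(desktop, mobile, slope, bias, margin=0.2)
```

A single outlier also pulls the fitted line toward itself, so a robust fit may be preferable on real data.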
22. Power user definition
• Is a loyal user who has been stumbling, even occasionally, for years?
• Is a user who regularly stumbles (daily or weekly)?
• Is a user who is able to discover good content?
• Or one who interacts (rates, creates lists, shares content, invites friends)?
23. Stumble rate
• Sample of ~5M users active in the last 3 months
• Excluded users that had < 10 DOA
• max dist. cut off: 25.2 SPD
• 50% dist cut off: 31.7 SPD
• Global avg: 39.2 SPD
• Top 10% avg: 71 SPD
• 25% of users have SPD >= 31.7
24. Activity Day Rate
$\text{ADR} = \frac{\#\,\text{active\_days}}{\text{account\_age}}$
• Max error: ~70%, 1.3% of the observations above that rate.
• Intercept: ~85%, 0.25% of the observations above that rate.
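As a small worked example of the ADR ratio, the sketch below counts distinct active days and divides by the account age in days. The function name, the choice of days as the unit, and the sample dates are assumptions made for illustration.

```python
from datetime import date

def activity_day_rate(active_days, signup, today):
    """ADR = # active days / account age, with age measured in days."""
    account_age = (today - signup).days
    if account_age <= 0:
        raise ValueError("account age must be positive")
    # Several stumbles on the same day count as one active day.
    return len(set(active_days)) / account_age

# Invented activity log: three distinct active days in a 10-day-old account.
active = [date(2013, 1, 1), date(2013, 1, 2), date(2013, 1, 2), date(2013, 1, 5)]
adr = activity_day_rate(active, signup=date(2013, 1, 1), today=date(2013, 1, 11))
```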
25. Ranking users and content
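[Figure: graph linking users (index i) to content (index j) through ratings r_ij, with n and m as the set sizes; edges are labeled "Content discovery" and "Content likes"]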
26. Normalizations
• By the total number of objects discovered
• By the total number of ratings
• By the total number of Stumbles of the pages
• By taking into account the time of the rating
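The slide lists the normalizations without giving formulas, so the following is only one possible reading for a single user's discovered pages: likes per discovered object, likes per Stumble, and likes weighted by recency. The record layout, the exponential decay, and the 30-day `decay_days` constant are all assumptions.

```python
import math

def discovery_scores(pages, decay_days=30.0):
    """pages: pages discovered by one user, each a dict with
    'stumbles' (times the page was shown) and 'like_ages'
    (age in days of every like the page received).
    Assumes at least one page and at least one Stumble."""
    total_likes = sum(len(p["like_ages"]) for p in pages)
    total_stumbles = sum(p["stumbles"] for p in pages)
    # Hypothetical time weighting: recent likes count more than old ones.
    decayed = sum(
        math.exp(-age / decay_days) for p in pages for age in p["like_ages"]
    )
    return {
        "raw": total_likes,
        "per_object": total_likes / len(pages),
        "per_stumble": total_likes / total_stumbles,
        "time_weighted": decayed,
    }

# Invented data: two discovered pages with three likes in total.
pages = [
    {"stumbles": 100, "like_ages": [0.0, 30.0]},
    {"stumbles": 50, "like_ages": [0.0]},
]
scores = discovery_scores(pages)
```

Normalizing by Stumbles keeps heavily shown pages from dominating, while the per-object ratio keeps prolific discoverers from ranking high on volume alone.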
28. Lists
• Released in September 2012
• 45,000 lists created in the first months
• 2.9M total lists by February 2013
29. Lists by the numbers
• Percentage of users who created more than 1 list in their first week of activity: 10%
• Percentage of users who added at least 2 pages to a list in their first week of activity: 15%
30. URLs distribution
[Figure: number of URLs in a list (y-axis, 0 to 20) by quantile (x-axis, 0% to 100%)]
32. List distribution by number of topics
[Figure: count of lists (y-axis, 0e+00 to 1e+05) by number of topics in lists (x-axis, 0 to 75); the value 151 is annotated on the plot]
33. Topic Classification - Minos
• Cleanup: remove stopwords, numbers
• Stem: remove suffixes
• Build n-grams: combinations of sequential words
• Wiki check: eliminate tokens not existing as articles in English Wikipedia
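The four preprocessing steps named on this slide (cleanup, stem, build n-grams, Wiki check) can be sketched as a small pipeline. Everything concrete here is an assumption: the stopword set, the naive suffix stripper (a stand-in, not the stemmer Minos used), the bigram limit, and the tiny `WIKI_TITLES` set that stands in for a lookup against English Wikipedia article titles.

```python
import re

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}  # tiny illustrative set
SUFFIXES = ("ing", "es", "s")  # naive suffix stripping, not a real stemmer

def cleanup(text):
    """Lowercase, keep alphabetic tokens only (drops numbers), remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, max_n=2):
    """All runs of 1..max_n sequential tokens, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def wiki_check(grams, wiki_titles):
    """Keep only n-grams that exist as article titles."""
    return [g for g in grams if g in wiki_titles]

# Hypothetical stand-in for the set of English Wikipedia article titles.
WIKI_TITLES = {"space", "exploration", "space exploration", "history"}

tokens = [stem(t) for t in cleanup("The history of space exploration in 1969")]
features = wiki_check(ngrams(tokens), WIKI_TITLES)
```

The slide does not say how stemmed tokens are matched against article titles; the sketch simply compares strings after stemming.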
34. $p(C_i \mid w_1, w_2, \ldots, w_n) \propto p(w_1, w_2, \ldots, w_n \mid C_i) \times p(C_i)$
$p(W \mid C_i) = \prod_{k=1}^{n} p(w_k \mid C_i)$
$p(C_i \mid W) \propto p(C_i) \times \prod_{k=1}^{n} p(w_k \mid C_i)$
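The factorization above is the Naive Bayes rule: score each topic by its prior times the product of per-token likelihoods, then pick the highest score. The sketch below works in log space to avoid underflow and adds Laplace smoothing; the smoothing, the class name, and the toy training data are assumptions, since the slides do not describe how Minos estimates the probabilities.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTopics:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (assumption)
        self.doc_counts = Counter()               # documents seen per topic
        self.word_counts = defaultdict(Counter)   # token counts per topic
        self.vocab = set()

    def fit(self, docs):
        """docs: iterable of (tokens, topic) pairs."""
        for tokens, topic in docs:
            self.doc_counts[topic] += 1
            self.word_counts[topic].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_score(self, tokens, topic):
        """log p(C_i) + sum_k log p(w_k | C_i), equal to log p(C_i | W) up to a constant."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[topic] / total_docs)
        denom = sum(self.word_counts[topic].values()) + self.alpha * len(self.vocab)
        for w in tokens:
            score += math.log((self.word_counts[topic][w] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(self.doc_counts, key=lambda c: self.log_score(tokens, c))

# Invented training examples, one per topic.
train = [
    (["rocket", "orbit", "planet"], "Space"),
    (["planet", "telescope"], "Astronomy"),
    (["engine", "vintage", "car"], "Cars"),
]
model = NaiveBayesTopics().fit(train)
prediction = model.predict(["vintage", "engine"])
```

Because the evidence term p(W) is the same for every topic, dropping it does not change which topic wins.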
36. List Recommendation
[Figure: example lists and topics shown on the slide: Vintage Cars, Action movies, Astronomy, Space Exploration, Robotics, Physics, Classic Movies, Movies, Cars, Space, Neuroscience, Science, Comedy Movies]
37. Many other interesting problems…
• Dupe detection
• Anti-spam
• Biases, mood
• News
• Adult content
• Metrics
• Trending
• Many more…
Editor's Notes
I want to step back a bit and ask… what
Lists are a new reality and since the fast adoption by the users
Lists can group very distinct topics, as in the case of “Save for later”, and although 60% of the lists are described by only 1 topic, there are cases in which