Data Science plays an important role in improving the user experience at Whisper, an anonymous social network. In this talk, we first give an overview of the various problems the Data Science team tackles at Whisper. The focus is on user-understanding strategies that lend themselves to recommendation and personalization, as well as on identifying content with wide appeal. This talk has greater depth and new content compared to our talk at the Data Science meetup in March.
Big Data Day LA 2015 - Data Science at Whisper - From content quality to personalization by Ulas Bardak of Whisper
1. Data Science at Whisper:
From Content Quality to
Personalization
ULAS BARDAK, MAARTEN BOSMA, MARK HSIAO
Whisper @ Big Data Day, LA - 2015/06/27
2. This Talk
Introduction to Whisper
Data Science problems overview
Examples
Personalization and Recommendations
Deep dive – Identifying Similar Users and its applications
Like-minded users for recommendations
Identifying content with broad appeal
3. The Rise of Anonymous Apps
Google Trends – “Anonymous App”
4. A little background on Whisper
Anonymous social network focused on mobile apps
Users come to share secrets, make confessions, and find others to connect with
No need to create an account
Engagement through replies, direct messages, and “hearts”
Millions of users & hundreds of millions of whispers
5. High Level Usage Patterns
Interaction Flow: App Launch → Recommendation Engine (driven by User + Content Models) → Recommended Whispers → User Engagement (feeding back into the models)
Creation Flow: Whisper Create → Suggest Image
6. Some Problems We Are Tackling
Content Understanding
• Spam detection
• Language detection
• Content quality prediction
• Content classification
• Image Suggestion
User Understanding
• Spammer detection
• Personalization
• Similar user detection
• Churn prediction
Overall
• A/B testing
• Reporting
9. Recommendations
Problem:
Showing every user the exact same content is inefficient. Engagement and interest depend on matching users’ preferences to content, i.e. personalization.
Requirements:
Fast, and able to work with little data
10. Recommendation Engine
High Personalization
• Like-minded users
• Collaborative Filtering
• …
High Coverage
• Popular in location
• Recently popular
• Popular with new users
• …
Combiner
• Merge results, deciding on the right ordering
• If not enough results, use fallback methods to backfill.
11. High Personalization
• Identifying Like-minded Users for Recommendations (by Nick Stucky-Mack)
• Online learning to rank for Collaborative Filtering
Ko-Jen (Mark) Hsiao
12. How do we find like-minded users?
1. Agglomeration [Concatenate a user’s whispers into one giant document]
2. Pre-processing [Lowercase, remove stop words, etc.]
3. Vectorization [Bag of words into vectors]
4. Dimensionality reduction [Autoencoder maps 5K+ dimensions down to ~100]
5. Similarity calculation [Top k users via cosine similarity]
6. Recommendation [Collect whispers from similar users]
7. Feedback [Regenerate the model with new activity]
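The pipeline above can be sketched with scikit-learn on toy data. This is a minimal illustration, not Whisper's production code: `TruncatedSVD` stands in for the autoencoder, and the three hypothetical "user documents" are stand-ins for real concatenated whisper text.

```python
# Minimal sketch of the like-minded-user pipeline (toy data; SVD stands in
# for the autoencoder used in the talk).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Steps 1-2: each "user document" is the concatenation of that user's
# lowercased posts (hypothetical examples here).
user_docs = [
    "love hiking mountains trails",
    "hiking camping outdoors trails",
    "cooking recipes baking bread",
]

# Step 3: bag-of-words vectors with TF-IDF weighting, stop words removed.
vecs = TfidfVectorizer(stop_words="english").fit_transform(user_docs)

# Step 4: dimensionality reduction (SVD as a stand-in for an autoencoder).
low_dim = TruncatedSVD(n_components=2, random_state=0).fit_transform(vecs)

# Step 5: top-k similar users via cosine similarity.
sims = cosine_similarity(low_dim)
np.fill_diagonal(sims, -np.inf)           # exclude self-similarity
top_k = np.argsort(-sims, axis=1)[:, :1]  # k = 1 nearest neighbor per user
print(top_k.ravel().tolist())
```

The two "hiking" users come out as each other's nearest neighbor, while the "cooking" user is far from both; step 6 would then collect whispers from those neighbors.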
13. High Personalization
• Identifying Like-minded Users for Recommendations
• Online learning to rank for Collaborative Filtering
15. Collaborative Filtering
We want to learn a low-dimensional embedding for each user and each whisper.
Instead of solving this by matrix factorization, we view it as a ranking problem.
We only care about the top recommended results, not accurate score predictions for all whispers.
16. Learning to rank
Basic idea:
Every time there is an interaction between a user u and a whisper w, update the embeddings U_u and W_w such that their inner product gets a higher value.
17. Collaborative Filtering
Learn a score function f(u, w) that scores whispers for a given user, e.g. the inner product of the user and whisper embeddings:
f(u, w) = U_u^T W_w
Define a rank function that gives the rank of whisper w for user u, i.e. the number of whispers scoring higher:
rank(u, w) = Σ_{k ≠ w} I{ f(u, k) > f(u, w) }
where the sum runs over all whispers k and I is the indicator function.
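The score and rank functions above are straightforward in NumPy. The embedding matrices `U` and `W` below are hypothetical toy values, not the talk's trained model:

```python
# f(u, w) = U_u^T W_w and the exact rank, on toy embeddings.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))   # 4 users, 3-dim embeddings
W = rng.normal(size=(10, 3))  # 10 whispers, 3-dim embeddings

def f(u, w):
    """Score of whisper w for user u: inner product of their embeddings."""
    return U[u] @ W[w]

def rank(u, w):
    """How many other whispers score strictly higher than w for user u."""
    scores = U[u] @ W.T
    return int(np.sum(scores > scores[w]))

best = int(np.argmax(U[0] @ W.T))
print(rank(0, best))  # 0: nothing outscores the top whisper
```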
18. Learning to rank
We can then define an error function using the template:
err(f(x), y) = L(rank(x, y))
where L is a non-decreasing loss function and rank is the actual rank.
[Plot: loss L as a function of rank]
19. Learning to rank
Problem: For large datasets like ours, it is computationally expensive to obtain the exact ranks of items.
Solution: Don’t use the exact rank! Follow a sampling process to approximate it:
approx_rank(u, w) ≈ #TotalWhispers / D
where D is the number of random samples drawn before we find the first violation.
Online learning to rank: use the Weighted Approximate-Rank Pairwise (WARP) loss* and optimize with stochastic gradient descent.
*Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35, 2010.
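The sampling approximation can be sketched as follows, again on hypothetical toy embeddings. We draw random whispers until one scores higher than w (the "first violation") and estimate the rank from the number of draws D; a cap on the draws guards against the case where w is the top whisper:

```python
# Approximate rank via sampling (toy sketch of the WARP-style estimator).
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(1, 5))     # 1 user, 5-dim embedding
W = rng.normal(size=(1000, 5))  # 1000 whispers
N = W.shape[0]

def exact_rank(u, w):
    """Number of whispers scoring strictly higher than w for user u."""
    scores = U[u] @ W.T
    return int(np.sum(scores > scores[w]))

def approx_rank(u, w, rng, max_draws=10_000):
    """Estimate the rank by sampling until the first violating whisper."""
    target = U[u] @ W[w]
    for D in range(1, max_draws + 1):
        k = int(rng.integers(N))
        if k != w and U[u] @ W[k] > target:
            return (N - 1) // D
    return 0  # no violation found: w is (approximately) top-ranked

best = int(np.argmax(U[0] @ W.T))
print(exact_rank(0, best), approx_rank(0, best, rng))  # 0 0
```

The estimate is cheap precisely when the exact rank would be expensive: highly ranked whispers need many draws to find a violation, but those draws replace a full pass over all whispers.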
20. Evaluation
Achieved recall@1%:
40% for our Hearts dataset.
20% for our Conversations dataset.
Used for personalized push notifications and feed generation.
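One way to compute a recall@1% metric like the one quoted above (a sketch, not necessarily the exact evaluation used in the talk): the share of a user's held-out positive items that appear in the top 1% of the ranked list.

```python
# Recall at a percentage cutoff of the ranked list (illustrative sketch).
import numpy as np

def recall_at_pct(scores, held_out, pct=0.01):
    """scores: (n_items,) predicted scores; held_out: set of true item ids."""
    k = max(1, int(len(scores) * pct))
    top_k = set(np.argsort(-scores)[:k].tolist())
    return len(top_k & held_out) / len(held_out)

scores = np.linspace(1.0, 0.0, 200)        # item 0 scored highest
print(recall_at_pct(scores, {0, 1, 150}))  # top 1% = top 2 items -> 2/3
```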
26. What is high quality?
Liked by a wide variety of people
Deep, Emotional
Text: Writing style, grammar & spelling
Image: Quality photo
Original
“High stakes”
27. Popular ≠ High quality
Great content might not get exposure
Selection bias
Low quality content might be engaging
Low quality content generates attention
Still useful to rank a set of preselected whispers
28. The problem with using only recommendations
May be one-sided
Exploitation, no exploration
Algorithm makes mistakes
New content problem
29. The solution
A quality score for each piece of content
Two potential uses:
Human Curation – use the quality score as a tool to find high-quality whispers
Quality Score Filter – for recommendations
30. Basic Model
40k whispers promoted by curators
100k whispers from background collection
Model
Logistic Regression
χ2 feature selection
Character n-grams (lengths 1 to 6)
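The basic model can be sketched as a scikit-learn pipeline. The texts and labels below are toy stand-ins for the 40k curated and 100k background whispers, and the feature-selection size is an illustrative choice:

```python
# Sketch of the basic quality model: char 1-6 grams, chi-squared feature
# selection, logistic regression (toy data, not the talk's datasets).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

texts = [
    "I finally told my family the truth and it felt amazing",
    "deep confession about something I have never shared",
    "lol random spam click here now",
    "buy followers cheap cheap cheap",
]
labels = [1, 1, 0, 0]  # 1 = curator-promoted, 0 = background (toy labels)

model = Pipeline([
    ("ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(1, 6))),
    ("select", SelectKBest(chi2, k=100)),   # illustrative k
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)
print(model.predict(["I never told anyone this secret"])[0])
```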
31. Additional Features
Length of text
Pos-Tags
Punctuation, Capitalization
Similarity with background corpus
Likelihood under language model
Out-of-vocabulary
Topic Models
KL Divergence
See Agichtein et al., Finding High-Quality Content in Social Media, WSDM 2008
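One of the features above, sketched: the KL divergence between a whisper's word distribution and the background corpus distribution. The counts below are toy numbers, and additive smoothing is assumed so the divergence stays finite:

```python
# KL divergence feature sketch: how far a text's word distribution sits
# from the background corpus (toy counts, add-alpha smoothing).
import numpy as np

def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over a shared vocabulary, with add-alpha smoothing."""
    p = (p_counts + alpha) / (p_counts + alpha).sum()
    q = (q_counts + alpha) / (q_counts + alpha).sum()
    return float(np.sum(p * np.log(p / q)))

background = np.array([50.0, 30.0, 15.0, 5.0])  # corpus word counts
on_topic = np.array([48.0, 32.0, 14.0, 6.0])    # similar distribution
off_topic = np.array([1.0, 2.0, 7.0, 90.0])     # very different

print(kl_divergence(on_topic, background)
      < kl_divergence(off_topic, background))  # True
```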
32. TextShape
Opposite of stop word removal and stemming
Used as an alternative model to find good whispers
Ex:
I danced with two people at my wedding. The one I married, and the one man I loved.
I xed with two x at my xing. The one I xed, and the one x I xed.
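The TextShape transform can be sketched as follows: keep function words and punctuation, replace content words with "x" plus their suffix (as in the "xed"/"xing" example). The stop-word list and suffix rules below are illustrative assumptions, not the talk's actual lists:

```python
# TextShape sketch: the opposite of stop word removal and stemming.
# STOP and SUFFIXES are illustrative assumptions.
import re

STOP = {"i", "the", "one", "and", "with", "at", "my", "two",
        "a", "an", "to", "of", "in"}
SUFFIXES = ("ing", "ed", "s")

def textshape(text):
    def shape(match):
        word = match.group(0)
        if word.lower() in STOP:
            return word                  # keep function words as-is
        for suf in SUFFIXES:
            if word.lower().endswith(suf) and len(word) > len(suf) + 2:
                return "x" + suf         # keep the grammatical suffix
        return "x"                       # blank out the content word
    return re.sub(r"[A-Za-z]+", shape, text)

print(textshape("I danced with two people at my wedding."))
# I xed with two x at my xing.
```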
33. Thank You for Listening!
Questions?
For more info:
http://www.whisper.sh
Contact us at ulas@whisper.sh
Try out the app for yourself!