2. User Recommendation Problem
● First step: Candidate set generation
● Second step: Rank candidates using a supervised ML model
● Problem?
● Need to generate training data for the ML model.
● Generate candidates (2 hop) for users in an old snapshot of the social graph, say from 1 month earlier.
● Look at the current social graph: if a link between user and candidate has been established there, treat the edge as the positive class.
● If no link was established, treat the edge as the negative class.
● Not the best way to get training data, since the edges that actually formed depend on the previous recommendation algorithm, but a good start.
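The labelling scheme above can be sketched roughly as follows. This is only an illustration, not the original pipeline: it assumes each graph snapshot is a plain dict mapping a user to the set of users they follow, and the names `two_hop` and `label_candidates` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # assumed layout: user -> set of users they follow


def two_hop(graph: Graph, user: str) -> Set[str]:
    """Users reachable in two follow steps, excluding the user and existing follows."""
    followed = graph.get(user, set())
    found: Set[str] = set()
    for mid in followed:
        found |= graph.get(mid, set())
    return found - followed - {user}


def label_candidates(old: Graph, new: Graph) -> List[Tuple[str, str, int]]:
    """Return (user, candidate, label) rows; label is 1 if the edge exists in `new`."""
    rows = []
    for user in sorted(old):
        for cand in sorted(two_hop(old, user)):
            label = 1 if cand in new.get(user, set()) else 0
            rows.append((user, cand, label))
    return rows


# Toy snapshots: "a" started following "c" between the two snapshots.
old_graph = {"a": {"b"}, "b": {"c", "d"}, "c": set(), "d": set()}
new_graph = {"a": {"b", "c"}, "b": {"c", "d"}, "c": set(), "d": set()}
training_rows = label_candidates(old_graph, new_graph)
```

In this toy example the pair (a, c) becomes a positive row and (a, d) a negative row.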
3. Candidate Set Generation
Which users do you want to consider for WTF (Who To Follow) recs?
● Simple approach: all users at 2 hops are candidates (ranked by the number of 2-hop paths, keeping only the top 200)
● Complex approaches
  ● Use personalized PageRank or SALSA to find candidates for each user.
  ● Use user interactions to build a weighted social graph, then apply the techniques above.
Many users (around 50%) do not have a 2-hop neighborhood.
● Use Facebook friends as candidates (only 16% of users have no FB candidates, and only 5% of users have neither FB candidates nor 2-hop neighbors)
● Use Approximate Nearest Neighbors
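A minimal sketch of the simple approach with a Facebook-friend fallback, under stated assumptions: the follow graph is a dict of user -> set of followed users, "number of hops" is read as the number of 2-hop paths to the candidate, and `fb_friends` is a hypothetical mapping from a user to their Facebook friends who are also on the service. The function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # assumed layout: user -> set of users they follow


def two_hop_candidates(graph: Graph, user: str, limit: int = 200) -> List[str]:
    """Rank 2-hop users by the number of length-2 paths leading to them."""
    followed = graph.get(user, set())
    paths: Counter = Counter()
    for mid in followed:
        for cand in graph.get(mid, set()):
            if cand != user and cand not in followed:
                paths[cand] += 1
    # Sort by path count (descending), then by name so ties are deterministic.
    ranked = sorted(paths.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cand for cand, _ in ranked[:limit]]


def candidates_with_fallback(
    graph: Graph, user: str, fb_friends: Graph, limit: int = 200
) -> List[str]:
    """Use 2-hop candidates when they exist, otherwise fall back to FB friends."""
    cands = two_hop_candidates(graph, user, limit)
    if cands:
        return cands
    followed = graph.get(user, set())
    friends = fb_friends.get(user, set())
    return sorted(f for f in friends if f != user and f not in followed)[:limit]


graph = {"u": {"x", "y"}, "x": {"p", "q"}, "y": {"p"}, "lonely": set()}
ranked_for_u = two_hop_candidates(graph, "u")
ranked_for_lonely = candidates_with_fallback(graph, "lonely", {"lonely": {"u", "p"}})
```

Here "p" outranks "q" for user "u" because two paths lead to it, while "lonely" has no 2-hop neighborhood and gets its candidates from the fallback.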
4. Extracting Features
● hops: number of paths of length 2 between user1 and user2
● hopslog: hops / log(number of subscribers of user2)
● common: number of common neighbors shared by user1 and user2
● jaccard: common / (size of the union of the neighbors of user1 and user2)
● cosine: cosine similarity of the user vectors of user1 and user2
● adamic: sum over the common neighbors of user1 and user2 of 1 / log(number of subscribers of the neighbor)
● indegree: in-degree of user2
● fraction_n2: for two users i and j, fraction of the subscriptions of i that follow j
● fraction_n1: for two users i and j, fraction of the subscriptions of j that i follows
● pref_attachment: (number of subscriptions of i) × (number of followers of j)
● reverse_edge: for users i and j, 1 if j follows i, otherwise 0
● Label: positive or negative class, as described in slide 2
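Several of these features can be computed from two adjacency maps, as in the sketch below. It rests on assumptions the slides do not settle: "neighbors" is taken to mean the users someone follows, "subscribers" is taken to mean followers, and adamic uses the standard Adamic/Adar form over common neighbors. cosine and fraction_n1 are left out because the slides do not define the user vectors or the exact sets involved. `invert` and `pair_features` are hypothetical names.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def invert(out_edges: Graph) -> Graph:
    """Build follower sets (in-edges) from followee sets (out-edges)."""
    in_edges: Graph = {u: set() for u in out_edges}
    for u, followees in out_edges.items():
        for v in followees:
            in_edges.setdefault(v, set()).add(u)
    return in_edges


def pair_features(out_edges: Graph, in_edges: Graph, i: str, j: str) -> Dict[str, float]:
    out_i = out_edges.get(i, set())
    out_j = out_edges.get(j, set())
    in_j = in_edges.get(j, set())

    hops = len(out_i & in_j)      # paths i -> k -> j
    common = out_i & out_j        # assumption: neighbors = followed users
    union = out_i | out_j
    adamic = 0.0
    for k in common:
        subs = len(in_edges.get(k, set()))
        if subs > 1:              # log(1) == 0, so skip to avoid dividing by zero
            adamic += 1.0 / math.log(subs)

    return {
        "hops": hops,
        "hopslog": hops / math.log(len(in_j)) if len(in_j) > 1 else 0.0,
        "common": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "adamic": adamic,
        "indegree": len(in_j),
        "fraction_n2": hops / len(out_i) if out_i else 0.0,
        "pref_attachment": len(out_i) * len(in_j),
        "reverse_edge": 1 if i in out_j else 0,
    }


out_edges = {"i": {"a", "b"}, "j": {"a", "i"}, "a": {"j"}, "b": {"j"}}
in_edges = invert(out_edges)
feats = pair_features(out_edges, in_edges, "i", "j")
```

The guards on log(1) and on empty sets are a design choice of this sketch; the slides do not say how such edge cases were handled.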
5. Ranking Features by Importance
● 0.185521009562 hops
● 0.151976624315 fraction_n2
● 0.126571252655 fraction_n1
● 0.126321244854 cosine
● 0.0828860325682 pref_attachment
● 0.0709010797719 indegree_j
● 0.0660478462424 hopslog
● 0.0649419577136 adamic
● 0.0531705297389 common
● 0.0372079185808 jaccard
● 0.0344545039974 reverse_edge
Importances as given by Gradient Boosted Regression Trees. Treat this ranking only as an indication: features such as fraction_n2, fraction_n1, and jaccard depend on each other, so their importance can end up split among them, whereas features like cosine similarity do not depend on other features.
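As a quick sanity check on the scores listed above, the sketch below sums them and accumulates them in rank order. The values are copied from this slide; the variable names are only illustrative.

```python
importances = {
    "hops": 0.185521009562,
    "fraction_n2": 0.151976624315,
    "fraction_n1": 0.126571252655,
    "cosine": 0.126321244854,
    "pref_attachment": 0.0828860325682,
    "indegree_j": 0.0709010797719,
    "hopslog": 0.0660478462424,
    "adamic": 0.0649419577136,
    "common": 0.0531705297389,
    "jaccard": 0.0372079185808,
    "reverse_edge": 0.0344545039974,
}

total = sum(importances.values())

# Accumulate the scores from most to least important feature.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
running = 0.0
cumulative = []
for name, score in ranked:
    running += score
    cumulative.append((name, round(running, 3)))
```

The scores sum to 1 up to rounding, so each can be read as a share of the total; the top four features (hops, fraction_n2, fraction_n1, cosine) together account for about 59%.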
6. Extracting Features
● More features that can be considered in the future:
  ● Facebook friend Boolean, PageRank score, geographic distance, age difference, …
7. Machine Learning Models
● Tried Logistic Regression, SVM, and Random Forests; in the end, Gradient Boosted Decision Trees gave the best performance (68-69%).
● However, the learnt model depends on the current module that is serving WTF recs, since its training labels come from edges formed under that module's recommendations.
● When pushed to production, the model can learn from a better training set.
8. Results from testing with Spotify Employees
● Total Records: 1251
● Yes / Total = 22.14%
● Yes, and I know the recommended user / Total responses where users knew their recommendation = 61.11%
● Yes, and I like the person's musical taste / Total responses where users liked their recommendation's taste = 61.36%
● Yes, I like and know the recommended user / Total people who liked and knew their recommendations = 78.57%
● Yes, I like the user's taste but I don't know the user / Total people who liked the taste and didn't know their recommendations = 35.7%
● Yes, I know the user but dislike the user's taste / Total people who disliked the taste and knew their recommendations = 17.8%
9. Optimizations:
● At first I converted each userID into an integer, loaded the entire dataset into memory, and then did the computation.
● This was very difficult to convert to multiprocessing code: each process tried to make its own copy of the graph, which was not possible, and creating a shared object was very slow.
● The best option was to use a database, because only retrieval was needed.
● Sparkey was preferred to Tokyo Cabinet because the time to construct the index was much lower.
● 1 process: very slow, 10 users per second
● Bound by the call to the OpenGraph API for Spotify users' FB friends
● 100 processes: 92.6 users per second, 1 million users in 180 minutes
● 150 processes: 116.7 users per second, 1.8 million users in 257 minutes
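The pattern described above (a read-only on-disk store that every worker process opens for itself, instead of copying the graph into each process) might look roughly like this. It is a sketch under assumptions: SQLite from the Python standard library stands in for Sparkey, whose API is not shown here, and the table layout and every function name are hypothetical.

```python
import sqlite3
from multiprocessing import Pool
from typing import Dict, Iterable, List, Optional, Set, Tuple

_conn: Optional[sqlite3.Connection] = None  # one read-only connection per worker


def build_store(path: str, graph: Dict[int, Set[int]]) -> None:
    """Write the follow graph once, as indexed (user, followee) rows."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS follows (user INTEGER, followee INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON follows (user)")
    conn.executemany(
        "INSERT INTO follows VALUES (?, ?)",
        [(u, v) for u, vs in graph.items() for v in vs],
    )
    conn.commit()
    conn.close()


def init_worker(path: str) -> None:
    """Pool initializer: each process opens its own read-only handle."""
    global _conn
    _conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def followees(user: int) -> Set[int]:
    rows = _conn.execute("SELECT followee FROM follows WHERE user = ?", (user,))
    return {r[0] for r in rows}


def two_hop_count(user: int) -> Tuple[int, int]:
    """Per-user work item: count 2-hop candidates using lookups only."""
    direct = followees(user)
    reached: Set[int] = set()
    for mid in direct:
        reached |= followees(mid)
    return user, len(reached - direct - {user})


def run_parallel(path: str, users: Iterable[int], processes: int = 4) -> List[Tuple[int, int]]:
    """Fan the users out over worker processes; nothing is shared in memory."""
    with Pool(processes, initializer=init_worker, initargs=(path,)) as pool:
        return pool.map(two_hop_count, users, chunksize=64)
```

`run_parallel` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes. Because workers only read, no locking or shared objects are needed, which is the property the slide relies on.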
10. Resources
● Seminal paper by Kleinberg: http://www.cs.cornell.edu/home/kleinber/link-pred.pdf
● Supervised Learning: http://www3.nd.edu/~dial/papers/KDD10.pdf
● Twitter: http://www.stanford.edu/~rezab/papers/wtf_overview.pdf
  ● Twitter's WTF problem is pretty similar to ours: asymmetric follows
● Future:
  ● Supervised Random Walks: http://cs.stanford.edu/people/jure/pubs/linkpredwsdm11.pdf
  ● Large Scale Twitter: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  ● Fast Page Rank: http://arxiv.org/abs/1006.2880