WhoToFollow Recommendations
Rohan Agrawal
Fall 2013 Internship @ Spotify
User Recommendation Problem
● First step: Candidate set generation
● Second step: Rank candidates using a supervised ML model
● Problem?
  ● Need to generate training data for the ML model.
  ● Generate candidates (2 hop) for users in an old social graph, say from 1 month before.
  ● Look at the current social graph: if a link between user and candidate was established in the current graph, treat the edge as a positive class.
  ● If a link was not established, treat the edge as a negative class.
  ● Not the best way to get training data, as the edges actually formed depend on the previous recommendation algorithm, but a good start.
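The labeling scheme above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `label_candidates`, the dict/set data layout, and the toy users are hypothetical and not taken from the original pipeline.

```python
def label_candidates(old_candidates, current_edges):
    """Label each (user, candidate) pair generated from the old graph.

    old_candidates: dict mapping user -> list of candidates generated
                    from the old snapshot (e.g. 1 month before).
    current_edges:  set of (follower, followee) pairs in the current graph.
    Returns (user, candidate, label) rows, with label 1 (positive class)
    if the user follows the candidate in the current graph, else 0.
    """
    rows = []
    for user, candidates in old_candidates.items():
        for cand in candidates:
            label = 1 if (user, cand) in current_edges else 0
            rows.append((user, cand, label))
    return rows


# Toy data (hypothetical): candidates come from the old snapshot.
old_candidates = {"alice": ["carol", "dave"], "bob": ["erin"]}
current_edges = {("alice", "carol"), ("bob", "alice")}
training_rows = label_candidates(old_candidates, current_edges)
```

Only pairs that were candidates in the old graph receive a label, so edges such as ("bob", "alice") that were never candidates do not appear in the training set.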
Candidate Set Generation
Which users do you want to consider for WTF recs?
● Simple approach: all users at 2 hops are candidates (ranked by the total number of hops, i.e. the number of 2-hop paths to the candidate; just take the top 200)
● Complex approaches
  ● Use personalized PageRank or SALSA to find candidates for each user.
  ● Use user interaction to get a weighted social graph, then perform the above techniques.
Many users (around 50%) do not have a 2-hop neighborhood.
● Use Facebook friends as candidates (only 16% of users don’t have FB candidates, and 5% of users have neither FB candidates nor 2-hop neighbors)
● Use Approximate Nearest Neighbors
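A minimal sketch of the simple 2-hop approach, assuming the social graph is a dict from each user to the set of users they follow. The function name `two_hop_candidates` is hypothetical, and excluding users who are already followed is an assumption; the slides do not say how such users were handled.

```python
from collections import Counter


def two_hop_candidates(follows, user, top_k=200):
    """Return up to top_k (candidate, path_count) pairs for `user`.

    follows: dict mapping each user to the set of users they follow.
    A candidate is anyone reachable in exactly two follow steps who is
    neither the user nor already followed by the user (assumption).
    """
    direct = follows.get(user, set())
    path_counts = Counter()
    for mid in direct:
        for cand in follows.get(mid, set()):
            if cand != user and cand not in direct:
                path_counts[cand] += 1
    # Rank by descending number of 2-hop paths; ties broken by name.
    ranked = sorted(path_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave", "erin"},
    "carol": {"dave"},
}
candidates = two_hop_candidates(follows, "alice")
```

A user who follows nobody gets an empty candidate list, which is the case where the Facebook-friend fallback becomes necessary.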
Extracting Features
● hops: number of paths of length 2 between user1 and user2
● hopslog: hops / log(# of subscribers user2 has)
● common: number of common neighbors shared by user1 and user2
● jaccard: common / (size of the union of the neighbors of user1 and user2)
● cosine: cosine similarity of the user vectors of user1 and user2
● adamic: summation over the neighbors shared by user1 and user2 of [1 / log(# of subscribers of the neighbor)]
● indegree: in-degree of user2
● fraction_n2: for 2 users i and j, fraction of subscriptions of i that are following j
● fraction_n1: for 2 users i and j, fraction of subscriptions of j that i follows
● pref_attachment: (number of subscriptions of i) * (number of followers of j)
● reverse_edge: for the pair (i, j), 1 if j follows i
● Label: positive or negative class, as described in slide 2.
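A few of these features can be sketched as follows. The sketch rests on assumptions that the slides do not confirm: "neighbors" is read as the users someone follows, "subscribers" as followers, and adamic uses the standard Adamic/Adar form over shared neighbors. The function name `pair_features` and the toy graph are hypothetical.

```python
import math


def pair_features(follows, followers, i, j):
    """Compute some of the listed features for the ordered pair (i, j).

    follows[u]   -> set of users u follows (assumed meaning of "neighbors")
    followers[u] -> set of users following u (assumed "subscribers")
    """
    out_i = follows.get(i, set())
    out_j = follows.get(j, set())
    common = out_i & out_j
    union = out_i | out_j
    # hops: number of directed paths i -> k -> j of length 2.
    hops = sum(1 for k in out_i if j in follows.get(k, set()))
    # Adamic/Adar: rarely followed shared neighbors count more. Neighbors
    # with fewer than 2 followers are skipped, since log(1) is 0.
    adamic = sum(
        1.0 / math.log(len(followers.get(k, set())))
        for k in common
        if len(followers.get(k, set())) > 1
    )
    indegree = len(followers.get(j, set()))
    return {
        "hops": hops,
        "common": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "adamic": adamic,
        "indegree": indegree,
        "pref_attachment": len(out_i) * indegree,
        "reverse_edge": 1 if i in out_j else 0,
    }


follows = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": {"a", "c"}}
# Derive the reverse index (who follows whom) from the follow lists.
followers = {}
for u, outs in follows.items():
    for v in outs:
        followers.setdefault(v, set()).add(u)

features = pair_features(follows, followers, "a", "d")
```

Keeping both the forward and the reverse index means every feature here is a handful of set lookups, which matches the later observation that only retrieval is needed at computation time.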
Ranking Features by Importance
● 0.185521009562 hops
● 0.151976624315 fraction_n2
● 0.126571252655 fraction_n1
● 0.126321244854 cosine
● 0.0828860325682 pref_attachment
● 0.0709010797719 indegree_j
● 0.0660478462424 hopslog
● 0.0649419577136 adamic
● 0.0531705297389 common
● 0.0372079185808 jaccard
● 0.0344545039974 reverse_edge

As given by Gradient Boosted Regression Trees. This ranking should be read only as an indication, because many features (such as fraction_n2, fraction_n1 and jaccard) depend on each other, while features like cosine similarity don’t depend on other features.
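As a quick check on how to read these numbers, the short sketch below confirms that the listed importances sum to 1, so each value can be read as a share of the total. It uses only the figures from the slide; the variable names are illustrative.

```python
importances = {
    "hops": 0.185521009562,
    "fraction_n2": 0.151976624315,
    "fraction_n1": 0.126571252655,
    "cosine": 0.126321244854,
    "pref_attachment": 0.0828860325682,
    "indegree_j": 0.0709010797719,
    "hopslog": 0.0660478462424,
    "adamic": 0.0649419577136,
    "common": 0.0531705297389,
    "jaccard": 0.0372079185808,
    "reverse_edge": 0.0344545039974,
}

# The values are normalized: together they account for all of the importance.
total = sum(importances.values())

# Cumulative share carried by the four highest-ranked features.
ranked = sorted(importances.items(), key=lambda kv: -kv[1])
top4_share = sum(value for _, value in ranked[:4])
```

The top four features carry roughly 59% of the total, with the caveat from the slide that correlated features split importance between them.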
Extracting Features
● More features that can be considered in the future:
  ● Facebook friend Boolean, PageRank score, geographic distance, age difference, …
Machine Learning Models
● Tried Logistic Regression, SVM and Random Forests; in the end, Gradient Boosted Decision Trees gave the best performance (68 - 69%).
● However, the model that was learnt depends on the current module serving WTF recs.
● When pushed to production, the model can learn from a better training set.
Results from testing with Spotify Employees
● Total Records: 1251
● Yes / Total = 22.14%
● Yes and I know the recommendation / Total responses where users
knew their recommendation = 61.11%
● Yes and I like the person’s musical taste / Total responses where users liked their recommendation’s taste = 61.36%
● Yes, I like and know the recommended user / Total people who liked and knew their recommendations = 78.57%
● Yes, I like the user’s taste but I don’t know the user / Total people who liked the taste and didn’t know their recommendations = 35.7%
● Yes, I know the user but dislike the user’s taste / Total people who disliked the taste and knew their recommendations = 17.8%
Optimizations:
● At first I converted each userID into an integer, loaded the entire dataset into memory, and then did the computation.
● This was very difficult to convert to multiprocessing code. (Each process tried to make a copy of the graph, which was not possible, and creating a shared object was very slow.)
● The best option was to use a database, because only retrieval was needed.
● Sparkey was preferred to Tokyo Cabinet, because the time to construct the index was much lower.
● 1 process: very slow, 10 users per second
  ● bound by the call to the OpenGraph API for Spotify users’ FB friends
● 100 processes: 92.6 users per second, 1 million users in 180 minutes
● 150 processes: 116.7 users per second, 1.8 million users in 257 minutes
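The quoted rates follow directly from the totals and run times, as the small calculation below shows. The helper name `users_per_second` is illustrative only.

```python
def users_per_second(total_users, minutes):
    """Average throughput implied by a user count and wall-clock minutes."""
    return total_users / (minutes * 60)


rate_100 = users_per_second(1_000_000, 180)  # run with 100 processes
rate_150 = users_per_second(1_800_000, 257)  # run with 150 processes

# Gain from going from 100 to 150 processes. Perfectly linear scaling
# would give 1.5x; the measured figures imply noticeably less.
speedup = rate_150 / rate_100
```

The speedup comes out at about 1.26x for 1.5x the processes, which fits a workload limited by an external API call rather than by local computation.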
Resources
● Seminal paper by Kleinberg http://www.cs.cornell.edu/home/kleinber/link-pred.pdf
● Supervised Learning http://www3.nd.edu/~dial/papers/KDD10.pdf
● Twitter http://www.stanford.edu/~rezab/papers/wtf_overview.pdf
  ● Twitter’s WTF problem is pretty similar to ours, asymmetric follows

Future:
● Supervised Random Walks http://cs.stanford.edu/people/jure/pubs/linkpredwsdm11.pdf
● Large Scale Twitter http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
● Fast Page Rank http://arxiv.org/abs/1006.2880
Thank YOU!
Questions?
