Collaborative filtering common_problems_and_solutions


  1. Collaborative Filtering Algorithms: Common Problems and Solutions
     Vivek A. Ganesan, vivganes@gmail.com
     Big Data Gods Meetup, Santa Clara, CA, May 13, 2013
  2. Before we start
     o A BIG thank you to our sponsors: Big Data Cloud
       o Meeting Space
       o Support
     o Check out their big data training
     Copyright 2013, Vivek A. Ganesan, All rights reserved
  3. Introduction
     o Program Outline
       o This is an opt-in program; it is FREE! (as in beer)
       o We do social coding (which means you share your code as open source, Apache v2 license)
       o Program duration = 1 month, weekly sprints
       o Weekly meetup (topical + social coding + Q/A)
       o A weekend hackathon (Sat. afternoon) on alternate weeks (deep technical immersion)
       o Demo at the end of the program
  4. Agenda
     o CF Algorithms: recap
     o Problems with CF and solutions
     o Update on the Project
     o Questions?
     o Discussion
  5. CF: Common sense version
     o Out of a large group of users who have rated items:
       o Pick a “small” subset of users who are “similar” to you
     o Now, for an item that you have not yet rated but your “similar” users have rated:
       o Figure out an “average” rating for the item from your “similar” group of users
       o Weight it with your rating history and predict a rating
  6. CF: Visual
     User/Movie                     Sleepless in Seattle   Titanic   Terminator 2
     Alice                          5                      5         3
     Bob                            1                      3         5
     Chandra                        3                      5         4
     Dawood                         2                      3         5
     Eduardo (you or active user)   2                      4         ?
  7. A sample approach
     o Compute Eduardo’s “similarity” to all other users
     o Pick the three users “most similar” to Eduardo
     o Weight their ratings for Terminator 2 by their degree of similarity to Eduardo
     o Make sure that the predicted rating is within the given scale (0 to 5)
     o … and predict Eduardo’s rating for Terminator 2
  8. Step 1: Measuring Similarity
     o Start with a distance metric
       o There are several: let’s pick Euclidean, for example
       o In n-dimensional space: the square root of the sum of squared differences
     o Convert it to a similarity score (0 to 1)
       o 1/(1 + Euclidean Distance) (adding 1 to avoid division by zero)
       o Approaches 0 for no match, equals 1 for a perfect match
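
The conversion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names are made up for the example, and the two rating vectors are taken from the slide 6 table (Sleepless in Seattle, Titanic).

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences over paired ratings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a score in (0, 1]; identical vectors score 1."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Ratings for (Sleepless in Seattle, Titanic) from the slide 6 table.
eduardo = [2, 4]
dawood = [2, 3]
print(round(similarity(eduardo, dawood), 3))  # 0.5
```

Note that the score never actually reaches 0; it only gets closer to 0 as the distance grows.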
  9. CF: Distances & Similarities
     Distance and similarity of each user to Eduardo:
     User      Distance   Similarity
     Alice     3.16       0.24
     Bob       1.414      0.414
     Chandra   1.414      0.414
     Dawood    1          0.5
     • Pick the top three users most similar to Eduardo:
       • Dawood, Bob and Chandra
     • Weight their ratings for Terminator 2 by their degree of similarity to Eduardo:
       • (0.414 x 5) + (0.414 x 4) + (0.5 x 5) = 6.226
     • Oops, too big a rating (0 to 5 scale)!
       • Divide by the sum of similarities (0.414 + 0.414 + 0.5)
     • Answer: 6.226/1.328 = 4.688 (our prediction)
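
The arithmetic on this slide can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the `predict` helper, its parameter names, and the clamping to the 0 to 5 scale are choices made for this example; the ratings are the ones from the slide 6 table.

```python
import math

# Ratings from the slide 6 table: (Sleepless in Seattle, Titanic, Terminator 2)
ratings = {
    "Alice":   (5, 5, 3),
    "Bob":     (1, 3, 5),
    "Chandra": (3, 5, 4),
    "Dawood":  (2, 3, 5),
}
eduardo = (2, 4)  # Terminator 2 is the unknown rating

def similarity(a, b):
    """1 / (1 + Euclidean distance), as on the previous slide."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def predict(active, others, k=3, lo=0.0, hi=5.0):
    """Similarity-weighted average of the target rating over the k nearest users."""
    # Compare on the co-rated items only; the last entry is the target item.
    scored = [(similarity(active, r[:len(active)]), r[-1]) for r in others.values()]
    top = sorted(scored, reverse=True)[:k]
    weighted = sum(s * r for s, r in top)
    total = sum(s for s, _ in top)
    # Dividing by the sum of similarities brings the result back onto the scale;
    # the clamp is a safety net (an assumption of this sketch).
    return min(hi, max(lo, weighted / total))

prediction = predict(eduardo, ratings)
print(round(prediction, 3))  # 4.688
```

Alice is dropped because she has the lowest similarity (about 0.24), so only Dawood, Bob and Chandra contribute to the weighted average.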
  10. Pick the correct similarity metric
     o Pearson’s correlation coefficient
       o COV(x,y)/(SD(x)*SD(y))
       o Scale-invariant, i.e. adjusts for rating bias
       o However, can give skewed results for small dimensions (few co-rated items)
       o Solution: Use a smoothing function
     o Other metrics: Cosine similarity, Jaccard, Tanimoto, etc.
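
A minimal Python sketch of Pearson's coefficient shows both properties mentioned on this slide. The slide does not say which smoothing function to use; the `shrunk_pearson` helper below uses a simple n/(n + k) damping factor purely as an assumed example, with `k` as a hypothetical tuning constant.

```python
import math

def pearson(x, y):
    """COV(x, y) / (SD(x) * SD(y)); returns None when either SD is zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/n factors in the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None  # undefined: one user gave every item the same rating
    return cov / (sx * sy)

def shrunk_pearson(x, y, k=10):
    """Damp the correlation by n / (n + k); k is a hypothetical constant."""
    r = pearson(x, y)
    return None if r is None else r * len(x) / (len(x) + k)

# Two co-rated movies only (Sleepless in Seattle, Titanic) from slide 6.
eduardo, bob, alice = [2, 4], [1, 3], [5, 5]
print(pearson(eduardo, bob))         # very close to 1: a "perfect" match from 2 points
print(pearson(eduardo, alice))       # None: Alice's ratings have zero spread
print(shrunk_pearson(eduardo, bob))  # damped because n is small
```

With only two co-rated items, any two users with non-constant ratings correlate at exactly +1 or -1 (up to rounding), which is the kind of skew the slide warns about. Damping by the number of co-rated items is one common way to discount such thin evidence.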
  11. Cold Start Problem
     o First User problem
       o New user does not have any ratings
       o No good way in CF to find similar users based on ratings (no rating history for new user)
       o A Solution: Start with popular items or item-based
     o First Item problem
       o Use item attributes to recommend similar items
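
One way to read "start with popular items" is sketched below in Python. The ranking rule (mean rating, with a minimum number of ratings) and the names `popular_items`, `min_count` and `top_n` are assumptions of this sketch, not something the slide specifies; the data is the slide 6 table.

```python
def popular_items(ratings, min_count=2, top_n=3):
    """Rank items by mean rating, skipping items with fewer than min_count ratings."""
    totals = {}
    for user_ratings in ratings.values():
        for item, score in user_ratings.items():
            s, c = totals.get(item, (0.0, 0))
            totals[item] = (s + score, c + 1)
    means = {i: s / c for i, (s, c) in totals.items() if c >= min_count}
    # Highest mean first; ties broken alphabetically for a stable order.
    return sorted(means, key=lambda i: (-means[i], i))[:top_n]

ratings = {
    "Alice":   {"Sleepless in Seattle": 5, "Titanic": 5, "Terminator 2": 3},
    "Bob":     {"Sleepless in Seattle": 1, "Titanic": 3, "Terminator 2": 5},
    "Chandra": {"Sleepless in Seattle": 3, "Titanic": 5, "Terminator 2": 4},
    "Dawood":  {"Sleepless in Seattle": 2, "Titanic": 3, "Terminator 2": 5},
}

# A brand-new user has no history, so fall back to the popular list.
suggestions = popular_items(ratings)
print(suggestions)  # ['Terminator 2', 'Titanic', 'Sleepless in Seattle']
```

Once the new user has rated a few items, the similarity-based approach from the earlier slides can take over.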
  12. Sparse Ratings Problem
     o Given large numbers of users and items,
       o Most users would only rate a handful of items
       o So, the number of users who have rated the same set of items would be quite small
       o Throws off the recommendations (small set of users to recommend from)
     o A Solution: Hybrid recommenders, i.e. use both collaborative and content-based approaches
     o Also: Use model-based approaches
  13. User Quirks Problems
     o Power/Super Users
       o Users who rate an unusually high number of items
     o Black Sheep Users
       o So idiosyncratic that recommendations break down
     o Skewed ratings
       o Usually done deliberately (e.g. to boost one’s restaurant and/or disparage a competitor)
  14. Some Considerations
     o K.I.S.S.
       o If you don’t understand the approach, don’t use it
     o Test, test, test
       o Use RMSE to test on existing data
       o Do A/B testing on a live system
     o Try hybrid approaches
       o Use a combination of item-based and user-based plus content attributes
  15. For this scrum
     o Don’t worry too much about the problems
       o Goal is to learn Collaborative Filtering
     o However, …
       o Do implement testing
       o For instance, remove a few ratings from the data set and see how closely the system can predict those (use RMSE as a test metric)
       o A/B testing for live systems
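
The hold-out test described on this slide can be sketched as follows. Everything here is illustrative: `holdout_rmse` and `item_mean` are hypothetical helper names, and the item-mean predictor is only a placeholder standing in for whatever recommender is being tested.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def holdout_rmse(ratings, held_out, predict):
    """Hide the (user, item) pairs in held_out, predict them, and score with RMSE."""
    train = {u: {i: r for i, r in items.items() if (u, i) not in held_out}
             for u, items in ratings.items()}
    preds = [predict(train, u, i) for u, i in held_out]
    truth = [ratings[u][i] for u, i in held_out]
    return rmse(preds, truth)

def item_mean(train, user, item):
    """Placeholder predictor: mean of the item's remaining ratings."""
    vals = [items[item] for items in train.values() if item in items]
    return sum(vals) / len(vals)

ratings = {
    "Alice":   {"Titanic": 5, "Terminator 2": 3},
    "Bob":     {"Titanic": 3, "Terminator 2": 5},
    "Chandra": {"Titanic": 5, "Terminator 2": 4},
    "Dawood":  {"Titanic": 3, "Terminator 2": 5},
}
held_out = [("Alice", "Terminator 2"), ("Bob", "Titanic")]
score = holdout_rmse(ratings, held_out, item_mean)
print(score)
```

A lower RMSE means the hidden ratings were predicted more closely. Swapping `item_mean` for a similarity-based predictor with the same signature lets the two be compared on the same held-out pairs.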
  16. Questions? Comments?
     Thank You!
     E-mail: vivganes@gmail.com
     Twitter: onevivek
