Collaborative Filtering Algorithms: Getting Started
Vivek A. Ganesan
vivganes@gmail.com
Big Data Gods Meetup, Santa Clara, CA
May 13, 2013

Before we start
Copyright 2013, Vivek A. Ganesan, All rights reserved
o A BIG thank you to our sponsors: Big Data Cloud
o Meeting Space
o Support
o Check out their big data training

Introduction
o Program Outline
o This is an opt-in program, it is FREE! (as in beer)
o We do social coding (which means you share your code as open source, Apache v2 license)
o Program duration = 1 month, weekly sprints
o Weekly meetup (topical + social coding + Q/A)
o A weekend hackathon (Sat. afternoon) alternate weeks (deep technical immersion)
o Demo at the end of the program

Agenda
o Introduction to CF Algorithms
o When to use CF?
o Metrics
o Exercise
o Questions?

Introduction to CF Algorithms
o A family of algorithms used to predict
o The preference of a user for an item, given
o a matrix of user preferences for items, where
o preferences must be expressed numerically (e.g. user ratings of items on a 1 to 5 integer scale)
o Collaborative because it looks only at user preferences and does not take into account user or item attributes
o Filtering is math-speak for selecting a subset

CF: Common sense version
o Out of a large group of users who have rated items:
o Pick a “small” subset of users who are “similar” to you
o Now, for an item that you have not yet rated but your “similar” users have rated:
o Figure out an “average” rating for the item from your “similar” group of users
o Weigh it with your rating history and predict a rating

CF: Visual
User/Movie                      Sleepless in Seattle   Titanic   Terminator 2
Alice                           5                      5         3
Bob                             1                      3         5
Chandra                         3                      5         4
Dawood                          2                      3         5
Eduardo (you or active user)    2                      4         ?

A sample approach
o Compute Eduardo’s “similarity” to all other users
o Pick the three users “most similar” to Eduardo
o Weigh their ratings for Terminator 2 by their degree of similarity to Eduardo
o Make sure that the predicted rating is within the given scale (0 to 5)
o … and predict Eduardo’s rating for Terminator 2

Step 1: Measuring Similarity
o Start with a distance metric
o There are several; let’s pick Euclidean as an example
o In n-dimensional space: square root of the sum of squared differences
o Convert it to a similarity score (0 to 1)
o 1/(1 + Euclidean Distance) (adding 1 to avoid division by zero)
o Approaches 0 for no match (large distance), exactly 1 for a perfect match (distance 0)
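
The distance-to-similarity conversion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names euclidean_distance and similarity are made up for this example, and the ratings are the ones from the sample table, restricted to the two movies Eduardo has rated.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a score in (0, 1]; 1 means identical ratings."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Ratings for (Sleepless in Seattle, Titanic), the two movies Eduardo has rated.
eduardo = (2, 4)
others = {
    "Alice": (5, 5),
    "Bob": (1, 3),
    "Chandra": (3, 5),
    "Dawood": (2, 3),
}

for name, ratings in others.items():
    d = euclidean_distance(eduardo, ratings)
    s = similarity(eduardo, ratings)
    print(f"{name}: distance={d:.3f}, similarity={s:.3f}")
```

Only the movies both users have rated enter the distance, which is why Terminator 2 is left out of the vectors here.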

CF: Distances & Similarities
User      Distance   Similarity
Alice     3.16       0.24
Bob       1.414      0.414
Chandra   1.414      0.414
Dawood    1          0.5
• Pick the top three users most similar to Eduardo:
• Dawood, Bob and Chandra
• Weigh their ratings for Terminator 2 by their degree of similarity to Eduardo:
• (0.414 x 5) + (0.414 x 4) + (0.5 x 5) = 6.226
• Oops, too big a rating (0 to 5 scale)!
• Divide by sum of similarities (0.414 + 0.414 + 0.5)
• Answer: 6.226/1.328 = 4.688 (our prediction)
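
The worked prediction above can be reproduced with a short Python sketch. The names ratings, eduardo, similarity and predict are illustrative assumptions; the clamp to the 0 to 5 scale and the choice of k=3 neighbors follow the slides.

```python
import math

# The sample table; Eduardo has not yet rated Terminator 2.
ratings = {
    "Alice":   {"Sleepless in Seattle": 5, "Titanic": 5, "Terminator 2": 3},
    "Bob":     {"Sleepless in Seattle": 1, "Titanic": 3, "Terminator 2": 5},
    "Chandra": {"Sleepless in Seattle": 3, "Titanic": 5, "Terminator 2": 4},
    "Dawood":  {"Sleepless in Seattle": 2, "Titanic": 3, "Terminator 2": 5},
}
eduardo = {"Sleepless in Seattle": 2, "Titanic": 4}

def similarity(a, b):
    """1 / (1 + Euclidean distance), computed over the movies both users rated."""
    shared = [m for m in a if m in b]
    dist = math.sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + dist)

def predict(active, others, item, k=3, lo=0.0, hi=5.0):
    """Similarity-weighted average of the k most similar users' ratings for item."""
    scored = [(similarity(active, r), r[item]) for r in others.values() if item in r]
    top = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return None  # nobody has rated this item
    weighted = sum(s * rating for s, rating in top) / total
    return min(hi, max(lo, weighted))  # keep the result on the rating scale

prediction = predict(eduardo, ratings, "Terminator 2")
print(round(prediction, 3))
```

Dividing by the sum of similarities turns the weighted sum into a weighted average, so the result cannot exceed the highest neighbor rating.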

Improvements
o Some users rate movies consistently higher and others rate them consistently lower
o Adjust for this by weighting each neighbor’s deviation from that neighbor’s own mean rating, and then adding the mean rating of the active user
o Consult the GroupLens paper for details
o Use other similarity measures that correct for “grade inflation”, e.g. Pearson’s correlation
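
One way to read the mean adjustment is sketched below: predict the active user's mean plus a similarity-weighted average of how far each neighbor's rating sits from that neighbor's own mean. This is a sketch under assumptions, not the GroupLens implementation: it keeps the Euclidean-based similarity from the earlier slides rather than Pearson's correlation, it takes each neighbor's mean over all of their ratings, and the names mean and predict_mean_centered are made up for this example.

```python
import math

ratings = {
    "Alice":   {"Sleepless in Seattle": 5, "Titanic": 5, "Terminator 2": 3},
    "Bob":     {"Sleepless in Seattle": 1, "Titanic": 3, "Terminator 2": 5},
    "Chandra": {"Sleepless in Seattle": 3, "Titanic": 5, "Terminator 2": 4},
    "Dawood":  {"Sleepless in Seattle": 2, "Titanic": 3, "Terminator 2": 5},
}
eduardo = {"Sleepless in Seattle": 2, "Titanic": 4}

def mean(user_ratings):
    """Average of all ratings a user has given (an assumption of this sketch)."""
    return sum(user_ratings.values()) / len(user_ratings)

def similarity(a, b):
    """1 / (1 + Euclidean distance), computed over the movies both users rated."""
    shared = [m for m in a if m in b]
    dist = math.sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + dist)

def predict_mean_centered(active, others, item, k=3, lo=0.0, hi=5.0):
    """Active user's mean plus the weighted average deviation of the top-k neighbors."""
    neighbors = [
        (similarity(active, r), r[item] - mean(r))  # deviation from the neighbor's own mean
        for r in others.values()
        if item in r
    ]
    top = sorted(neighbors, key=lambda p: p[0], reverse=True)[:k]
    total = sum(abs(s) for s, _ in top)
    if total == 0:
        return mean(active)  # no usable neighbors: fall back to the user's own mean
    offset = sum(s * dev for s, dev in top) / total
    return min(hi, max(lo, mean(active) + offset))

prediction = predict_mean_centered(eduardo, ratings, "Terminator 2")
print(round(prediction, 3))
```

With this adjustment a neighbor who rates everything high contributes only the amount by which a movie stands out from their usual ratings.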

A recommendation engine
o Imagine a much larger data set of users and movie ratings
o Do the same math for all users against all other users
o Then predict ratings for the movies that each user has not yet rated
o For a given user, pick the N movies with the highest predicted ratings and recommend those
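
The loop described above can be sketched as follows, reusing the small sample table in place of a large data set. The names predict, recommend and recommendations are illustrative assumptions, and a real engine would need a more scalable way to find neighbors than comparing every user with every other user.

```python
import math

# The sample table; a missing entry means the user has not rated that movie.
ratings = {
    "Alice":   {"Sleepless in Seattle": 5, "Titanic": 5, "Terminator 2": 3},
    "Bob":     {"Sleepless in Seattle": 1, "Titanic": 3, "Terminator 2": 5},
    "Chandra": {"Sleepless in Seattle": 3, "Titanic": 5, "Terminator 2": 4},
    "Dawood":  {"Sleepless in Seattle": 2, "Titanic": 3, "Terminator 2": 5},
    "Eduardo": {"Sleepless in Seattle": 2, "Titanic": 4},
}

def similarity(a, b):
    """1 / (1 + Euclidean distance) over shared movies; 0 if nothing is shared."""
    shared = [m for m in a if m in b]
    if not shared:
        return 0.0
    dist = math.sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + dist)

def predict(user, item, k=3):
    """Similarity-weighted average rating for item from the user's k nearest neighbors."""
    scored = [
        (similarity(ratings[user], r), r[item])
        for name, r in ratings.items()
        if name != user and item in r
    ]
    top = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return None
    return sum(s * rating for s, rating in top) / total

def recommend(user, n=2):
    """Top-n unrated movies for user, ordered by predicted rating (highest first)."""
    all_items = {m for r in ratings.values() for m in r}
    unseen = all_items - set(ratings[user])
    predicted = [(item, predict(user, item)) for item in unseen]
    predicted = [(item, p) for item, p in predicted if p is not None]
    return sorted(predicted, key=lambda pair: pair[1], reverse=True)[:n]

# Same math for every user against all other users.
recommendations = {user: recommend(user) for user in ratings}
print(recommendations["Eduardo"])
```

In this tiny table only Eduardo has an unrated movie, so every other user gets an empty recommendation list.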

Questions? Comments?
Thank You!
E-mail: email@example.com
Twitter: onevivek