Definition of Recommendation Engine
"A recommendation system provides information or items that are likely to be of interest to a user, in an automated fashion"
- Alpa Jain from Twitter
"Serve the right item to users in an automated fashion to optimize long-term business objectives"
- Deepak Agarwal from Yahoo
Examples
• Related Products (Amazon)
• Movie Recommendation (Netflix)
• News Content (Yahoo)
• Online Dating (eHarmony)
• Search Autocomplete (Google)
• Connection Recommendation (LinkedIn)
• Song Recommendation (Pandora)
• (Physical) Store Layout (Walmart)
Why Recommendation?
• A way for users to find content of interest (from a large selection) with less effort.
  o A natural way to personalize!
  o Serendipity factor
• For companies, a good way to introduce new and unknown content
Item-based recommendation (1)
1. Content-based Item Recommendation.
  o Using metadata from each item, compute the similarity between items.
    i. Collect attributes: description, price, category, and so on.
    ii. Normalize these into a feature vector of numeric values (you can think of it as a point in N-dimensional space).
    iii. Compute the distances between vectors, using one of:
      - Euclidean Distance Score
      - Cosine Similarity Score
      - Pearson Correlation Score
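The three scores named above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library computes them: the item names and the three-element feature vectors (normalized price plus two category flags) are made up, and mapping Euclidean distance to a score with 1 / (1 + distance) is one common convention chosen here as an assumption.

```python
import math

def euclidean_score(a, b):
    """Turn Euclidean distance into a score in (0, 1]; 1.0 means identical vectors."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)  # assumed convention, not from the slides

def cosine_score(a, b):
    """Cosine of the angle between the two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_score(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_score([x - mean_a for x in a], [y - mean_b for y in b])

# Hypothetical items: [normalized price, is_dress, is_shoe]
items = {
    "dress_a": [0.8, 1.0, 0.0],
    "dress_b": [0.7, 1.0, 0.0],
    "shoe_c": [0.2, 0.0, 1.0],
}

for other in ("dress_b", "shoe_c"):
    print(other, cosine_score(items["dress_a"], items[other]))
```

With these made-up vectors, the two dresses score closer to each other than either does to the shoe under all three measures.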
Item-based recommendation (2)
2. Collaborative Filtering.
  o Leverage users' collective intelligence
    - Similar users tend to like similar items
    - Amazon's product recommendation is a well-known example
  o Will look at this in more detail
User-based recommendation
• First group users into different clusters
  o Represent users as feature vectors
    - Information about users: geo-location, gender, age, …
    - Items users liked or rated
  o K-nearest neighbors (KNN) is used a lot
• From each cluster, find representative items
  o Some kind of graph traversal
  o Highest rated items
  o Most liked items
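A minimal sketch of the nearest-neighbor idea, under stated assumptions: users are represented only by the set of items they liked, similarity between users is Jaccard overlap (a choice made here, not something the slides specify), and "representative items" means the items most liked among the k nearest users. It skips explicit clustering and goes straight to neighbors. All user and item names are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_neighbors(likes, user, k=2, top_n=3):
    """likes maps each user to the set of items that user liked."""
    mine = likes[user]
    # The k users whose liked items overlap most with this user's.
    neighbors = sorted(
        (u for u in likes if u != user),
        key=lambda u: jaccard(mine, likes[u]),
        reverse=True,
    )[:k]
    # Most liked items among those neighbors, skipping items already liked.
    counts = Counter(
        item for u in neighbors for item in likes[u] if item not in mine
    )
    return [item for item, _ in counts.most_common(top_n)]

likes = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "cat": {"a", "c", "d"},
    "dan": {"x", "y"},
}
print(recommend_from_neighbors(likes, "ann"))
```

Here "ann" is closest to "bob" and "cat", and the only item they like that she has not is "d".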
Challenges of Recommendation Engine
• Cold Start
  o For new users and/or items, there is no information to leverage.
• Sparse Data
  o Item reviews or purchases are not very common.
• Scalability Issue
  o The bigger the data gets, the more computation is needed.
What is Hadoop?
• An open source distributed computation and storage platform modeled after Google File System and the MapReduce framework
• A good fit for large-scale offline batch processing, but not for real-time processing
• Widely used in many companies
What is Mahout?
• An open source machine learning library written in Java. It runs in two modes:
  1. Standalone
  2. MapReduce
    o Supports large-scale offline batch processing.
• Covers the following
  o Recommendation/Collaborative Filtering.
  o Classification: Supervised Learning.
  o Clustering: Unsupervised Learning.
Typical Architecture
1. Data Collection: web server logs, MySQL tables, ... (explicit feedback and implicit feedback)
2. Input Data Pre-processing (ETL, Filtering, …)
3. Recommendation Data Building (Mahout)
4. Output Data Post-processing (Re-ordering)
5. Load Final Data To Serving Layer
6. Recommendation Serving Layer: MySQL, NoSQL, Solr/ElasticSearch, ...
(Hadoop is the batch processing platform in this pipeline.)
Use Case: Polyvore – Item Page
[Screenshot: item page showing the item in question, with Content Based Recommendation and Collaborative Filtering sections]
Use Case: Polyvore – Home Page
[Screenshot: home page showing Personalized Recommendation]
People who liked this also like ...
• This is based on "Collaborative Filtering"
• Construct a co-occurrence matrix, or item similarity matrix, S[NxN]
  o Increment S[i,j] and S[j,i] if item i and item j are liked by the same user
  o Repeat this for all users and all of their liked items
• For item k, take the items that co-occur with it most often (from column k or row k) as recommendations.
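The counting steps above can be sketched as follows. This is a small in-memory illustration with hypothetical users and items, not the MapReduce job Mahout runs; S is stored as nested dictionaries instead of a dense NxN matrix, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(likes_by_user):
    """Return S, where S[i][j] counts the users who liked both item i and item j."""
    s = defaultdict(lambda: defaultdict(int))
    for items in likes_by_user.values():
        # Every unordered pair of distinct items liked by this user.
        for i, j in combinations(sorted(set(items)), 2):
            s[i][j] += 1
            s[j][i] += 1
    return s

def also_liked(s, item, top_n=3):
    """Items that co-occur most often with `item` (row `item` of S)."""
    row = s.get(item, {})
    ranked = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]

likes_by_user = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
}
s = build_cooccurrence(likes_by_user)
print(also_liked(s, "a"))
```

In this toy data, "a" and "b" are liked together by two users and "a" and "c" by one, so "b" ranks first for "a".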
Personalized Recommendation
• This is based on "Collaborative Filtering"
• Extension of the previous topic
• Computation-wise, it is a matrix multiplication
  a. First, build a similarity matrix (S) for items
  b. Next, build a preference vector (P) for the user
  c. Next, multiply the two from a and b: R = S x P
  d. Lastly, sort the elements of the final vector R by score
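Steps a through d can be sketched with plain Python lists. The matrix values and item IDs below are hypothetical, P is treated as a column vector of 0/1 likes, and dropping items the user already liked (and items with a zero score) is an assumption added here, since recommending them back would not be useful.

```python
def personalized_scores(s, p):
    """R = S x P: R[i] is the sum over j of S[i][j] * P[j]."""
    n = len(p)
    return [sum(s[i][j] * p[j] for j in range(n)) for i in range(n)]

def recommend(s, p, item_ids, top_n=3):
    """Sort R by score, highest first, skipping already-liked and zero-score items."""
    r = personalized_scores(s, p)
    candidates = [
        (score, item)
        for score, item, liked in zip(r, item_ids, p)
        if not liked and score > 0
    ]
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [item for _, item in candidates[:top_n]]

item_ids = ["a", "b", "c", "d"]
# Hypothetical symmetric co-occurrence matrix S (4 x 4).
s = [
    [0, 2, 1, 0],
    [2, 0, 2, 1],
    [1, 2, 0, 0],
    [0, 1, 0, 0],
]
p = [1, 1, 0, 0]  # this user liked items "a" and "b"
print(recommend(s, p, item_ids))
```

For this user R works out to [2, 2, 3, 1], so after removing the liked items "a" and "b" the ranking is "c" and then "d".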
Polyvore Example
• Assumption:
  o N items and M users. Users can only like (no rating)
• Create item similarity matrix S (NxN)
  o This will be used for recommendations on the Item page
• Create user preference vector P (1xN)
  o Set P(i) to 1 for every item i liked by the user in question (0 otherwise)
• Multiply S by P
  o Sort the resulting elements by score
  o This will be the personalized item recommendation
How to use Mahout?
• ItemSimilarityJob class
  • Main class to compute the co-occurrence matrix.
• RecommenderJob class
  • Main class to generate personalized recommendations.

    hadoop jar mahout-core-0.8-job.jar \
      org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
      -Dmapred.input.dir=input/user-item-rating.txt \
      -Dmapred.output.dir=output \
      --usersFile input/users.txt \
      --booleanData \
      --similarityClassname SIMILARITY_COOCCURRENCE \
      --minPrefsPerUser 2 \
      --maxPrefsPerUser 50000

This will run a total of 10 MapReduce jobs to generate the final recommendations for users
How to use Mahout? (Contd)
• Input File: user-item-rating.txt
  o userID,itemID[,rating] per line.
• How to compute similarity between items
  o The --similarityClassname parameter determines this:
    - CooccurrenceCountSimilarity
    - LogLikelihoodSimilarity
    - TanimotoCoefficientSimilarity
    - CityBlockSimilarity
    - CosineSimilarity
    - PearsonCorrelationSimilarity
    - EuclideanDistanceSimilarity
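As an illustration of the input format, here is a minimal sketch that turns like events into `userID,itemID` lines (the no-rating form of the format above). The function name and the sample IDs are hypothetical; de-duplicating and sorting the pairs is a choice made here for tidy output, not a stated requirement. The IDs are numeric in this sketch because Mahout's Taste recommenders work with numeric user and item IDs.

```python
def to_preference_lines(likes):
    """likes: iterable of (user_id, item_id) pairs; returns 'userID,itemID' lines."""
    unique_pairs = sorted(set(likes))  # drop repeated likes, keep a stable order
    return [f"{user_id},{item_id}" for user_id, item_id in unique_pairs]

# Hypothetical like events, including one duplicate.
likes = [(2, 101), (1, 102), (1, 101), (1, 101)]

for line in to_preference_lines(likes):
    print(line)
```

The resulting lines can be written to a file such as user-item-rating.txt and copied to HDFS before running the job.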
How to use Mahout? (Contd)
• Final Output
  o UserID [(ItemID,Score),(ItemID,Score),...]
  o ...
• Load this from HDFS to a serving layer
  o Relational Database
  o Search Engine
  o NoSQL
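A rough sketch of the loading step, with several assumptions stated up front: the output has already been copied out of HDFS as text, each line looks like `userID<TAB>[itemID:score,itemID:score,...]` (an assumption about the concrete file format, so check a real output file and adjust `parse_line`), and SQLite stands in for the relational database. The table and column names are made up.

```python
import sqlite3

def parse_line(line):
    """Parse 'userID<TAB>[itemID:score,...]' (assumed format) into (user_id, pairs)."""
    user_id, _, body = line.strip().partition("\t")
    pairs = []
    for entry in body.strip("[]").split(","):
        if not entry:
            continue
        item_id, _, score = entry.partition(":")
        pairs.append((int(item_id), float(score)))
    return int(user_id), pairs

def load(lines, conn):
    """Insert one row per (user, item), keeping the order of the recommendations."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recs "
        "(user_id INTEGER, item_id INTEGER, score REAL, pos INTEGER)"
    )
    for line in lines:
        if not line.strip():
            continue
        user_id, pairs = parse_line(line)
        conn.executemany(
            "INSERT INTO recs VALUES (?, ?, ?, ?)",
            [(user_id, item, score, pos)
             for pos, (item, score) in enumerate(pairs, start=1)],
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(["1\t[101:2.0,102:1.0]", "2\t[103:3.5]"], conn)
print(conn.execute(
    "SELECT item_id, score FROM recs WHERE user_id = 1 ORDER BY pos"
).fetchall())
```

The serving layer can then answer a request with a single lookup by user ID, ordered by `pos`.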
Lessons
• Need to understand the business domain
  o This takes time and effort
• Garbage In, Garbage Out
  o Filtering is very important
• Start with a simple approach
  o And then improve gradually
• Having an automated pipeline is very important
  o It makes more experiments possible with less effort
  o Remember you will have to do lots of experiments
  o But it is hard and takes time to build
Next stage of recommendation?
• Need real-time and scalable recommendation technology.
• Recommendation As A Service.
  • www.myrrix.com