Building a large-scale recommendation engine


A quick survey of recommendation strategies and an introduction to Mahout

  • Netflix: 7 days to 1 day. 30M watches per day.
  • An unexpected discovery!
  • This is effective when you have a lot more users than items.
  • 2% of users provide feedback
  • Make captions more visible and also Likes button on the far left.
  • Access log case: lots of robot traffic. What is the business case for Polyvore? Where is your traffic coming from? What are users' intentions? Number of users and items. Seasonality.

    1. 1. Building Recommendation Engine
        Keeyong Han, Jan 2013
    2. 2. Table of Contents
        1. What is Recommendation?
        2. Different Recommendation Strategies
        3. Introduction of Hadoop/Mahout
        4. Building Recommendation Engine with Hadoop/Mahout
        5. How to use Mahout
        6. Q&A
    3. 3. What is Recommendation?
    4. 4. Definition of Recommendation Engine
        • "A recommendation system provides information or items that are likely to be of interest to a user, in an automated fashion" - Alpa Jain from Twitter
        • "Serve the right item to users in an automated fashion to optimize long-term business objectives" - Deepak Agarwal from Yahoo
    5. 5. Examples
        • Related Product (Amazon)
        • Movie Recommendation (Netflix)
        • News Contents (Yahoo)
        • Online Dating (eHarmony)
        • Search Autocomplete (Google)
        • Connection Recommendation (LinkedIn)
        • Song Recommendation (Pandora)
        • Store Layout (Walmart, physical stores)
    6. 6. Why Recommendation?
        • A way for users to find content of interest (from a large selection) with less effort.
            o A natural path to personalization!
            o Serendipity factor
        • For companies, a good way to introduce new and unknown content
    7. 7. Different Recommendation Strategies: Item vs. User
    8. 8. Item-based recommendation (1)
        1. Content-based Item Recommendation.
            o Using metadata from each item, compute the similarity between items.
                i. Collect description, price, category and so on.
                ii. Normalize these into a feature vector (numeric values). You can think of it as a point in N-dimensional space.
                iii. Compute the distance or similarity between vectors, for example:
                    - Euclidean Distance Score
                    - Cosine Similarity Score
                    - Pearson Correlation Score
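The three scores named on this slide can be sketched in plain Python. This is a minimal illustration, not code from the talk: the function names, the `1 / (1 + distance)` mapping used to turn a Euclidean distance into a score, and the sample feature vectors are all assumptions made for the example.

```python
import math

def euclidean_score(a, b):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance). The mapping is an assumption."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def cosine_score(a, b):
    """Cosine of the angle between the two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_score(a, b):
    """Pearson correlation of the two vectors; 0.0 if either has no variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Hypothetical, already-normalized item features: [price, category_code, rating]
item_a = [0.2, 1.0, 0.5]
item_b = [0.4, 2.0, 1.0]
print(euclidean_score(item_a, item_b))
print(cosine_score(item_a, item_b))
print(pearson_score(item_a, item_b))
```

Note that cosine and Pearson ignore vector length, so an item whose features are a scaled copy of another's still scores as highly similar, while the Euclidean score penalizes the difference in magnitude.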
    9. 9. Item-based recommendation (2)
        2. Collaborative Filtering.
            o Leverage users' collective intelligence
                - Similar users tend to like similar items
                - Amazon's product recommendation is a very good and famous example
            o Will look at this in more detail
    10. 10. User-based recommendation
        • First group users into different clusters
            o Represent users as feature vectors
                - Information about users: geo-location, gender, age, …
                - Items users liked or rated
            o K-nearest neighbors (KNN) is used a lot
        • From each cluster, find representative items
            o Some kind of graph traversal
            o Highest rated items
            o Most liked items
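One way to read the user-based steps above is sketched below: each user is a binary "liked" vector, the k most similar users form the neighborhood, and the items most liked within that neighborhood are returned. The function names, the choice of cosine similarity, and the sample data are assumptions for illustration; the slides do not prescribe a specific implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(target, users, k):
    """Return the ids of the k users most similar to `target`."""
    scored = [(cosine(users[target], vec), uid)
              for uid, vec in users.items() if uid != target]
    scored.sort(reverse=True)
    return [uid for _, uid in scored[:k]]

def recommend_from_neighbors(target, users, k):
    """Items the neighbors liked and the target has not, most liked first."""
    liked = users[target]
    counts = {}
    for uid in nearest_neighbors(target, users, k):
        for item, value in enumerate(users[uid]):
            if value and not liked[item]:
                counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=lambda i: (-counts[i], i))

# Hypothetical data: one row per user, one column per item (1 = liked).
users = {
    "u1": [1, 1, 0, 0],
    "u2": [1, 1, 1, 0],
    "u3": [0, 0, 1, 1],
    "u4": [1, 0, 1, 0],
}
print(nearest_neighbors("u1", users, 2))
print(recommend_from_neighbors("u1", users, 2))
```

In practice the feature vector could also carry the profile attributes listed on the slide (location, gender, age), normalized to numeric values.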
    11. 11. Challenges of Recommendation Engine
        • Cold Start
            o For new users and/or items, there is no information to leverage.
        • Sparse Data
            o Item reviews or purchases are not very common.
        • Scalability Issue
            o The bigger the data gets, the more computation is needed.
    12. 12. Introduction of Hadoop/Mahout
    13. 13. What is Hadoop?
        • An open source distributed computation and storage platform modeled after Google File System and the MapReduce framework
        • A good fit for large-scale offline batch processing, but not for realtime processing
        • Widely used in many companies
    14. 14. What is Mahout?
        • An open source machine learning library written in Java, usable in two modes:
            1. Standalone
            2. MapReduce
                o Supports large-scale offline batch processing.
        • Covers the following:
            o Recommendation/Collaborative Filtering.
            o Classification: Supervised Learning.
            o Clustering: Unsupervised Learning.
    15. 15. Building Recommendation Engine with Hadoop/Mahout
    16. 16. Typical Architecture
        1. Data Collection: web server logs, MySQL tables, ... (explicit feedback and implicit feedback)
        2. Input Data Pre-processing (ETL, Filtering, …)
        3. Recommendation Data Building (Mahout)
        4. Output Data Post-processing (Re-ordering)
        5. Load Final Data To Serving Layer
        6. Recommendation Serving Layer: MySQL, NoSQL, Solr/ElasticSearch, ...
        (The diagram labels the processing stages as running on Hadoop.)
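As a small illustration of the filtering part of the pre-processing stage, the sketch below drops users with too few events (not enough signal) or too many (likely robots). The function name, thresholds, and sample events are hypothetical; a real pipeline would run an equivalent step on Hadoop before the recommendation job.

```python
from collections import Counter

def filter_events(events, min_per_user=2, max_per_user=1000):
    """Keep (user, item) events only for users whose event count is within bounds."""
    counts = Counter(user for user, _ in events)
    return [(user, item) for user, item in events
            if min_per_user <= counts[user] <= max_per_user]

# Hypothetical events: "bot" is far more active than the others, "u2" has a single event.
events = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"),
    ("bot", "a"), ("bot", "b"), ("bot", "c"),
]
print(filter_events(events, min_per_user=2, max_per_user=2))
```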
    17. 17. Use Case: Polyvore, Item Page
        [Screenshot labels: Item in question; Content Based Recommendation; Collaborative Filtering]
    18. 18. Use Case: Polyvore, Home Page
        [Screenshot label: Personalized Recommendation]
    19. 19. People who liked this also like ...
        • This is based on "Collaborative Filtering"
        • Construct a co-occurrence matrix, or item similarity matrix, S[NxN]
            o Increment S[i,j] and S[j,i] if item i and item j are liked by the same user
            o Repeat this for all users and all of their liked items
        • For item k, take the items that co-occur with it most often (from column k or row k) as recommendations.
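The co-occurrence construction on this slide can be sketched as follows. It is a single-machine illustration with hypothetical users and items, using nested dictionaries instead of a dense NxN matrix; Mahout distributes the same counting across MapReduce jobs.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(likes_by_user):
    """S[i][j] = number of users who liked both item i and item j."""
    S = defaultdict(lambda: defaultdict(int))
    for items in likes_by_user.values():
        # Each unordered pair of distinct liked items increments both S[i][j] and S[j][i].
        for i, j in combinations(sorted(set(items)), 2):
            S[i][j] += 1
            S[j][i] += 1
    return S

def also_liked(S, item, top_n=3):
    """Items that co-occur most often with `item` (ties broken by item name)."""
    row = S.get(item, {})
    return sorted(row, key=lambda j: (-row[j], j))[:top_n]

# Hypothetical like data.
likes_by_user = {
    "alice": ["dress", "boots", "scarf"],
    "bob": ["boots", "scarf"],
    "carol": ["dress", "boots"],
}
S = build_cooccurrence(likes_by_user)
print(also_liked(S, "scarf"))
```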
    20. 20. Personalized Recommendation
        • This is based on "Collaborative Filtering"
        • Extension of the previous topic
        • Computation-wise, it is a matrix multiplication
            a. First, build a similarity matrix (S) for items
            b. Next, build a preference vector (P) for the user
            c. Next, multiply the two from a and b: R = S x P
            d. Lastly, sort the elements of the resulting vector R
    21. 21. Polyvore Example
        • Assumption:
            o N items and M users. Users can only like (no rating)
        • Create item similarity matrix S (NxN)
            o This is used for recommendations on the Item page
        • Create user preference vector P (1xN)
            o Set P(i) to 1 for every item i liked by the user in question
        • Multiply S by P (with P treated as a column vector)
            o Sort the resulting elements by score
            o This gives the personalized item recommendation
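The R = S x P computation can be worked through on a tiny example. The matrix values and the `recommend` helper below are made up for illustration; excluding items the user already liked is an added assumption that the slides do not state.

```python
def recommend(S, liked, top_n=3):
    """Score every item as R = S x P and return the best items not yet liked."""
    n = len(S)
    # Preference vector: 1 for liked items, 0 otherwise (likes only, no ratings).
    P = [1 if i in liked else 0 for i in range(n)]
    # R[i] = sum over j of S[i][j] * P[j]
    R = [sum(S[i][j] * P[j] for j in range(n)) for i in range(n)]
    # Assumption: do not recommend items the user has already liked.
    candidates = [i for i in range(n) if i not in liked]
    return sorted(candidates, key=lambda i: (-R[i], i))[:top_n]

# Hypothetical 4x4 item co-occurrence matrix (symmetric, zero diagonal).
S = [
    [0, 2, 1, 0],
    [2, 0, 2, 1],
    [1, 2, 0, 0],
    [0, 1, 0, 0],
]
print(recommend(S, liked={0, 1}))
```

With items 0 and 1 liked, P = [1, 1, 0, 0] and R = [2, 2, 3, 1], so item 2 ranks ahead of item 3.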
    22. 22. How to use Mahout?
        • ItemSimilarityJob class
            o Main class to compute the co-occurrence matrix.
        • RecommenderJob class
            o Main class to generate personalized recommendations.
        • Example command:
            hadoop jar mahout-core-0.8-job.jar \
              org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
              -Dmapred.input.dir=input/user-item-rating.txt \
              -Dmapred.output.dir=output \
              --usersFile input/users.txt \
              --booleanData \
              --similarityClassname SIMILARITY_COOCCURRENCE \
              --minPrefsPerUser 2 \
              --maxPrefsPerUser 50000
        • This runs a total of 10 MapReduce jobs to generate the final recommendations for users.
    23. 23. How to use Mahout? (Contd)
        • Input File: user-item-rating.txt
            o One userID,itemID[,rating] entry per line.
        • How to compute similarity between items
            o The --similarityClassname parameter determines the measure:
                - CooccurrenceCountSimilarity
                - LogLikelihoodSimilarity
                - TanimotoCoefficientSimilarity
                - CityBlockSimilarity
                - CosineSimilarity
                - PearsonCorrelationSimilarity
                - EuclideanDistanceSimilarity
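A minimal sketch of producing the input file in the userID,itemID[,rating] layout described above. The helper name and the sample IDs are hypothetical, and integer IDs are assumed here; the rating column is written only when a rating is present, which matches the like-only (boolean) case.

```python
def format_preferences(prefs):
    """Render (user_id, item_id) or (user_id, item_id, rating) tuples, one per line."""
    lines = []
    for pref in prefs:
        fields = [str(pref[0]), str(pref[1])]
        if len(pref) == 3:
            # Optional third column: the rating.
            fields.append(str(pref[2]))
        lines.append(",".join(fields))
    return "\n".join(lines) + "\n"

# Hypothetical likes (no rating) for two users.
likes = [(1, 101), (1, 102), (2, 101)]
text = format_preferences(likes)
print(text, end="")

# Writing the file that would then be copied to HDFS as the job input.
with open("user-item-rating.txt", "w") as out:
    out.write(text)
```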
    24. 24. How to use Mahout? (Contd)
        • Final Output
            o UserID [(ItemID,Score),(ItemID,Score),......
            o ...
        • Load this from HDFS to a serving layer
            o Relational Database
            o Search Engine
            o NoSQL
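Loading the output into a serving layer might look like the sketch below. It assumes each output line has the form `userID<TAB>[itemID:score,itemID:score,...]`, which differs from the notation on the slide and should be checked against the actual job output before use. SQLite stands in for the relational database, and the table name and sample lines are hypothetical.

```python
import sqlite3

def parse_line(line):
    """Parse 'userID<TAB>[itemID:score,...]' (assumed format) into (user_id, [(item_id, score), ...])."""
    user_id, recs = line.rstrip("\n").split("\t", 1)
    recs = recs.strip("[]")
    pairs = []
    if recs:
        for entry in recs.split(","):
            item_id, score = entry.split(":")
            pairs.append((int(item_id), float(score)))
    return int(user_id), pairs

def load(lines, conn):
    """Insert one row per (user, item, score) into a hypothetical `recs` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recs (user_id INTEGER, item_id INTEGER, score REAL)"
    )
    for line in lines:
        user_id, pairs = parse_line(line)
        conn.executemany(
            "INSERT INTO recs VALUES (?, ?, ?)",
            [(user_id, item_id, score) for item_id, score in pairs],
        )
    conn.commit()

# Hypothetical output lines, as they might be read from files copied out of HDFS.
sample = ["3\t[103:4.5,102:4.0]\n", "7\t[101:1.0]\n"]
conn = sqlite3.connect(":memory:")
load(sample, conn)
rows = conn.execute(
    "SELECT item_id FROM recs WHERE user_id = 3 ORDER BY score DESC"
).fetchall()
print(rows)
```

The serving layer then only needs a lookup by user ID, which is why a relational table, a search index, or a key-value store all work here.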
    25. 25. Lessons
        • Need to understand the business domain
            o This takes time and effort
        • Garbage In, Garbage Out
            o Filtering is very important
        • Start with a simple approach
            o And then improve gradually
        • Having an automated pipeline is very important
            o It makes more experiments possible with less effort
            o Remember you will have to do lots of experiments
            o But it is hard and takes time to build
    26. 26. Next stage of recommendation?
        • Need realtime & scalable recommendation technology.
        • Recommendation As A Service.
            o www.myrrix.com
    27. 27. Q&A
        keeyonghan@hotmail.com