Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Presentation on scalable collaborative filtering algorithms on Apache Spark, given at the Tapad Taptalk on 6/6/2014

Published in: Technology, Education

  1. Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
     Evan Casey, Taptech, 6/6/2014
  2. Overview
     ● Apache Spark
       ○ Dataflow model
       ○ Spark vs Hadoop MapReduce
     ● Recommender Systems
       ○ Similarity-based collaborative filtering
       ○ Distributed implementation on Apache Spark
       ○ Lessons learned
  3. Apache Spark
     ● Distributed data-processing framework, typically deployed on top of HDFS
     ● Use cases:
       ○ Interactive analytics
       ○ Graph algorithms
       ○ Stream processing
       ○ Scalable ML
       ○ Recommendation engines!
  4. Spark vs Hadoop MapReduce
     ● In-memory data flow model optimized for multi-stage jobs
     ● Novel approach to fault tolerance
     ● Similar programming style to Scalding/Cascading
  5. Programming Model
     ● Resilient Distributed Dataset (RDD)
       ○ Created with textFile or parallelize
     ● Parallel Operations
       ○ Map, GroupBy, Filter, Join, etc.
     ● Optimizations
       ○ Caching, shared variables
     ● Demo
  6. What are recommendation algorithms?
     ● Problem:
       ○ “Information overload”
       ○ Diverse user interests
     ● User-Item Recommendation
       ○ Recommend content for each user based on a larger training set of user interaction histories
  7. Motivation
     ● Large-scale recommender systems
       ○ Millions of users and items (100m+ ratings)
     ● Problems:
       ○ Memory-based approach
       ○ Scalability/Efficiency
       ○ User interaction sparsity
  8. Collaborative Filtering
     [Figure: example user-item rating matrix for users Shawn, Billy, and Mary, with unknown ratings marked "?"]
     ● Similarity-based approach
     ● Two main variants:
       ○ User-based
       ○ Item-based
  9. User-based Collaborative Filtering
     ● Step 1: Obtain the user-item matrix, denoted M_{i,j}
  10. User-based Collaborative Filtering
      ● Step 2: Calculate the similarity between each pair of users and compute the top-n nearest neighbors
      [Formula image, labeled terms: pairwise users, rating vectors]
  11. User-based Collaborative Filtering
      ● Step 3: Compute a weighted average of the neighbors' ratings and find the top-n items by score
      [Formula image, labeled terms: recommendation score of item, pairwise user similarities, mean rating, co-rated user rating]
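
The three user-based steps on slides 9 to 11 can be sketched on a single machine as follows. This is a minimal illustration, not the presenter's Spark implementation: the toy `ratings` data and the function names are hypothetical, cosine similarity is used as one possible similarity measure, and the mean-centered weighted average is the textbook scoring form that the labeled terms on slide 11 appear to describe (an assumption, since the formula image was not preserved).

```python
from math import sqrt

# Step 1: user-item matrix as nested dicts (user -> {item: rating}).
# Hypothetical toy data, not the ratings shown on the slides.
ratings = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0, "d": 4.0},
    "u3": {"a": 1.0, "b": 5.0, "d": 2.0},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def cosine(u, v, ratings):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_neighbors(user, ratings, n):
    """Step 2: similarity to every other user, keeping the n most similar."""
    sims = [(cosine(user, other, ratings), other)
            for other in ratings if other != user]
    return sorted(sims, reverse=True)[:n]


def recommend(user, ratings, n_neighbors=2, n_items=3):
    """Step 3: mean-centered weighted average of the neighbors' ratings."""
    neighbors = top_neighbors(user, ratings, n_neighbors)
    user_mean = mean(ratings[user].values())
    # Only score items the target user has not rated yet.
    candidates = {i for _, v in neighbors for i in ratings[v]} - set(ratings[user])
    scores = {}
    for item in candidates:
        num = den = 0.0
        for sim, v in neighbors:
            if item in ratings[v]:
                num += sim * (ratings[v][item] - mean(ratings[v].values()))
                den += abs(sim)
        if den > 0:
            scores[item] = user_mean + num / den
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_items]


print(recommend("u1", ratings))
```

In a distributed setting the pairwise similarity step dominates the cost, which is why the talk focuses on scaling it; the sketch above computes every pair naively and is only meant to make the data flow concrete.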
  12. Results
      ● Standalone cluster: [figure not preserved]
      ● Amazon EC2 cluster: [figure not preserved]
  13. Evaluation
  14. Lessons Learned
      ● Must manually specify the number of tasks
        ○ Want 2-4 slices for each CPU in your cluster
      ● Use broadcast variables for shared data and cache for data that will be reused
      ● Must account for the “power users”
        ○ Sampling heavy-tailed user-interaction histories
      ● Need to account for the rating scale of each user!
        ○ Adjusted cosine similarity and Pearson correlation outperform plain cosine similarity
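
The last lesson, accounting for each user's rating scale, can be illustrated with a small comparison. This is a hedged sketch on hypothetical data: `a`, `b`, and `c` are made-up users, and Pearson correlation is computed over co-rated items by subtracting each user's mean over those items. Adjusted cosine similarity applies a comparable mean-centering and is not shown separately here.

```python
from math import sqrt

# Hypothetical ratings: b has the same taste as a but rates everything lower;
# c rates on a scale similar to a but has the opposite taste.
a = {"x": 5.0, "y": 4.0, "z": 5.0}
b = {"x": 2.0, "y": 1.0, "z": 2.0}
c = {"x": 4.0, "y": 5.0, "z": 4.0}


def cosine(u, v):
    """Plain cosine similarity over co-rated items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(u, v):
    """Pearson correlation: cosine of the mean-centered co-rated ratings."""
    common = sorted(set(u) & set(v))
    if not common:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    du = [u[i] - mean_u for i in common]
    dv = [v[i] - mean_v for i in common]
    dot = sum(p * q for p, q in zip(du, dv))
    norm = sqrt(sum(p * p for p in du)) * sqrt(sum(q * q for q in dv))
    return dot / norm if norm else 0.0


print("cosine :", cosine(a, b), cosine(a, c))
print("pearson:", pearson(a, b), pearson(a, c))
```

On this toy data plain cosine rates both pairs as almost equally similar, because all ratings are positive and the vectors point in nearly the same direction. After centering, the same-taste pair correlates positively and the opposite-taste pair negatively, which is the effect the slide describes.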
