Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014
Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
● Demo
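Below is a minimal sketch of the RDD API listed on this slide (the live demo itself is not in the deck). The app name, input path, and data are placeholders, not taken from the original demo.

```scala
// Minimal RDD sketch: creating RDDs, running parallel operations, and
// using the caching/broadcast optimizations named on the slide.
// The app name, path, and data are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Creating RDDs: from a file on HDFS or from a local collection
    val lines   = sc.textFile("hdfs:///data/ratings.csv")   // placeholder path
    val numbers = sc.parallelize(1 to 1000)

    // Parallel operations: map, filter, groupBy, join, ...
    val evens   = numbers.filter(_ % 2 == 0)
    val keyed   = evens.map(n => (n % 10, n * n))
    val grouped = keyed.groupByKey()

    // Optimizations: cache reused RDDs, broadcast shared read-only data
    grouped.cache()
    val lookup = sc.broadcast(Map(0 -> "zero", 2 -> "two"))

    println(grouped.count())
    println(lookup.value.getOrElse(2, "?"))

    sc.stop()
  }
}
```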
What are recommendation algorithms?
● Problem:
○ “Information overload”
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories
Motivation
● Large-scale recommender systems
○ Millions of users and items (100m+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity
Collaborative Filtering
● Similarity-based approach
● Two main variants:
○ User-based
○ Item-based
[Figure: example user-item rating matrix for Shawn, Billy, and Mary, with unknown ratings shown as "?"]
User-based Collaborative Filtering
● Step 1:
Obtain the user-item rating matrix, denoted M(i, j)
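A shell-style sketch of Step 1, assuming the ratings arrive as "userId,itemId,rating" lines; the path, schema, and the helper name buildUserVectors are illustrative, not from the talk.

```scala
// Build a sparse representation of the user-item matrix M(i, j):
// one (userId -> item ratings) entry per user.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def buildUserVectors(sc: SparkContext, path: String): RDD[(Int, Map[Int, Double])] =
  sc.textFile(path)
    .map { line =>
      val Array(user, item, rating) = line.split(",")   // assumed CSV layout
      (user.toInt, (item.toInt, rating.toDouble))
    }
    .groupByKey()          // gather each user's (item, rating) pairs
    .mapValues(_.toMap)    // sparse rating vector for that user

// sc is the spark-shell context; the path is a placeholder
val userVectors = buildUserVectors(sc, "hdfs:///data/ratings.csv")
```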
User-based Collaborative Filtering
● Step 2:
Calculate the similarity between the rating vectors of pairwise users and compute the top-n nearest neighbors for each user
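A sketch of Step 2 under the same shell-style assumptions: cosine similarity between the rating vectors of pairwise users, then the top-n nearest neighbors per user. The all-pairs cartesian product is the naive formulation; a production job would first join users on co-rated items to prune pairs with nothing in common.

```scala
import org.apache.spark.rdd.RDD

// Cosine similarity between two sparse rating vectors.
def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  val dot   = a.keySet.intersect(b.keySet).toSeq.map(i => a(i) * b(i)).sum
  val normA = math.sqrt(a.values.map(x => x * x).sum)
  val normB = math.sqrt(b.values.map(x => x * x).sum)
  if (normA == 0 || normB == 0) 0.0 else dot / (normA * normB)
}

// Top-n most similar users for every user.
def topNeighbors(userVectors: RDD[(Int, Map[Int, Double])],
                 n: Int): RDD[(Int, Seq[(Int, Double)])] =
  userVectors.cartesian(userVectors)
    .filter { case ((u, _), (v, _)) => u != v }
    .map    { case ((u, uVec), (v, vVec)) => (u, (v, cosine(uVec, vVec))) }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(n))

val neighbors = topNeighbors(userVectors, 50)   // n = 50 is an arbitrary choice
```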
User-based Collaborative Filtering
● Step 3:
Compute the weighted average of the neighbors' ratings and find the top-n items by recommendation score:

score(u, i) = mean(u) + Σ_v sim(u, v) · (r(v, i) − mean(v)) / Σ_v |sim(u, v)|

where score(u, i) is the recommendation score of item i for user u, sim(u, v) are the pairwise user similarities, mean(·) is a user's mean rating, and r(v, i) is the co-rated user rating from neighbor v
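A sketch of Step 3 under the formula above, reusing the userVectors and neighbors RDDs from the previous sketches. Collecting and broadcasting the per-user means is a simplification for illustration; a very large user set would need a join instead.

```scala
import org.apache.spark.rdd.RDD

def recommend(userVectors: RDD[(Int, Map[Int, Double])],
              neighbors: RDD[(Int, Seq[(Int, Double)])],
              n: Int): RDD[(Int, Seq[(Int, Double)])] = {

  // Per-user mean rating and rating vector, broadcast to the workers
  // (see "broadcast variables" under Lessons Learned).
  val means  = userVectors.mapValues(v => (v.values.sum / v.size, v)).collectAsMap()
  val meansB = neighbors.sparkContext.broadcast(means)

  neighbors.map { case (u, nbrs) =>
    val (uMean, uVec) = meansB.value(u)
    // Similarity-weighted, mean-centered contributions per unrated item.
    val contribs = nbrs.flatMap { case (v, sim) =>
      val (vMean, vVec) = meansB.value(v)
      vVec.collect { case (item, r) if !uVec.contains(item) =>
        (item, (sim * (r - vMean), math.abs(sim)))
      }
    }
    // score(u, i) = mean(u) + sum(sim * centered rating) / sum(|sim|)
    val scores = contribs.groupBy(_._1).map { case (item, cs) =>
      val num = cs.map(_._2._1).sum
      val den = cs.map(_._2._2).sum
      (item, if (den == 0) uMean else uMean + num / den)
    }
    (u, scores.toSeq.sortBy(-_._2).take(n))
  }
}

val recommendations = recommend(userVectors, neighbors, 10)   // top-10 per user
```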
Results
[Figures: results on a standalone cluster and on an Amazon EC2 cluster]
Evaluation
Lessons Learned
● Must manually specify number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache data that will be reused
● Must account for the “power users”
○ Sample heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
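A small sketch of that last point: Pearson correlation over the items two users have co-rated. Mean-centering each user compensates for individual rating scales, which plain cosine similarity ignores; it could drop into the Step 2 sketch in place of cosine.

```scala
// Pearson correlation between two users' rating vectors,
// computed over co-rated items only.
def pearson(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  val common = a.keySet.intersect(b.keySet).toSeq
  if (common.size < 2) 0.0
  else {
    val meanA = common.map(a).sum / common.size
    val meanB = common.map(b).sum / common.size
    val num = common.map(i => (a(i) - meanA) * (b(i) - meanB)).sum
    val den = math.sqrt(common.map(i => math.pow(a(i) - meanA, 2)).sum) *
              math.sqrt(common.map(i => math.pow(b(i) - meanB, 2)).sum)
    if (den == 0) 0.0 else num / den
  }
}
```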
