SlideShare a Scribd company logo
1 of 56
Download to read offline
May 9, 2014
Collaborative
Filtering with Spark
Chris Johnson
@MrChrisJohnson
Friday, May 9, 14
Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin
Friday, May 9, 14
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
4
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14
Collaborative Filtering - “The Netflix Prize”
5
Friday, May 9, 14
Collaborative Filtering
6
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
Friday, May 9, 14
7
Collaborative Filtering at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
Friday, May 9, 14
Section name 8
Friday, May 9, 14
Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
Friday, May 9, 14
• = bias for user
• = bias for item
• = regularization parameter
Explicit Matrix Factorization 10
Chris
Inception
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
•Approximate ratings matrix by the product of low-
dimensional user and movie matrices
•Minimize RMSE (root mean squared error)
• = user rating for movie
• = user latent factor vector
• = item latent factor vector
X Y
Friday, May 9, 14
Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Replace Stream counts with binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a
function of stream counts as weights
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• =i tem latent factor vector
X Y
Friday, May 9, 14
Alternating Least Squares 12
• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of loss function with respect to user’s vector, set
equal to 0, and solve
– Results in a system of linear equations with closed form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Alternating Least Squares 13
• Note that:
• Then, we can pre-compute once per iteration
– and only contain non-zero elements for tracks that
the user streamed
– Using sparse matrix operations we can then compute each user’s
vector efficiently in time where is the number of
tracks the user streamed
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
14
Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14
Section name 15
Friday, May 9, 14
Scaling up Implicit Matrix Factorization
with Hadoop
16
Friday, May 9, 14
Hadoop at Spotify 2009
17
Friday, May 9, 14
Hadoop at Spotify 2014
18
700 Nodes in our London data center
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
19
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
Friday, May 9, 14
Implicit Matrix Factorization with Hadoop
20
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
Friday, May 9, 14
Hadoop suffers from I/O overhead
21
IO Bottleneck
Friday, May 9, 14
Spark to the rescue!!
22
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
Friday, May 9, 14
Section name 23
Friday, May 9, 14
First Attempt
24
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
First Attempt
25
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
26
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
node 2 node 3 node 4 node 5
Friday, May 9, 14
First Attempt
27
Friday, May 9, 14
First Attempt
28
•Issues:
–Unnecessarily sending multiple copies of item vector to each node
–Unnecessarily shuffling data across cluster at each iteration
–Not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
Second Attempt
29
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
30
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
31
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14
Second Attempt
32
Friday, May 9, 14
Second Attempt
33
Friday, May 9, 14
Second Attempt
34
•Issues:
–Still Unnecessarily shuffling data across cluster at each iteration
–Still not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14
So, what are we missing?...
35
•Partitioner: Defines how the elements in a key-value pair RDD are
partitioned across the cluster.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
36
•partitionBy(partitioner): Partitions all elements of the same key to the
same node in the cluster, as defined by the partitioner.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
So, what are we missing?...
37
•mapPartitions(func): Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type Iterator[T] =>
Iterator[U] when running on an RDD of type T.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
function()
function()
function()
Friday, May 9, 14
So, what are we missing?...
38
• persist(storageLevel): Set this RDD's storage level to persist (cache)
its values across operations after the first time it is computed.
node 1 node 2 node 3 node 4 node 5 node 6
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14
Third Attempt
39
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
40
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
41
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
42
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
43
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third Attempt
44
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
• For each iteration:
– Compute YtY over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14
Third attempt
45
Friday, May 9, 14
Third attempt
46
Friday, May 9, 14
Third attempt
47
Friday, May 9, 14
ALS Running Times
48
Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Friday, May 9, 14
Section name 49
Fin
Friday, May 9, 14
Section name 50
Friday, May 9, 14
Section name 51
Friday, May 9, 14
Section name 52
Friday, May 9, 14
Section name 53
Friday, May 9, 14
Section name 54
Friday, May 9, 14
Section name 55
Friday, May 9, 14
Section name 56
Friday, May 9, 14

More Related Content

What's hot

Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisitedXavier Amatriain
 
Recommendation Systems Basics
Recommendation Systems BasicsRecommendation Systems Basics
Recommendation Systems BasicsJarin Tasnim Khan
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
Recommendation system
Recommendation systemRecommendation system
Recommendation systemAkshat Thakar
 
Knowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsKnowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsEnrico Palumbo
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringViet-Trung TRAN
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Faisal Siddiqi
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsJustin Basilico
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Massimo Quadrana
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Recommender systems using collaborative filtering
Recommender systems using collaborative filteringRecommender systems using collaborative filtering
Recommender systems using collaborative filteringD Yogendra Rao
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...Balázs Hidasi
 
Recommendation engines
Recommendation enginesRecommendation engines
Recommendation enginesGeorgian Micsa
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorialAlexandros Karatzoglou
 
Multi Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsMulti Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsVaibhav Singh
 
Modern Recommendation for Advanced Practitioners
Modern Recommendation for Advanced PractitionersModern Recommendation for Advanced Practitioners
Modern Recommendation for Advanced PractitionersFlavian Vasile
 

What's hot (20)

Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisited
 
Recommendation Systems Basics
Recommendation Systems BasicsRecommendation Systems Basics
Recommendation Systems Basics
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
 
Knowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsKnowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender Systems
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filtering
 
Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019Netflix talk at ML Platform meetup Sep 2019
Netflix talk at ML Platform meetup Sep 2019
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Recommender systems using collaborative filtering
Recommender systems using collaborative filteringRecommender systems using collaborative filtering
Recommender systems using collaborative filtering
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
 
Recommendation engines
Recommendation enginesRecommendation engines
Recommendation engines
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
 
Multi Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsMulti Task Learning for Recommendation Systems
Multi Task Learning for Recommendation Systems
 
Modern Recommendation for Advanced Practitioners
Modern Recommendation for Advanced PractitionersModern Recommendation for Advanced Practitioners
Modern Recommendation for Advanced Practitioners
 

Viewers also liked

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2njit-ronbrown
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization MachinesPavel Kalaidin
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Domonkos Tikk
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorizationrubyyc
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFMLiangjie Hong
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix FactorizationTatsuya Yokota
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsAladejubelo Oluwashina
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومfaradars
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringDKALab
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 

Viewers also liked (15)

Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2Lecture 6   lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
Lecture 6 lu factorization & determinants - section 2-5 2-7 3-1 and 3-2
 
Intro to Factorization Machines
Intro to Factorization MachinesIntro to Factorization Machines
Intro to Factorization Machines
 
Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...Neighbor methods vs matrix factorization - case studies of real-life recommen...
Neighbor methods vs matrix factorization - case studies of real-life recommen...
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Factorization Machines with libFM
Factorization Machines with libFMFactorization Machines with libFM
Factorization Machines with libFM
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix Factorization
 
Matrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender SystemsMatrix Factorization Technique for Recommender Systems
Matrix Factorization Technique for Recommender Systems
 
آموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دومآموزش محاسبات عددی - بخش دوم
آموزش محاسبات عددی - بخش دوم
 
Introduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative FilteringIntroduction to Matrix Factorization Methods Collaborative Filtering
Introduction to Matrix Factorization Methods Collaborative Filtering
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 

Similar to Spark Collaborative Filtering with Alternating Least Squares

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Sc Huang
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfAlphaIssaghaDiallo
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part IIQuantUniversity
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsErik Bernhardsson
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzikitzik cohen
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network Cigdem Aslay
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systemsfuchaoqun
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systemsNAVER Engineering
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptxJasonTuran2
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 

Similar to Spark Collaborative Filtering with Alternating Least Squares (20)

CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
Random Walk by User Trust and Temporal Issues toward Sparsity Problem in Soci...
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
IntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdfIntroductionRecommenderSystems_Petroni.pdf
IntroductionRecommenderSystems_Petroni.pdf
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part IIDeep learning Tutorial - Part II
Deep learning Tutorial - Part II
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Matrix Factorization
Matrix FactorizationMatrix Factorization
Matrix Factorization
 
Final Presentation - Edan&Itzik
Final Presentation - Edan&ItzikFinal Presentation - Edan&Itzik
Final Presentation - Edan&Itzik
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network
 
Music Recommender Systems
Music Recommender SystemsMusic Recommender Systems
Music Recommender Systems
 
Recent advances in deep recommender systems
Recent advances in deep recommender systemsRecent advances in deep recommender systems
Recent advances in deep recommender systems
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 

Recently uploaded

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 

Recently uploaded (20)

Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 

Spark Collaborative Filtering with Alternating Least Squares

  • 1. May 9, 2014 Collaborative Filtering with Spark Chris Johnson @MrChrisJohnson Friday, May 9, 14
  • 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a graduate student at UT Austin Friday, May 9, 14
  • 3. 3 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 4. 4 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
  • 5. Collaborative Filtering - “The Netflix Prize” 5 Friday, May 9, 14
  • 6. Collaborative Filtering 6 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Image via Erik Bernhardsson Friday, May 9, 14
  • 7. 7 Collaborative Filtering at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing Friday, May 9, 14
  • 9. Explicit Matrix Factorization 9 Movies Users Chris Inception •Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies Friday, May 9, 14
  • 10. • = bias for user • = bias for item • = regularization parameter Explicit Matrix Factorization 10 Chris Inception ? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4 •Approximate ratings matrix by the product of low- dimensional user and movie matrices •Minimize RMSE (root mean squared error) • = user rating for movie • = user latent factor vector • = item latent factor vector X Y Friday, May 9, 14
  • 11. Implicit Matrix Factorization 11 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 •Replace Stream counts with binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights • = bias for user • = bias for item • = regularization parameter • = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector X Y Friday, May 9, 14
  • 12. Alternating Least Squares 12 • Initialize user and item vectors to random noise • Fix item vectors and solve for optimal user vectors – Take the derivative of loss function with respect to user’s vector, set equal to 0, and solve – Results in a system of linear equations with closed form solution! • Fix user vectors and solve for optimal item vectors • Repeat until convergence code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 13. Alternating Least Squares 13 • Note that: • Then, we can pre-compute once per iteration – and only contain non-zero elements for tracks that the user streamed – Using sparse matrix operations we can then compute each user’s vector efficiently in time where is the number of tracks the user streamed code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 14. 14 Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
  • 16. Scaling up Implicit Matrix Factorization with Hadoop 16 Friday, May 9, 14
  • 17. Hadoop at Spotify 2009 17 Friday, May 9, 14
  • 18. Hadoop at Spotify 2014 18 700 Nodes in our London data center Friday, May 9, 14
  • 19. Implicit Matrix Factorization with Hadoop 19 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1 Figure via Erik Bernhardsson Friday, May 9, 14
  • 20. Implicit Matrix Factorization with Hadoop 20 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector! Figure via Erik Bernhardsson Friday, May 9, 14
  • 21. Hadoop suffers from I/O overhead 21 IO Bottleneck Friday, May 9, 14
  • 22. Spark to the rescue!! 22 Vs http://www.slideshare.net/Hadoop_Summit/spark-and-shark Spark Hadoop Friday, May 9, 14
  • 24. First Attempt 24 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 25. First Attempt 25 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 26. First Attempt 26 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
  • 28. First Attempt 28 •Issues: –Unnecessarily sending multiple copies of item vector to each node –Unnecessarily shuffling data across cluster at each iteration –Not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 29. Second Attempt 29 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 30. Second Attempt 30 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 31. Second Attempt 31 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
  • 34. Second Attempt 34 •Issues: –Still Unnecessarily shuffling data across cluster at each iteration –Still not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
  • 35. So, what are we missing?... 35 •Partitioner: Defines how the elements in a key-value pair RDD are partitioned across the cluster. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 36. So, what are we missing?... 36 •partitionBy(partitioner): Partitions all elements of the same key to the same node in the cluster, as defined by the partitioner. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 37. So, what are we missing?... 37 •mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 function() function() function() Friday, May 9, 14
  • 38. So, what are we missing?... 38 • persist(storageLevel): Set this RDD's storage level to persist (cache) its values across operations after the first time it is computed. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
  • 39. Third Attempt 39 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 40. Third Attempt 40 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 41. Third Attempt 41 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 42. Third Attempt 42 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 43. Third Attempt 43 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 44. Third Attempt 44 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
  • 48. ALS Running Times 48 Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf Friday, May 9, 14