Spark Collaborative Filtering with Alternating Least Squares

May 9, 2014
Collaborative
Filtering with Spark
Chris Johnson
@MrChrisJohnson
Friday, May 9, 14

Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin
Friday, May 9, 14

3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14

4
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principle
component analysis (PCA
• collaborative filtering: alternating least squares (ALS)
http://spark.apache.org/docs/0.9.0/mllib-guide.html
Friday, May 9, 14

Collaborative Filtering - “The Netflix Prize”
5
Friday, May 9, 14

Collaborative Filtering
6
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
Friday, May 9, 14

7
Collaborative Filtering at Spotify
• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing
Friday, May 9, 14

Section name 8
Friday, May 9, 14

Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog
•Goal: predict how users will rate new movies
Friday, May 9, 14

• = bias for user
• = bias for item
• = regularization parameter
Explicit Matrix Factorization 10
Chris
Inception
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
•Approximate ratings matrix by the product of low-
dimensional user and movie matrices
•Minimize RMSE (root mean squared error)
• = user rating for movie
• = user latent factor vector
• = item latent factor vector
X Y
Friday, May 9, 14

Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1
0 0 1 0 0 1 0 0
1 0 1 0 0 0 1 1
0 1 0 0 0 1 0 0
0 0 1 0 0 1 0 0
1 0 0 0 1 0 0 1
•Replace Stream counts with binary labels
– 1 = streamed, 0 = never streamed
•Minimize weighted RMSE (root mean squared error) using a
function of stream counts as weights
• = bias for user
• = bias for item
• = regularization parameter
• = 1 if user streamed track else 0
•
• = user latent factor vector
• =i tem latent factor vector
X Y
Friday, May 9, 14

Alternating Least Squares 12
• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of loss function with respect to user’s vector, set
equal to 0, and solve
– Results in a system of linear equations with closed form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
code: https://github.com/MrChrisJohnson/implicitMF
Friday, May 9, 14

Alternating Least Squares 13
• Note that:
• Then, we can pre-compute once per iteration
– and only contain non-zero elements for tracks that
the user streamed
– Using sparse matrix operations we can then compute each user’s
vector efficiently in time where is the number of
tracks the user streamed
Friday, May 9, 14

14
Alternating Least Squares
Friday, May 9, 14

Section name 15
Friday, May 9, 14

Scaling up Implicit Matrix Factorization
with Hadoop
16
Friday, May 9, 14

Hadoop at Spotify 2009
17
Friday, May 9, 14

Hadoop at Spotify 2014
18
700 Nodes in our London data center
Friday, May 9, 14

Implicit Matrix Factorization with Hadoop
19
Reduce stepMap step
u % K = 0
i % L = 0
u % K = 0
i % L = 1
...
u % K = 0
i % L = L-1
u % K = 1
i % L = 0
u % K = 1
i % L = 1
... ...
... ... ... ...
u % K = K-1
i % L = 0
... ...
u % K = K-1
i % L = L-1
item vectors
item%L=0
item vectors
item%L=1
item vectors
i % L = L-1
user vectors
u % K = 0
user vectors
u % K = 1
user vectors
u % K = K-1
all log entries
u % K = 1
i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
Friday, May 9, 14

Implicit Matrix Factorization with Hadoop
20
One map task
Distributed
cache:
All user vectors
where u % K = x
Distributed
cache:
All item vectors
where i % L = y
Mapper Emit contributions
Map input:
tuples (u, i, count)
where
u % K = x
and
i % L = y
Reducer New vector!
Figure via Erik Bernhardsson
Friday, May 9, 14

Hadoop suffers from I/O overhead
21
IO Bottleneck
Friday, May 9, 14

Spark to the rescue!!
22
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
Friday, May 9, 14

Section name 23
Friday, May 9, 14

First Attempt
24
ratings user vectors item vectors
node 1 node 2 node 3 node 4 node 5 node 6
• For each iteration:
– Compute YtY over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for
which the user rated the item
– Sum up YtCuIY and YtCuPu and solve for optimal user vectors
Friday, May 9, 14

First Attempt
25
node 2 node 3 node 4 node 5
Friday, May 9, 14

First Attempt
26
node 2 node 3 node 4 node 5
Friday, May 9, 14

First Attempt
27
Friday, May 9, 14

First Attempt
28
•Issues:
–Unnecessarily sending multiple copies of item vector to each node
–Unnecessarily shuffling data across cluster at each iteration
–Not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14

Second Attempt
29
– Group ratings matrix into blocks, and join blocks with necessary user and
item vectors (to avoid multiple item vector copies at each node)
Friday, May 9, 14

Second Attempt
30
Friday, May 9, 14

Second Attempt
31
Friday, May 9, 14

Second Attempt
32
Friday, May 9, 14

Second Attempt
33
Friday, May 9, 14

Second Attempt
34
•Issues:
–Still Unnecessarily shuffling data across cluster at each iteration
–Still not taking advantage of Spark’s in memory capabilities!
Friday, May 9, 14

So, what are we missing?...
35
•Partitioner: Defines how the elements in a key-value pair RDD are
partitioned across the cluster.
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14

36
•partitionBy(partitioner): Partitions all elements of the same key to the
same node in the cluster, as defined by the partitioner.
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14

37
•mapPartitions(func): Similar to map, but runs separately on each
partition (block) of the RDD, so func must be of type Iterator[T] =>
Iterator[U] when running on an RDD of type T.
user vectors
partition 1
partition 2
partition 3
function()
function()
function()
Friday, May 9, 14

38
• persist(storageLevel): Set this RDD's storage level to persist (cache)
its values across operations after the first time it is computed.
user vectors
partition 1
partition 2
partition 3
Friday, May 9, 14

Third Attempt
39
• Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory
• Build InLink and OutLink mappings for users and items
– InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block
– OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send
these vectors
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors
Friday, May 9, 14

Third Attempt
40
these vectors
Friday, May 9, 14

Third Attempt
41
these vectors
Friday, May 9, 14

Third Attempt
42
these vectors
Friday, May 9, 14

Third Attempt
43
these vectors
Friday, May 9, 14

Third Attempt
44
these vectors
Friday, May 9, 14

Third attempt
45
Friday, May 9, 14

Third attempt
46
Friday, May 9, 14

Third attempt
47
Friday, May 9, 14

ALS Running Times
48
Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Friday, May 9, 14

Section name 49
Fin
Friday, May 9, 14

Section name 50
Friday, May 9, 14

Section name 51
Friday, May 9, 14

Section name 52
Friday, May 9, 14

Section name 53
Friday, May 9, 14

Section name 54
Friday, May 9, 14

Section name 55
Friday, May 9, 14

Section name 56
Friday, May 9, 14

Spark Collaborative Filtering with Alternating Least Squares

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Spark Collaborative Filtering with Alternating Least Squares

Similar to Spark Collaborative Filtering with Alternating Least Squares (20)

Recently uploaded

Recently uploaded (20)

Spark Collaborative Filtering with Alternating Least Squares