Your SlideShare is downloading.
×

- 1. May 9, 2014 Collaborative Filtering with Spark Chris Johnson @MrChrisJohnson Friday, May 9, 14
- 2. Who am I?? •Chris Johnson – Machine Learning guy from NYC – Focused on music recommendations – Formerly a graduate student at UT Austin Friday, May 9, 14
- 3. 3 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
- 4. 4 What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive bayes • regression: generalized linear regression • clustering: k-means • decomposition: singular value decomposition (SVD), principle component analysis (PCA • collaborative filtering: alternating least squares (ALS) http://spark.apache.org/docs/0.9.0/mllib-guide.html Friday, May 9, 14
- 5. Collaborative Filtering - “The Netflix Prize” 5 Friday, May 9, 14
- 6. Collaborative Filtering 6 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Image via Erik Bernhardsson Friday, May 9, 14
- 7. 7 Collaborative Filtering at Spotify • Discover (personalized recommendations) • Radio • Related Artists • Now Playing Friday, May 9, 14
- 8. Section name 8 Friday, May 9, 14
- 9. Explicit Matrix Factorization 9 Movies Users Chris Inception •Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies Friday, May 9, 14
- 10. • = bias for user • = bias for item • = regularization parameter Explicit Matrix Factorization 10 Chris Inception ? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4 •Approximate ratings matrix by the product of low- dimensional user and movie matrices •Minimize RMSE (root mean squared error) • = user rating for movie • = user latent factor vector • = item latent factor vector X Y Friday, May 9, 14
- 11. Implicit Matrix Factorization 11 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 •Replace Stream counts with binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights • = bias for user • = bias for item • = regularization parameter • = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector X Y Friday, May 9, 14
- 12. Alternating Least Squares 12 • Initialize user and item vectors to random noise • Fix item vectors and solve for optimal user vectors – Take the derivative of loss function with respect to user’s vector, set equal to 0, and solve – Results in a system of linear equations with closed form solution! • Fix user vectors and solve for optimal item vectors • Repeat until convergence code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
- 13. Alternating Least Squares 13 • Note that: • Then, we can pre-compute once per iteration – and only contain non-zero elements for tracks that the user streamed – Using sparse matrix operations we can then compute each user’s vector efficiently in time where is the number of tracks the user streamed code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
- 14. 14 Alternating Least Squares code: https://github.com/MrChrisJohnson/implicitMF Friday, May 9, 14
- 15. Section name 15 Friday, May 9, 14
- 16. Scaling up Implicit Matrix Factorization with Hadoop 16 Friday, May 9, 14
- 17. Hadoop at Spotify 2009 17 Friday, May 9, 14
- 18. Hadoop at Spotify 2014 18 700 Nodes in our London data center Friday, May 9, 14
- 19. Implicit Matrix Factorization with Hadoop 19 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1 Figure via Erik Bernhardsson Friday, May 9, 14
- 20. Implicit Matrix Factorization with Hadoop 20 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector! Figure via Erik Bernhardsson Friday, May 9, 14
- 21. Hadoop suffers from I/O overhead 21 IO Bottleneck Friday, May 9, 14
- 22. Spark to the rescue!! 22 Vs http://www.slideshare.net/Hadoop_Summit/spark-and-shark Spark Hadoop Friday, May 9, 14
- 23. Section name 23 Friday, May 9, 14
- 24. First Attempt 24 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
- 25. First Attempt 25 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
- 26. First Attempt 26 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Join user vectors along with all ratings for that user and all item vectors for which the user rated the item – Sum up YtCuIY and YtCuPu and solve for optimal user vectors node 2 node 3 node 4 node 5 Friday, May 9, 14
- 27. First Attempt 27 Friday, May 9, 14
- 28. First Attempt 28 •Issues: –Unnecessarily sending multiple copies of item vector to each node –Unnecessarily shuffling data across cluster at each iteration –Not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
- 29. Second Attempt 29 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
- 30. Second Attempt 30 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
- 31. Second Attempt 31 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • For each iteration: – Compute YtY over item vectors and broadcast – Group ratings matrix into blocks, and join blocks with necessary user and item vectors (to avoid multiple item vector copies at each node) – Sum up YtCuIY and YtCuPu and solve for optimal user vectors Friday, May 9, 14
- 32. Second Attempt 32 Friday, May 9, 14
- 33. Second Attempt 33 Friday, May 9, 14
- 34. Second Attempt 34 •Issues: –Still Unnecessarily shuffling data across cluster at each iteration –Still not taking advantage of Spark’s in memory capabilities! Friday, May 9, 14
- 35. So, what are we missing?... 35 •Partitioner: Defines how the elements in a key-value pair RDD are partitioned across the cluster. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
- 36. So, what are we missing?... 36 •partitionBy(partitioner): Partitions all elements of the same key to the same node in the cluster, as defined by the partitioner. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
- 37. So, what are we missing?... 37 •mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 function() function() function() Friday, May 9, 14
- 38. So, what are we missing?... 38 • persist(storageLevel): Set this RDD's storage level to persist (cache) its values across operations after the first time it is computed. node 1 node 2 node 3 node 4 node 5 node 6 user vectors partition 1 partition 2 partition 3 Friday, May 9, 14
- 39. Third Attempt 39 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 40. Third Attempt 40 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 41. Third Attempt 41 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 42. Third Attempt 42 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 43. Third Attempt 43 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 44. Third Attempt 44 ratings user vectors item vectors node 1 node 2 node 3 node 4 node 5 node 6 • Partition ratings matrix, user vectors, and item vectors by user and item blocks and cache partitions in memory • Build InLink and OutLink mappings for users and items – InLink Mapping: Includes the user IDs and vectors for a given block along with the ratings for each user in this block – OutLink Mapping: Includes the item IDs and vectors for a given block along with a list of destination blocks for which to send these vectors • For each iteration: – Compute YtY over item vectors and broadcast – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks – On each user block, use the InLink mapping along with the joined item vectors to update vectors Friday, May 9, 14
- 45. Third attempt 45 Friday, May 9, 14
- 46. Third attempt 46 Friday, May 9, 14
- 47. Third attempt 47 Friday, May 9, 14
- 48. ALS Running Times 48 Via Xiangrui Meng (Databricks) http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf Friday, May 9, 14
- 49. Section name 49 Fin Friday, May 9, 14
- 50. Section name 50 Friday, May 9, 14
- 51. Section name 51 Friday, May 9, 14
- 52. Section name 52 Friday, May 9, 14
- 53. Section name 53 Friday, May 9, 14
- 54. Section name 54 Friday, May 9, 14
- 55. Section name 55 Friday, May 9, 14
- 56. Section name 56 Friday, May 9, 14